Content and procedure
More and more projects are using Apache Spark to implement their data pipelines. Thanks to its high-level APIs and its automatic distribution of computations across clusters, Spark has greatly simplified the processing of large amounts of data. However, distributed execution confronts developers with new challenges when implementing data pipelines in Apache Spark.
This workshop provides the background knowledge and best practices necessary to implement data pipelines with Apache Spark. Starting with an introduction to data pipelines, the participants will learn how to use Spark's DataFrame API and gain insight into how the Spark engine works. The workshop ends with a presentation of best practices to avoid common mistakes when developing data pipelines with Apache Spark.
The knowledge imparted is deepened in hands-on sessions. Participants are provided with an interactive exercise environment in the Databricks Cloud, in which they work through the exercises in small groups. The exercise notebooks used in the workshop, including sample solutions, will afterwards be made available to the participants. Because of its ease of use for beginners, we use the Python programming language for code examples and in the hands-on sessions.
- Trainer: Simon Kaltenbacher
- Language: English
- 16th of April 2018
- 10:00 – 17:15
- Data Hub, Sapporobogen 6-8, 80637 München
- Formulation of data pipelines with Spark's DataFrame API
- Advantages and disadvantages of Apache Spark compared to other technologies on the market
- Best practices for implementing data pipelines with Apache Spark
Simon Kaltenbacher is Head of Technology at Alexander Thamm GmbH. There he advises customers on establishing data platforms and supports them in implementing data pipelines. He has been following the Apache Spark project closely since version 0.9 and has already given several trainings and lectures on this technology.
10:00 – 10:30: Challenges and general techniques of data pipelines
10:30 – 11:15: Introduction to the Apache Spark project and its DataFrame API
11:15 – 11:30: Coffee break
11:30 – 13:00: Hands-on: implementation of data pipelines with Spark's DataFrame API
13:00 – 14:00: Lunch break
14:00 – 14:30: How the Spark engine works
14:30 – 15:15: Proven techniques
15:15 – 15:30: Coffee break
15:30 – 17:15: Hands-on: proven techniques
- Participants should have basic knowledge of the Python programming language.
- Prior experience with Apache Spark is advantageous, but not required.
- Each participant must bring their own computer. Every computer must be able to access the Internet via Wi-Fi, and a recent version of the Firefox or Chrome browser should be installed.