Content and procedure

More and more projects are using Apache Spark to implement their data pipelines. Thanks to its high-level APIs and the automated execution of calculations on computing clusters, Spark has greatly simplified the processing of large amounts of data. However, distributed execution presents developers with new challenges when implementing data pipelines in Apache Spark.

This workshop provides the background knowledge and best practices needed to implement data pipelines with Apache Spark. Starting with an introduction to data pipelines, participants will learn how to use Spark's DataFrame API and gain insight into how the Spark engine works. The workshop ends with a presentation of best practices for avoiding common mistakes when developing data pipelines with Apache Spark.

The material is reinforced in hands-on sessions. Participants are given access to an interactive Spark environment in the Databricks Cloud, where they work through the exercises in small groups. After the workshop, the Spark notebooks used, including sample solutions, will be made available to the participants. Because it is easy for beginners to pick up, we use the Python programming language for the code examples and the hands-on sessions.

Short Facts

  • Trainer: Simon Kaltenbacher
  • Language: English
  • 16th of April 2018
  • 10:00 – 17:15
  • Data Hub, Sapporobogen 6-8, 80637 München

Educational goals

Formulation of data pipelines with Spark's DataFrame API

Advantages and disadvantages of Apache Spark compared to other technologies on the market

Best practices for implementing data pipelines with Apache Spark

Trainer

Simon Kaltenbacher

is Head of Technology at Alexander Thamm GmbH, where he advises customers on establishing data platforms and supports them in implementing data pipelines. He has followed the Apache Spark project closely since version 0.9 and has given several training courses and talks on the technology.

Agenda

10:00 – 10:30: Challenges and general techniques of data pipelines

10:30 – 11:15: Introduction to the Apache Spark project and its DataFrame API

11:15 – 11:30: Coffee break

11:30 – 13:00: Hands-on implementation of data pipelines with Spark’s DataFrame API

13:00 – 14:00: Lunch break

14:00 – 14:30: How the Spark Engine works

14:30 – 15:15: Proven techniques

15:15 – 15:30: Coffee break

15:30 – 17:15: Hands-on session on proven techniques

Requirements

  • Participants should have basic knowledge of the Python programming language.
  • Prior experience with Apache Spark is helpful but not required.
  • Each participant must bring their own computer. It must be able to connect to the Internet via Wi-Fi, and the latest version of Firefox or Chrome should be installed.

Are you interested in this workshop?