Valentine Gogichashvili, Head of Engineering at Zalando SE, gave his talk “Data is the new Oil” at the premiere of the Data Festival 2018 in Munich. He also emphasized the crucial importance of Data Engineers in Data Science projects.
The Data Festival is a platform for knowledge exchange, discussion and new insights into Data Science and Machine Learning
The Data Festival is hosted and organized by the Data Science consulting company Alexander Thamm and the independent analyst firm BARC. It serves as a platform for the exchange and transfer of knowledge in the fields of Data Science and Machine Learning. Valentine Gogichashvili provided insights into how data is processed at Zalando. He also emphasized the importance of Data Engineers in the process of turning data into oil.
Terminology: Why Data is the new Oil
The first comparison of data to oil dates back to 2006. Clive Humby, Chief Data Scientist and Executive Director at Starcount Ltd., likened data to oil in that both are raw resources that gain value only once they are refined, that is, broken down and analyzed.
Another noteworthy quotation on the idea of data as oil comes from the Italian music journalist Piero Scaruffi in 2016:
“The difference between oil and data is that the product of oil does not generate more oil (unfortunately), whereas the product of data (self-driving cars, drones, wearables, etc.) will generate more data (where do you normally drive, how fast/well you drive, who is with you, etc.).”
In short, Scaruffi noted that data has one significant advantage over oil: while oil is a natural resource in limited supply, data gives rise to products that in turn generate new data, which further increases the amount of data available.
The Data Science Hierarchy of Needs
In his presentation, Valentine Gogichashvili describes a model called the Data Science Hierarchy of Needs. Its pyramid shape indicates that the needs at the bottom have to be fulfilled before the needs and goals of the higher layers can be achieved.
At the lowest layer, the main need is to collect data, for example by accessing external data or user-generated content. In the next step, data has to be both moved and stored, using options such as ETL, pipelines and storage for structured and unstructured data. Cleaning and preparing the data then satisfies the need for exploration and transformation.
Next, the need for labeling and aggregating involves analytics, metrics or training data. For learning and optimization, methods such as A/B testing, simple Machine Learning algorithms or experimentation can be applied. The goal of all the lower levels is to create the conditions for a successful implementation of AI and Deep Learning, which are both located at the top level. In many of these processes, the knowledge of Data Engineers is important.
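The ordering rule of the pyramid can be illustrated with a minimal Python sketch. The layer names and the function `highest_reachable_layer` are hypothetical labels chosen for this illustration and are not taken from the presentation. The sketch only shows that a higher layer counts as reachable once every layer below it is fulfilled.

```python
from typing import List, Optional, Set

# Layers from the bottom of the pyramid to the top, paraphrased from the text.
# The short labels are assumptions made for this example.
HIERARCHY: List[str] = [
    "collect",
    "move/store",
    "explore/transform",
    "aggregate/label",
    "learn/optimize",
    "ai/deep-learning",
]


def highest_reachable_layer(fulfilled: Set[str]) -> Optional[str]:
    """Return the highest layer whose lower layers are all fulfilled.

    Walks the pyramid bottom-up and stops at the first gap, so a fulfilled
    layer above a gap does not count. Returns None if even the lowest
    layer is missing.
    """
    reached: Optional[str] = None
    for layer in HIERARCHY:
        if layer not in fulfilled:
            break
        reached = layer
    return reached


if __name__ == "__main__":
    # Learning is attempted, but exploration and aggregation are missing,
    # so the team is effectively still at the "move/store" layer.
    print(highest_reachable_layer({"collect", "move/store", "learn/optimize"}))
```

In this toy model, investing in Machine Learning while data cleaning is still missing does not move a team up the pyramid, which is the point the model makes about the groundwork done by Data Engineers.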
Why we need more Data Engineers for success in harvesting the value of Data
Valentine Gogichashvili conducted interviews in his company to find out how satisfied the Data Scientists are. He found that up to 80% of the Data Scientists' working time is actually spent on Data Engineering tasks, which leads to a certain level of frustration among them. It also turned out that accessing data and tracking it in its entirety are the main issues Data Scientists face.
There are several logical consequences to be drawn from these insights:
First: Data Engineers are not Data Scientists.
Second: Data Engineers play a crucial role in Data Science.
Third: Unicorns (professionals who are experts in both Data Science and Data Engineering) are extremely rare.
So how can these two closely related professions be distinguished?
To explain the difference, Valentine Gogichashvili compares Data Engineers to plumbers: plumbers take elements such as pipes and connect them in order to conduct liquids through a system. Data Engineers do the same with technologies in place of pipes and data in place of liquids.
Because the technologies Data Engineers work with are very sophisticated, becoming an expert requires a lot of training and a profound understanding of complex technologies. Hence, Data Engineers have to be very knowledgeable and specialized. They are needed for the development of Big Data platforms, the selection of adequate technologies and the realization of successful use cases, and should therefore be part of every Data Science project that is to be executed successfully.
How Data is managed at Zalando SE
At Zalando, all teams take part in both data generation and data processing. A number of more technically oriented teams are responsible for consuming this data.
Valentine Gogichashvili ensured that all teams have a core infrastructure and can work autonomously. This autonomy is achieved through a Microservice Architecture. At this point, however, he faced the issue of data integration: Microservices do not allow Peer-to-Peer models. This problem led to the creation of the Nakadi Event Bus system, which is available as open source on GitHub.
So how does the Event Bus work?
Every Microservice can send the required data via the Event Bus. In the next step, the Data Lake and Infrastructure Teams, in cooperation with Data Engineers, extract the data and add it to the Data Lake.
Then, the data is provided to the data-heavy teams: Business Intelligence (BI), Machine Learning (ML) and Data-driven Decision Making (DDM). The DDM Team is responsible for monitoring core KPIs and for the recommendation system.
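The flow described above, in which Microservices publish events to a bus and the events end up in a Data Lake, can be sketched in a few lines of Python. This is a simplified in-memory illustration and not Nakadi's actual API: the classes `InMemoryEventBus` and `DataLake`, the event type names and the service names are all hypothetical and chosen only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Event:
    event_type: str            # hypothetical name, e.g. "order.created"
    source: str                # the Microservice that produced the event
    payload: Dict[str, Any]


class InMemoryEventBus:
    """Toy event bus: producers publish, subscribers are called per event type."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: Event) -> int:
        """Deliver the event to every subscriber and return how many received it."""
        handlers = self._subscribers.get(event.event_type, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


class DataLake:
    """Stands in for the Data Lake: it simply stores every event it ingests."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def ingest(self, event: Event) -> None:
        self.records.append(
            {"type": event.event_type, "source": event.source, "payload": event.payload}
        )


def run_demo() -> DataLake:
    bus = InMemoryEventBus()
    lake = DataLake()
    # The Data Lake subscribes to the event types it wants to collect.
    bus.subscribe("order.created", lake.ingest)
    bus.subscribe("article.viewed", lake.ingest)
    # Two independent Microservices publish without knowing about each other.
    bus.publish(Event("order.created", "checkout-service", {"order_id": 1}))
    bus.publish(Event("article.viewed", "catalog-service", {"article_id": "A-42"}))
    return lake


if __name__ == "__main__":
    for record in run_demo().records:
        print(record)
```

The design point of the sketch is decoupling: the publishing services only know the bus, so a consumer such as the Data Lake can be added without changing any producer. A production event bus would additionally need persistence, schemas and access control, which are left out here.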
Valentine's complete presentation about the role of Data and Data Engineers at Zalando is available here.