One of the central challenges in providing a Data Lake architecture in a multi-account cloud setup on AWS is offering a consistent, familiar, and easy-to-use access mechanism that hides the nitty-gritty details of varying data formats and data sources while making the data conveniently available for analysis. A proven abstraction that meets these requirements is a set of relations (tables) that enable analyses using the well-established Structured Query Language (SQL). To accomplish this, a metastore is necessary that manages all the table metadata in one central place. This talk is about how we build such a central metastore, which can be accessed by an arbitrary number of Elastic MapReduce clusters spread over many accounts belonging to the company and running different kinds of analysis workloads. We will go into the technical details of the implementation and also show the benefits of such a solution from the perspective of different stakeholders.
Raffael Dzikowski, Big Data Engineer, Scout24 Group