Challenges in dealing with external data

External data are gaining importance

The use of external data is quite common. A survey conducted as part of the BARC Advanced & Predictive Analytics User Study in 2017 showed that, in addition to internal data, two-thirds of the companies also use external data sources for analytics. Around 30 percent of the companies even buy external data.

Particularly relevant types of external data are weather data, spatial data, social media data, weblog data and demographic data. For these data types there are a variety of sources, interesting use cases and various technical solutions that make working with the data easier.

Challenges in dealing with external data

Once an appropriate use case and the corresponding data types have been identified, the next step is to determine how the data will be procured, processed, analyzed and integrated into the company’s data landscape. The most important challenges are therefore identifying relevant sources, technical access, storage, integration and analysis.

Open data are one relevant source. These are data that are available free of charge and whose use is not subject to restrictions. They include, above all, demographic and spatial data provided by local governments, countries and universities. Another source is data provided by companies which, although proprietary, are released free of charge for use by third parties under certain conditions. Here it must be borne in mind that the free use of such data is subject to usage restrictions. Data markets, by contrast, offer external data for a fee. Examples are Quandl and Qlik DataMarket.

Once the necessary external data have been identified, they must be accessed and integrated technically. Some software packages have access to certain public data sources built in. Data integration tools, such as those from Talend or Informatica, offer connectors to social media sources; the advanced analytics platform RapidMiner contains a connector to a linked open data project; and Microsoft Azure ML is tightly integrated with its own marketplace. Oracle also provides a multitude of data via its Oracle Data Cloud.

Storing external data can be difficult because the data are often polystructured, large in volume or subject to short update cycles. A further complication is that analyses often require not only the current state of the data but also their history. Most data providers have recognized this challenge and also offer data storage services to companies.
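The need to keep a history rather than only the current state can be illustrated with a minimal sketch. It assumes that an external value is fetched repeatedly and that each fetch arrives in chronological order; the function names, the key and the values are hypothetical and only serve as an illustration, not as any provider's actual storage scheme.

```python
from datetime import datetime, timezone

def record_snapshot(history, key, value, fetched_at):
    """Append a new version for `key` only if the value has changed.

    Assumes snapshots are recorded in chronological order.
    Returns True if a new version was stored, False otherwise.
    """
    versions = history.setdefault(key, [])
    if versions and versions[-1]["value"] == value:
        return False  # unchanged since the last fetch, nothing to store
    versions.append({"value": value, "fetched_at": fetched_at})
    return True

def value_as_of(history, key, moment):
    """Return the value that was current at `moment`, or None if unknown."""
    current = None
    for version in history.get(key, []):
        if version["fetched_at"] <= moment:
            current = version["value"]
        else:
            break  # later versions are not yet valid at `moment`
    return current

# Made-up example: an external indicator fetched on three days.
history = {}
day1 = datetime(2017, 3, 1, tzinfo=timezone.utc)
day2 = datetime(2017, 3, 2, tzinfo=timezone.utc)
day3 = datetime(2017, 3, 3, tzinfo=timezone.utc)
record_snapshot(history, "indicator-a", 10, day1)
record_snapshot(history, "indicator-a", 10, day2)  # unchanged, not stored
record_snapshot(history, "indicator-a", 12, day3)
```

Storing only changed values keeps the history small when update cycles are short, while `value_as_of` still allows an analysis to reconstruct the state at any earlier point in time.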

To analyze data from various sources, a homogeneous data stream must be created from heterogeneous data sources. On the one hand, this means that different, heterogeneous external data sources must be integrated with one another; on the other, internal data must be matched with and enriched by external data. Several companies offer support here as well by providing data that have already been integrated. Depending on the company and application, the focus is on different data types.
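Matching and enriching internal data with external data can be sketched as follows. The example assumes internal sales records keyed by day and region and an external weather feed that uses different field names and formats; all records, field names and functions are hypothetical and chosen purely for illustration.

```python
from datetime import date

# Hypothetical internal records: daily revenue per sales region.
internal_sales = [
    {"day": date(2017, 7, 1), "region": "north", "revenue": 1200.0},
    {"day": date(2017, 7, 1), "region": "south", "revenue": 950.0},
    {"day": date(2017, 7, 2), "region": "north", "revenue": 1430.0},
]

# Hypothetical external records: the date is a string and the region
# is spelled differently, so the keys must be harmonized first.
external_weather = [
    {"date": "2017-07-01", "area": "North", "temp_c": 24.5},
    {"date": "2017-07-02", "area": "North", "temp_c": 29.0},
]

def normalize_weather(row):
    """Map an external row onto the internal key format (day, region)."""
    key = (date.fromisoformat(row["date"]), row["area"].lower())
    return key, row["temp_c"]

def enrich(sales, weather):
    """Left-join sales with weather; unmatched rows get temp_c=None."""
    lookup = dict(normalize_weather(w) for w in weather)
    return [
        {**s, "temp_c": lookup.get((s["day"], s["region"]))}
        for s in sales
    ]

enriched = enrich(internal_sales, external_weather)
```

The join is deliberately a left join: every internal record is kept, and a missing external match shows up as `None` instead of silently dropping the row.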

To draw findings from external data, these data must be analyzed in conjunction with internal data. Alongside visual analyses, data mining methods are used to identify customer clusters, influencing factors and purchase probabilities, or to predict volumes. The functions offered by an advanced analytics platform can be used for this. Beyond that, a variety of software solutions offer ready-made analyses for specific use cases.
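As a simple illustration of screening an external variable as a possible influencing factor, the following sketch computes the Pearson correlation between an external series and an internal one. This is only one basic technique chosen for the example, not a method the study prescribes; the function name and the sample values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Made-up values: daily temperature (external) and units sold (internal).
temperature = [18.0, 21.0, 24.0, 27.0, 30.0]
units_sold = [100, 130, 150, 190, 210]
r = pearson(temperature, units_sold)
```

A coefficient close to 1 or -1 suggests a strong linear relationship worth examining further; it does not by itself establish that the external variable causes the internal one.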