Transparent Analytics

What is Data Integration?

Data integration combines data from disparate sources, provides the user with a unified view and prepares the data for further analysis. The abbreviation 'ETL', short for extract, transform and load, represents the main tasks of most data integration problems (a minimal code sketch follows the list):

  1. Extract the relevant data from one or several data sources.
  2. Transform the data – clean, patch, compute derived or aggregate data, link data sources, enrich data, etc.
  3. Load the transformed data into a system for its intended use – this may be a simple database, a fancy web interface, or a bunch of Excel files.
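
As a minimal illustration of these three steps, the following Python sketch extracts records from a CSV file, applies a simple transformation and loads the result into an SQLite database. The file name, column names and target table are hypothetical, chosen only for illustration.

    import csv
    import sqlite3

    # Extract: read raw records from a source file (file and column names
    # are hypothetical).
    with open("customers_raw.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: clean values and compute a derived column.
    for row in rows:
        row["email"] = row["email"].strip().lower()
        row["full_name"] = f"{row['first_name']} {row['last_name']}"

    # Load: write the cleaned records into the target database.
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS customers (email TEXT, full_name TEXT)")
    conn.executemany(
        "INSERT INTO customers (email, full_name) VALUES (?, ?)",
        [(r["email"], r["full_name"]) for r in rows],
    )
    conn.commit()
    conn.close()

In practice the three steps are rarely this simple, which is where a pipeline framework comes in.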

ETL is a hard problem. There is no single best toolset for implementing an ETL solution, primarily because there is so much variation in data sources, data quality and intended use cases.

Why should I care?

Does your organization have several IT systems, with no way to query data across all of them? Would you like to merge data from a range of public data sources to do some interesting research? Do you have to run 17 scripts in just the right order to import all your data, and would you like a more structured implementation of your data processing pipeline? Do you have data lying around in hard-to-use formats, and would you like a nicely structured representation?

These are a few situations in which an ETL solution, as described on this page, may be useful.

Ozelot to the rescue

This page is about building maintainable data integration pipelines for everyday projects. The proposed setup is embodied by Ozelot, a Python library for ETL. It is based on Luigi for pipeline management and SQLAlchemy for the ORM layer.
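
To give a flavor of the pipeline side, here is a minimal Luigi sketch (plain Luigi, not Ozelot's own task classes) with two tasks, where the second declares a dependency on the first. Task names and file targets are assumptions made for this example.

    import luigi

    class ExtractCustomers(luigi.Task):
        """Write raw customer records to a local file (illustrative only)."""

        def output(self):
            return luigi.LocalTarget("customers_raw.csv")

        def run(self):
            with self.output().open("w") as f:
                # Stand-in for pulling data from a real source system.
                f.write("email\njane@EXAMPLE.com\n")

    class TransformCustomers(luigi.Task):
        """Clean the extracted records; depends on ExtractCustomers."""

        def requires(self):
            return ExtractCustomers()

        def output(self):
            return luigi.LocalTarget("customers_clean.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                for line in src:
                    dst.write(line.strip().lower() + "\n")

    if __name__ == "__main__":
        # Luigi resolves the dependency graph and runs tasks in the right order.
        luigi.build([TransformCustomers()], local_scheduler=True)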

Implementing a proper data integration pipeline and a well-defined data model helps document your data flows and makes them traceable. More importantly, it simplifies the ETL development process, because it lets you easily re-run the whole process or parts of it. And you will have to modify and re-run your ETL, because your code changes, your output requirements change or the data changes.

Ozelot gives you core functionality to quickly start building your own solution, including an ORM base class, database connection management and Luigi task classes that play nice with the ORM. It comes with extensively documented examples that walk you through various aspects of data integration.
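
The sketch below shows the kind of ORM layer this involves, using plain SQLAlchemy rather than Ozelot's actual base classes: a declarative base, one example model and simple connection management. Model, table and column names are assumptions for illustration only.

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Customer(Base):
        """Example model; table and column names are purely illustrative."""
        __tablename__ = "customers"

        id = Column(Integer, primary_key=True)
        email = Column(String)
        full_name = Column(String)

    # Connection management: one engine and one session factory per pipeline run.
    engine = create_engine("sqlite:///example.db")
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)

    # A task's run() method would typically open a session, add objects and commit.
    session = Session()
    session.add(Customer(email="jane@example.com", full_name="Jane Doe"))
    session.commit()
    session.close()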

The proposed setup works well for many small- to medium-sized projects. For big-data projects, or those requiring live streaming data, you will probably want to consider alternative solutions.

Ozelot is hosted on GitHub. The documentation is on readthedocs.

A scripted, open-source approach

The Ozelot ETL solution is fully ‘scripted’ – it is built entirely in Python source code. A scripted approach provides the benefits of

  • Having your complete data processing workflow represented in code, making it reproducible and providing full ‘documentation’ of the process (if your code is readable...);
  • Providing a fully automated solution (you will run your process many times);
  • Enabling the re-use of functionality across the project, avoiding building common functionality several times;
  • Giving you unbeatable flexibility and functionality, by leveraging the complete Python package ecosystem;
  • Being able to use standard tools like git for revision control and collaborative development, which is not easily possible for GUI-based tools;
  • Enabling reasonable debugging using standard tools.

Furthermore, the Ozelot ETL stack uses only open-source libraries. This has the advantages of

  • Providing full transparency into both your ETL solution and the underlying libraries;
  • Allowing you to use your solution in any way you care to imagine; and
  • Avoiding license fees.

Core principles of data integration

Taking one step back, I propose the following core principles for maintainable data integration:

  1. Any and all data manipulation happens in the pipeline, in a single code base.
  2. The pipeline represents all dependencies between data integration tasks.
  3. Each task has a method for rolling back its operations (see the sketch after this list).
  4. Data is loaded into a single database, in a clearly defined, object-based data model that also encodes object relationships.
  5. The whole process is fully automatic and thereby reproducible and traceable.
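
As a sketch of principle 3, the following hypothetical Luigi task writes objects through the ORM in run() and provides a custom rollback() method that removes exactly what it wrote. Luigi itself has no rollback concept; the method name, the model and the connection string are assumptions for illustration, in the spirit of the task classes Ozelot provides.

    import luigi
    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    # Single target database and data model (principle 4); names are illustrative.
    Base = declarative_base()
    engine = create_engine("sqlite:///example.db")
    Session = sessionmaker(bind=engine)

    class Customer(Base):
        __tablename__ = "customers"
        id = Column(Integer, primary_key=True)
        email = Column(String)

    class LoadCustomers(luigi.Task):
        """Loads customer objects and knows how to undo its own work."""

        def run(self):
            # Forward direction: create the schema and insert objects.
            Base.metadata.create_all(engine)
            session = Session()
            session.add(Customer(email="jane@example.com"))
            session.commit()
            session.close()

        def rollback(self):
            # Principle 3: remove exactly what this task wrote, so the task
            # (and anything downstream of it) can be re-run cleanly.
            session = Session()
            session.query(Customer).delete()
            session.commit()
            session.close()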