vsupalov


What Is the Definition of ETL and How Does It Differ From Data Pipelines?

A concise explanation and overview.

November 6, 2015

ETL is an acronym that stands for three data processing steps: Extract, Transform and Load. ETL tools and frameworks are meant to do basic data plumbing: ingest data from many sources, perform some basic operations on it and finally save it to a target datastore (usually a database or a data warehouse). The term has quite some history and, similar to BI, has had time to accumulate all kinds of meanings, depending on who you ask.
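To make the three steps concrete, here is a minimal sketch in plain Python. It assumes a small in-memory CSV string as the source and an in-memory SQLite database as the target. The sample data, the `users` table and the function names are all made up for illustration, and a real ETL tool would add scheduling, error handling and much more.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file, an API or a database.
RAW_CSV = """name,signup_date,plan
 Alice ,2015-11-01,pro
Bob,2015-11-03,free
"""


def extract(text):
    """Extract: read rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace from names and upper-case the plan."""
    return [
        (row["name"].strip(), row["signup_date"], row["plan"].upper())
        for row in rows
    ]


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (name TEXT, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three steps in order: extract, transform, load."""
    load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")  # assumed target; any database would do
run_etl(RAW_CSV, conn)
```

Each step is a separate function, so the source or the target can be swapped without touching the other two.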

A data pipeline encompasses the complete journey of data inside a company. It can contain various ETL jobs as well as more elaborate data processing steps, and while ETL tends to describe batch-oriented data processing strategies, a data pipeline can also contain near-realtime streaming components. Current state-of-the-art tools for building data pipelines include Luigi (published by Spotify) and Airflow (published by Airbnb). Both fall into the category of workflow engines and are used in production by many mature companies to orchestrate very impressive data processing pipelines.
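The core idea behind a workflow engine is that tasks declare what they depend on, and the engine works out a valid execution order. The sketch below illustrates only that idea, using the standard library's `graphlib` (Python 3.9+). It is not the Luigi or Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each one maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_orders_users": {"extract_orders", "extract_users"},
    "daily_report": {"join_orders_users"},
}


def run_pipeline(deps, run_task):
    """Run every task after all of its dependencies have run."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        run_task(task)
    return order


executed = []
# Recording the task name stands in for doing the actual work.
order = run_pipeline(dependencies, executed.append)
```

Real workflow engines build on this with scheduling, retries, monitoring and distributed execution, which is where the setup effort goes.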

If you only want to get your data from one database to another, ETL is the term you should look for. Don’t roll your own solution, but rather use an existing tool or framework. Otherwise it will get very messy eventually, and it will probably be hard for anybody but yourself to maintain. If, however, your team is going to grow and the ways you use your data are not limited to plumbing it from one database into another, do take a look at the aforementioned workflow engines. They require a bit of finesse, effort and DevOps skill to set up properly, but won’t hold you back once the company and its data start growing.
