Understand the concept of an ETL (Extract, Transform, Load) pipeline in Node.js.
By Mario Kandut
ETL stands for extract, transform, and load. It is the process of extracting data from one or more sources, transforming it, and loading it into a destination. It is also an approach to data brokering.
This is a common approach for moving data from one location to another while transforming the structure of the data before it is loaded into its destination.
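ETL is a process with three separate steps and is often called a pipeline, because data moves through these three steps.
Steps in an ETL pipeline:
Extract: read the data from its source or sources.
Transform: convert the data into the structure the destination expects.
Load: write the transformed data into the destination.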
ETL solves the problem of having data in different places and disparate formats by allowing you to pull data from different sources into a centralized location with a standardized format. ETL pipelines are typically run as batch jobs. This means all the data is moved at once.
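The three steps can be sketched in plain Node.js as three functions that run in sequence. This is only a minimal illustration, not a production pipeline: the two sources (a CSV string and a JSON string), the field names, and the in-memory `warehouse` array standing in for the destination are all assumptions made up for this example.

```javascript
// Hypothetical sources: in a real pipeline these would be files, APIs or databases.
const csvSource = 'id,name\n1,Alice\n2,Bob';
const jsonSource = '[{"userId": 3, "fullName": "Carol"}]';

// Extract: read the raw records from each source.
function extract() {
  const [header, ...rows] = csvSource.split('\n');
  const keys = header.split(',');
  const fromCsv = rows.map((row) => {
    const values = row.split(',');
    return Object.fromEntries(keys.map((key, i) => [key, values[i]]));
  });
  const fromJson = JSON.parse(jsonSource);
  return { fromCsv, fromJson };
}

// Transform: bring both record shapes into one standardized format.
function transform({ fromCsv, fromJson }) {
  return [
    ...fromCsv.map((r) => ({ id: Number(r.id), name: r.name })),
    ...fromJson.map((r) => ({ id: r.userId, name: r.fullName })),
  ];
}

// Load: write the standardized records to the destination (an array here).
function load(records, destination) {
  destination.push(...records);
  return destination.length;
}

// Run the whole batch at once.
const warehouse = [];
const loaded = load(transform(extract()), warehouse);
console.log(`Loaded ${loaded} records`);
```

Keeping each step as its own function means a source or destination can be swapped without touching the other two steps.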
A common use case for an ETL pipeline is data analytics: data is extracted from its sources, transformed into a structure suited to analysis, and loaded into a central data store.
Another use case would be to periodically move stored data to a new database in a different format than it is currently stored in. Let's imagine you are a company with stores around the globe, which make transactions in local currencies, and every store reports its revenue to the head office at the end of the month. You could use an ETL pipeline here to better analyze the data from each store. The first step would be to extract the data from the reports, then transform the different currency amounts into a single base currency, and finally load the modified report data into a reporting database.
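The transform step of the store example could look roughly like the sketch below. The store names, revenue figures, and exchange rates are invented for illustration (the rates are not real market data), and the `reportingDb` array is a stand-in for a real reporting database.

```javascript
// Hypothetical monthly reports, as they might look after the extract step.
const reports = [
  { store: 'Vienna', revenue: 1000, currency: 'EUR' },
  { store: 'London', revenue: 800, currency: 'GBP' },
  { store: 'New York', revenue: 1200, currency: 'USD' },
];

// Illustrative exchange rates to the base currency (USD), made up for this example.
const ratesToUsd = { USD: 1, EUR: 1.1, GBP: 1.25 };

// Transform: convert one report into the base currency.
function toBaseCurrency(report, rates) {
  const rate = rates[report.currency];
  if (rate === undefined) {
    throw new Error(`No exchange rate for ${report.currency}`);
  }
  // Round to two decimal places to avoid floating-point noise.
  const revenueUsd = Math.round(report.revenue * rate * 100) / 100;
  return { store: report.store, revenueUsd };
}

// Load: push the converted reports into the (in-memory) reporting database.
const reportingDb = [];
for (const report of reports) {
  reportingDb.push(toBaseCurrency(report, ratesToUsd));
}
```

Failing loudly on an unknown currency is a deliberate choice here: a silently skipped report would make the head office totals wrong without anyone noticing.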
An ETL pipeline is a practical choice for migrating large amounts of data, like converting hundreds of gigabytes of data stored in flat files into a new format, or computing new data based on those hundreds of gigabytes. In general, ETL is a great fit for:
An ETL process can be computationally intensive, sometimes requires access to data that is not available in real time, and often handles a massive amount of data. Therefore, ETL processes are typically executed on a batch of data. This means that an ETL process does not run 24/7, and that the data in the destination lags behind the actual state of the source data, sometimes by minutes, though it could be days. The entire ETL pipeline takes time to extract, transform and load all the required data, so it usually runs on a schedule.
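A scheduled batch run, and the lag it causes, can be sketched as follows. Everything here is an assumption for illustration: the in-memory `sourceRecords` and `destinationRecords` arrays, the idea that record ids only ever increase, and the use of `setInterval` as the scheduler (in practice a cron job or a dedicated job scheduler is a common alternative).

```javascript
// Hypothetical source that keeps receiving records between runs.
const sourceRecords = [{ id: 1 }, { id: 2 }];
const destinationRecords = [];
// Assumes ids only ever increase, so this marks how far the last run got.
let lastLoadedId = 0;

// One batch run: move everything that arrived since the previous run.
function runBatch() {
  const batch = sourceRecords.filter((r) => r.id > lastLoadedId);
  for (const record of batch) {
    destinationRecords.push({ ...record, loadedAt: new Date().toISOString() });
    lastLoadedId = record.id;
  }
  return batch.length;
}

// Run the batch on a fixed interval; returns a function that stops the schedule.
function scheduleBatch(intervalMs) {
  const timer = setInterval(runBatch, intervalMs);
  return () => clearInterval(timer);
}

runBatch(); // the first run picks up records 1 and 2
sourceRecords.push({ id: 3 }); // arrives after the run, so the destination now lags
```

Calling `const stop = scheduleBatch(60 * 60 * 1000)` would run the batch roughly once per hour, and record 3 would only reach the destination on the next run. Call `stop()` to end the schedule.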
Thanks for reading and if you have any questions, use the comment function or send me a message @mariokandut.