ETL: Extract Data with Node.js

Create the first step for an ETL pipeline in Node.

By Mario Kandut


This article is based on Node v16.14.0.

ETL is a process of extracting, transforming, and loading data from one or more sources into a destination. Have a look at the article ETL pipeline explained for a general overview of ETL pipelines.


This is the first article of a series about the extract phase in an ETL pipeline.

Extract Data in ETL pipeline

The first step in an ETL pipeline is to Extract the data, which we are going to Transform and Load in future steps. In the extract phase, the main decisions are which data sources to extract from and how exactly to access them (API, authorization, DB, ...).

In the example, jsonplaceholder.typicode.com will be used. It is a REST API with a lot of options, and it is free for testing and prototyping. We are going to create two functions: one to extract the data, and another to orchestrate the different stages of the ETL pipeline.

In a real-world example the data source would likely be a database, but for this example the placeholder API is fine. The same approach has to be followed; only the interface to the data source (directly connecting and querying, or going through middleware) would be different.
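As a rough sketch of what the extract step could look like against a database instead: the `pg` package, the `photos` table, and the column names below are assumptions for illustration, not part of this article. The client is injected so the function stays easy to test without a real database connection.

```javascript
// Hypothetical extract step against a Postgres source (sketch only).
// `client` would typically be a connected pg.Client or pg.Pool instance.
async function getPhotosFromDb(client, albumId) {
  const { rows } = await client.query(
    'SELECT album_id AS "albumId", id, title, url FROM photos WHERE album_id = $1',
    [albumId],
  );
  return rows;
}

module.exports = { getPhotosFromDb };
```

The rest of the pipeline would stay the same; only this extract function changes with the data source.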

Important in the Extract phase is:

  • Select the data source
  • Decide what data to extract
  • Decide how to query or retrieve the data (the available methods depend on the data source)

Let's start by creating the basics and then go through the steps:

    1. Query the data source
    2. Set up the pipeline
    3. Wait for multiple requests

Create a project folder.

mkdir node-etl

Initialize the project with npm init -y to be able to install node packages.

cd node-etl
npm init -y

Install node-fetch. Since the example code uses require() (CommonJS), install version 2 — node-fetch v3 and above is ESM-only.

npm install node-fetch@2

1. Query the data source

Create an extract.js file to handle querying the data.

touch extract.js

If you want to learn more about making API requests, check out How to make an API request in Node.

Add test code. We are going to query the photos for a specific album.

const fetch = require('node-fetch');

const URL = 'https://jsonplaceholder.typicode.com/albums';

// GET all photos of an album from the placeholder API
async function getPhotos(albumId) {
  const response = await fetch(`${URL}/${albumId}/photos`);
  return response.json();
}

module.exports = { getPhotos };

The getPhotos() async function takes an albumId as an argument, makes a GET request to the fake API, and returns a promise that resolves to an array where each object has the following interface:

interface Photo {
  albumId: number;
  id: number;
  title: string;
  url: string;
  thumbnailUrl: string;
}

For the example, we will skip the validation of the albumId input argument; it should not exceed 100, or an empty array will be returned. We are exporting the getPhotos function so we can import it in index.js, which will orchestrate the ETL pipeline. Have a look at the How to organize Node article for maintainable Node.js code.
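If you did want to guard that input, a minimal sketch could look like the following. The function name is illustrative (not part of the article's code); the 1–100 range comes from the placeholder API exposing 100 albums.

```javascript
// Minimal input guard for albumId (jsonplaceholder exposes albums 1-100).
function assertValidAlbumId(albumId) {
  if (!Number.isInteger(albumId) || albumId < 1 || albumId > 100) {
    throw new RangeError(
      `albumId must be an integer between 1 and 100, got: ${albumId}`,
    );
  }
  return albumId;
}
```

A call to it at the top of getPhotos() would turn a silent empty result into an explicit error.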

2. Set up the pipeline

Create an index.js file, which will be the entry point of the application and used to orchestrate the ETL pipeline.

touch index.js

We are going to create an orchestrateEtlPipeline() function, which will coordinate each step in the ETL pipeline: extracting data, transforming it and loading it into its destination.

const { getPhotos } = require('./extract');

const orchestrateEtlPipeline = async () => {
  try {
    const photosAlbum = await getPhotos(1);
    console.log(photosAlbum);

    // TODO - TRANSFORM

    // TODO - LOAD
  } catch (error) {
    console.error(error);
  }
};

orchestrateEtlPipeline();

The orchestrateEtlPipeline() function uses async/await in a try/catch block to handle any errors or promise rejections.

3. Wait for multiple requests

In the example code, we are only making one request, but what if we want to make more requests to the same data source, or extract data from multiple sources? With Promise.all we can wait for all extracted data before proceeding. However, this method fires all requests at once, which can overwhelm some sources (think of multiple intensive requests to a DB). Read more about Multiple Promises in Node.js.

An example for requesting photos from multiple albums would look something like this:

const { getPhotos } = require('./extract');

const orchestrateEtlPipeline = async () => {
  try {
    const allPhotos = Promise.all([
      getPhotos(1),
      getPhotos(2),
      getPhotos(3),
    ]);
    const [photos1, photos2, photos3] = await allPhotos;

    console.log(photos1[0], photos2[0], photos3[0]); // to log the first photo object of all three albums

    // TODO - TRANSFORM

    // TODO - LOAD
  } catch (error) {
    console.error(error);
  }
};

orchestrateEtlPipeline();
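If firing all requests at once would overwhelm the source, one alternative (a sketch, not from the article) is to await each extraction in turn, so only one request is in flight at a time. The helper below takes the extract function as a parameter, so it works with getPhotos or any other async extractor:

```javascript
// Sequential alternative to Promise.all: one request in flight at a time.
// `extract` is any async function, e.g. getPhotos from extract.js.
async function extractSequentially(extract, ids) {
  const results = [];
  for (const id of ids) {
    results.push(await extract(id)); // wait before starting the next request
  }
  return results;
}
```

This trades total throughput for a gentler load on the data source; batching (a few requests at a time) would be a middle ground.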

Once the data is extracted, the next step in the ETL pipeline is to transform it.

TL;DR

  • The first step in an ETL pipeline is to extract the data from the data source.
  • Important in this phase is to select the data source, decide what data to extract, and decide how to query or retrieve it.

Thanks for reading and if you have any questions, use the comment function or send me a message @mariokandut.

If you want to know more about Node, have a look at these Node Tutorials.

References (and Big thanks):

HeyNode, MDN async/await
