ETL: Transform Data with Node.js

Create the second phase for an ETL pipeline in Node.

By Mario Kandut


This article is based on Node v16.14.0.

ETL is a process of extracting, transforming, and loading data from one or multiple sources into a destination. Have a look at the article ETL pipeline explained for a general overview on ETL pipelines.


This is the second article in a three-part series, and it covers the transform phase of an ETL pipeline.

Transform Data in ETL pipeline

The second phase in an ETL pipeline is to Transform the extracted data. The data can be completely reformatted in this phase: fields can be renamed, new fields added, unneeded data filtered out, and so on. The transform phase is responsible for reshaping the data into the format its destination requires. In this step you can clean data, standardize values and fields, and aggregate values.

We are going to continue with the example used in the article ETL: Extract Data with Node.js.

1. Determine the new structure of the data

The first step in the Transform phase should be to determine what the new data structure should be. In the example we are extracting photo albums, which are arrays of photo objects. For the transformation, the unneeded thumbnailUrl property should be removed, and a new property name with the value Mario (or whatever string value you like) should be added to each photo object. In addition, a timestamp with the current time should be added to the array of photo albums.

Old photo objects interface:

interface Photo {
  albumId: number;
  id: number;
  title: string;
  url: string;
  thumbnailUrl: string;
}
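For concreteness, a raw photo object matching this interface looks like the following (the values are illustrative placeholders, not real API data):

```javascript
// Illustrative raw photo object before transformation. The shape
// matches the Photo interface above; the values are placeholders.
const rawPhoto = {
  albumId: 1,
  id: 1,
  title: 'accusamus beatae ad facilis',
  url: 'https://via.placeholder.com/600/92c952',
  thumbnailUrl: 'https://via.placeholder.com/150/92c952',
};

console.log(Object.keys(rawPhoto).length); // 5 properties before transform
```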

Interface of the transformed photo object:

interface Photo {
  albumId: number;
  id: number;
  name: string;
  title: string;
  url: string;
}

The interface for the photo albums is currently an array of photo objects:

Array<Photo>

New Interface for photoAlbums:

interface PhotoAlbums {
  timeStamp: Date;
  data: Array<Photo>;
}

2. Create a transform function

Create another file transform.js in the project folder, which is going to contain the transform functions.

touch transform.js

Create a transform function for transforming the photo object. It takes a photo object as input, returns only the needed properties, and adds the name property with a string value.

function transformPhoto(photo) {
  return {
    albumId: photo.albumId,
    id: photo.id,
    name: 'Mario',
    title: photo.title,
    url: photo.url,
  };
}

module.exports = { transformPhoto };

A second function has to be created for transforming the photoAlbum: a timeStamp property with the current time is added, and the array of photos is moved into the new data property.

function addTimeStamp(photoAlbum) {
  return {
    data: photoAlbum,
    timeStamp: new Date(),
  };
}

module.exports = { transformPhoto, addTimeStamp };
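As a design note, the same result can be reached with object rest/spread syntax, which drops thumbnailUrl without listing every kept property. This is just an alternative sketch; the helper name transformPhotoRest and the name parameter are illustrative, not part of the original code:

```javascript
// Alternative sketch using object rest/spread: strip thumbnailUrl and
// add name in one step. transformPhotoRest is an illustrative name,
// not part of the original transform.js.
function transformPhotoRest(photo, name = 'Mario') {
  const { thumbnailUrl, ...rest } = photo; // thumbnailUrl is discarded
  return { ...rest, name };
}

const sample = {
  albumId: 1,
  id: 1,
  title: 'accusamus',
  url: 'https://via.placeholder.com/600/92c952',
  thumbnailUrl: 'https://via.placeholder.com/150/92c952',
};

console.log(transformPhotoRest(sample)); // thumbnailUrl removed, name added
```

The trade-off: the explicit version documents the output shape in one place, while the rest/spread version keeps working if harmless new fields appear in the source data.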

3. Add transform phase in ETL orchestrate function

We are going to use the example with multiple requests for getting photos, since one request is boring. 😀 Now we have to require both functions in index.js. In orchestrateEtlPipeline(), after the requests are done, we map over each photo object in each photoAlbum to apply the transformation with the transformPhoto() function. Then we output the result.

const { getPhotos } = require('./extract');
const { addTimeStamp, transformPhoto } = require('./transform');

const orchestrateEtlPipeline = async () => {
  try {
    // EXTRACT
    const allPhotoAlbums = Promise.all([
      getPhotos(1),
      getPhotos(2),
      getPhotos(3),
    ]);
    const [
      photoAlbum1,
      photoAlbum2,
      photoAlbum3,
    ] = await allPhotoAlbums;

    // TRANSFORM
    let transformedPhotoAlbum1 = photoAlbum1.map(photo =>
      transformPhoto(photo),
    );
    let transformedPhotoAlbum2 = photoAlbum2.map(photo =>
      transformPhoto(photo),
    );
    let transformedPhotoAlbum3 = photoAlbum3.map(photo =>
      transformPhoto(photo),
    );

    console.log(
      transformedPhotoAlbum1[0],
      transformedPhotoAlbum2[0],
      transformedPhotoAlbum3[0],
    ); // log first photo object of each transformed photoAlbum

    // TODO - LOAD
  } catch (error) {
    console.error(error);
  }
};

orchestrateEtlPipeline();

The transformation of the photo object is complete: the output should contain only the five properties albumId, id, name, title, and url, with the thumbnailUrl property removed. Now we have to transform each photoAlbum with addTimeStamp() and log the timeStamp.

const { getPhotos } = require('./extract');
const { addTimeStamp, transformPhoto } = require('./transform');

const orchestrateEtlPipeline = async () => {
  try {
    // EXTRACT
    const allPhotoAlbums = Promise.all([
      getPhotos(1),
      getPhotos(2),
      getPhotos(3),
    ]);
    const [
      photoAlbum1,
      photoAlbum2,
      photoAlbum3,
    ] = await allPhotoAlbums;

    // TRANSFORM
    let transformedPhotoAlbum1 = photoAlbum1.map(photo =>
      transformPhoto(photo),
    );
    let transformedPhotoAlbum2 = photoAlbum2.map(photo =>
      transformPhoto(photo),
    );
    let transformedPhotoAlbum3 = photoAlbum3.map(photo =>
      transformPhoto(photo),
    );

    console.log(
      transformedPhotoAlbum1[0],
      transformedPhotoAlbum2[0],
      transformedPhotoAlbum3[0],
    ); // log first photo object of each transformed photoAlbum

    transformedPhotoAlbum1 = addTimeStamp(transformedPhotoAlbum1);
    transformedPhotoAlbum2 = addTimeStamp(transformedPhotoAlbum2);
    transformedPhotoAlbum3 = addTimeStamp(transformedPhotoAlbum3);

    console.log(
      transformedPhotoAlbum1.timeStamp,
      transformedPhotoAlbum2.timeStamp,
      transformedPhotoAlbum3.timeStamp,
    ); // log timestamp
    console.log(transformedPhotoAlbum1);

    // TODO - LOAD
  } catch (error) {
    console.error(error);
  }
};

orchestrateEtlPipeline();
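The three near-identical TRANSFORM blocks above can also be collapsed into a single map over an array of albums. A minimal self-contained sketch (the transform functions are repeated so the snippet runs standalone, and the extract step is stubbed with placeholder data instead of the real HTTP requests):

```javascript
// Repeated here so the sketch is self-contained; in the project these
// come from require('./transform').
function transformPhoto(photo) {
  return {
    albumId: photo.albumId,
    id: photo.id,
    name: 'Mario',
    title: photo.title,
    url: photo.url,
  };
}

function addTimeStamp(photoAlbum) {
  return { data: photoAlbum, timeStamp: new Date() };
}

// Stubbed extract step: placeholder albums instead of getPhotos() calls.
const albums = [
  [{ albumId: 1, id: 1, title: 'a', url: 'u1', thumbnailUrl: 't1' }],
  [{ albumId: 2, id: 2, title: 'b', url: 'u2', thumbnailUrl: 't2' }],
];

// One map call replaces the three near-identical TRANSFORM blocks.
const transformed = albums.map(album =>
  addTimeStamp(album.map(transformPhoto)),
);

console.log(transformed[0].timeStamp instanceof Date); // true
console.log(transformed[0].data[0].thumbnailUrl); // undefined
```

In the real pipeline you would map over the result of Promise.all the same way, which also makes it trivial to add a fourth album without copying another block.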

After the last step is finished, we are ready for the next phase of the ETL pipeline, Load, which handles loading the transformed data into its destination.

TL;DR

  • The second phase in an ETL pipeline is to transform the data.
  • The first step in the transform phase is to determine what the new data structure should be.
  • The second step is to transform the data in the desired format.

Thanks for reading and if you have any questions, use the comment function or send me a message @mariokandut.

If you want to know more about Node, have a look at these Node Tutorials.

References (and Big thanks):

HeyNode, OsioLabs, MDN async/await
