Easily build ETL Pipeline using Python and Airflow

Reading Time: 5 minutes

Apache Airflow is an open-source workflow management platform for programmatically authoring, scheduling, and monitoring workflows or data pipelines. Airflow itself is written in Python, and workflows are defined as Python scripts. It was originally created at Airbnb. In this blog, we will show how to configure Airflow on our machine, write a Python script for extracting, transforming, and loading (ETL) data, and run the data pipeline that we have built.

In this blog, we will cover:

  • What is a Workflow?
  • Hands-on
  • Conclusion

What is a Workflow?

  • A sequence of tasks that is started on a schedule or triggered by an event.
  • Frequently used to handle big data processing pipelines.

Airflow models a workflow as a Directed Acyclic Graph (DAG), in which multiple tasks can be executed independently.

In an example DAG, each node is a task and the directed edges between tasks define the order in which they run.

Hands-on

Things we will do to create our first Airflow ETL pipeline in this blog:

  • Set up Airflow and the VS Code editor
  • Download dummy cat facts data from an API
  • Transform the data
  • Load the data into a CSV file

Setting up Airflow and the VS Code editor

  • Install Oracle VirtualBox to run an Ubuntu Server virtual machine. Also, download the Ubuntu Server ISO file before moving forward.
  • Open the Oracle VM VirtualBox Manager and click on the ‘New’ button.
  • Enter a name for your Ubuntu server and click Next through the next few steps to complete the setup, going with the recommended choices.
  • Once you have completed the setup, you will see the new virtual machine listed in the VirtualBox Manager.
  • Now let us configure the network: go to Settings for the created machine, open Network, and under Advanced, click on the Port Forwarding button.

Here, add the two port-forwarding rules shown below:


Guest port 8080 forwarded to host port 8250 will be used for the Airflow UI.

Guest port 22 forwarded to host port 2222 will be used for the SSH connection.

  • Run the virtual machine and install the OS
  • The installer will then guide you through the Ubuntu Server installation.
  • Now, create a user (here it is “airflow”) by running ‘sudo adduser airflow’ and typing in a password.
  • Establish an SSH connection.

On the Windows host machine, open a terminal and connect by typing:

ssh -p 2222 airflow@localhost

  • Install Python and Airflow on the virtual Linux machine

To install pip for Python 3, run the command:

sudo apt install python3-pip

Following that, install Airflow with the following command:

sudo pip3 install apache-airflow

  • Initialize the database for Airflow:

airflow db init

  • Create an admin user in Airflow (you will be prompted to set a password):

airflow users create -u admin -f first_name -l last_name --role Admin -e your_email

  • Start the Airflow webserver as a daemon:

airflow webserver -D

After this step, the Airflow webserver is running. Open a web browser, go to localhost:8250, and log in with the user you just created; you will see the list of DAGs on the home page. Note that the scheduler must also be running for your DAGs to actually execute; you can start it in the background with airflow scheduler -D.

  • Configuring the VS Code editor
    • We will use the VS Code editor to write our Python scripts.
    • Install the Remote - SSH extension and connect to the host by typing ssh -p 2222 airflow@localhost
    • When asked where to save the connection configuration, select the first SSH configuration file offered.
    • When prompted for a password, enter the password you set for the Ubuntu airflow user.
    • Now open the /home/airflow folder.

After the setup, VS Code is connected to the virtual machine over SSH and the /home/airflow folder is open in the editor.

And that’s it: you’re ready to create your first Airflow DAG. Make sure to put your DAG files inside the dags folder under your Airflow home directory (you’ll have to create the folder first), as that’s where Airflow will try to find them.

Coding the Pipeline

We will now write a Python script for extracting, transforming, and loading (ETL) the data, and then run the data pipeline we have created.

Create a Python file in dags/cats_pipeline.py.

Extraction

In this example, we will extract some facts about cats from the catfact.ninja API. We use the requests library to call the API and get the JSON response.

xcom_push saves the fetched results to Airflow’s metadata database so that downstream tasks can read them with xcom_pull.
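A minimal sketch of the extract task’s callable is shown below (the helper name and the exact response shape are assumptions; adapt them to the response you actually get from the API):

import requests

def extract_cat_facts(ti):
    # Call the catfact.ninja API for one page of cat facts.
    response = requests.get("https://catfact.ninja/facts", params={"limit": 10})
    response.raise_for_status()
    # The paginated response keeps the records under the "data" key (an assumption).
    facts = response.json()["data"]
    # Push the raw records to XCom so the downstream transform task can read them.
    ti.xcom_push(key="cat_facts", value=facts)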

Transformation

Here, we have created a dummy function that does an xcom_pull to get the extracted data and transforms it to our requirements.

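As a sketch, and assuming the field names returned by the API call above, the transform callable could look like this:

def transform_cat_facts(ti):
    # Pull the raw records pushed by the extract task.
    facts = ti.xcom_pull(task_ids="extract_cat_facts", key="cat_facts")
    # Keep only the fields we need; any real cleaning logic would go here.
    transformed = [{"fact": item["fact"], "length": item["length"]} for item in facts]
    # Push the cleaned records for the load task.
    ti.xcom_push(key="transformed_cat_facts", value=transformed)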

Loading

Finally, we do an xcom_pull to get the transformed cats data and save it to a CSV file.

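A sketch of the load callable (the output path is an assumption; use whatever location suits your setup):

import csv

def load_cat_facts(ti):
    # Pull the transformed records pushed by the transform task.
    rows = ti.xcom_pull(task_ids="transform_cat_facts", key="transformed_cat_facts")
    # Write them to a CSV file on the virtual machine (the path is an assumption).
    with open("/home/airflow/cat_facts.csv", "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["fact", "length"])
        writer.writeheader()
        writer.writerows(rows)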

Directed Acyclic Graph using Python

This code arranges the extract, transform, and load tasks into a DAG and turns them into a pipeline. Apache Airflow is built around the idea of DAGs (Directed Acyclic Graphs): we define a task for each piece of our pipeline and then declare the order in which they run. We will use PythonOperator-based tasks.

Code for DAG using PythonOperator:

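Here is a sketch of the DAG definition, assuming Airflow 2.x and that the three callables sketched above live in the same dags/cats_pipeline.py file (the start date and schedule are placeholder values):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="etl_cats_data",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One PythonOperator task per step of the pipeline.
    extract = PythonOperator(task_id="extract_cat_facts", python_callable=extract_cat_facts)
    transform = PythonOperator(task_id="transform_cat_facts", python_callable=transform_cat_facts)
    load = PythonOperator(task_id="load_cat_facts", python_callable=load_cat_facts)

    # The bitshift operator declares the (acyclic) execution order.
    extract >> transform >> load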

Run the Pipeline

Now, go to the Airflow UI at localhost:8250 and you will see the DAG etl_cats_data.

Initially, the pipeline is paused, so we have to trigger the DAG manually. Switch the pipeline on and click the Play button; you can now see the pipeline running.

After the pipeline has completed, we can check whether the data has been loaded.
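For example, a quick way to inspect the output (assuming the CSV path used in the load sketch above):

import csv

# Print the first few rows of the CSV produced by the load task.
with open("/home/airflow/cat_facts.csv", newline="") as csv_file:
    for index, row in enumerate(csv.DictReader(csv_file)):
        print(row)
        if index >= 4:
            break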

The data load was a success.

Conclusion

Apache Airflow simplifies the creation of data pipelines while also streamlining their management and scheduling. It is widely used in the software industry for orchestrating both ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) workloads for data warehouse applications. Using this blog, you will be able to create your first data pipeline, and you can also use it as a template to create more pipelines based on your needs. We will come up with more such use cases in our upcoming blogs.

Meanwhile…

If you are an aspiring Python developer and want to explore more about the above topics, here are a few of our blogs for your reference:

Stay tuned to get all the updates about our upcoming blogs on the cloud and the latest technologies.

Keep Exploring -> Keep Learning -> Keep Mastering 
At Workfall, we strive to provide the best tech and pay opportunities to kickass coders around the world. If you’re looking to work with global clients, build cutting-edge products and make big bucks doing so, give it a shot at workfall.com/partner today!
