How to Create Your First DAG in Airflow?
Last Updated: 04 Jan, 2025
A Directed Acyclic Graph (DAG) is the collection of all the individual tasks we want to run, organized in an ordered fashion that reflects their dependencies. In other words, a DAG is a data pipeline in Airflow. In a DAG:
- There are no loops
- Edges are directed
Key Terminologies:
- Operator: An operator describes a single task in your DAG. In Airflow, the nodes of the DAG are created by instantiating operators.
- Dependencies: The specified relationships between your operators are known as dependencies. In Airflow, the directed edges of the DAG represent dependencies.
- Tasks: Tasks are the units of work in Airflow. Each task is created by instantiating an operator (sensors are a specialized kind of operator), while hooks are interfaces to external systems that operators use internally.
- Task Instances: A task instance is a specific run of a task at a point in time. These are the runnable entities, and each task instance belongs to a DagRun.
A DAG file is a Python file that defines both the structure and the code of the DAG.
Steps To Create an Airflow DAG
- Importing the right modules for your DAG
- Creating default arguments for the DAG
- Creating a DAG Object
- Creating tasks
- Setting up dependencies for the DAG
Now, let’s discuss these steps one by one in detail and create a simple DAG.
Step 1: Importing the right modules for your DAG
To create a DAG, we first import the modules that the DAG file needs. The first and most important import is the DAG class from the airflow package, which is used to instantiate the DAG object. Next, we import the datetime utilities used for scheduling, and finally the operators we will use in the DAG file. Here, we will only import the DummyOperator.
# To initiate the DAG Object
from airflow import DAG
# Importing datetime and timedelta modules for scheduling the DAGs
from datetime import timedelta, datetime
# Importing operators
from airflow.operators.dummy_operator import DummyOperator
Step 2: Creating default arguments for the DAG
default_args is a dictionary that we pass to the DAG object; it contains default metadata that is applied to every operator in the DAG, so we do not have to repeat the same arguments on each task.
Let’s create a dictionary named default_args
# Initiating the default_args
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2022, 11, 12)
}
- owner is the name of the owner of the DAG, as shown in the Airflow UI
- start_date is the date from which the DAG starts being scheduled
We can add more such parameters to default_args as per our requirements, for example:
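Here is a sketch of a fuller default_args dictionary. retries, retry_delay, and email_on_failure are standard operator arguments in Airflow, but the values shown are purely illustrative.
# Initiating the default_args with a few more (illustrative) defaults
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2022, 11, 12),
    'retries': 2,                          # retry a failed task twice
    'retry_delay': timedelta(minutes=5),   # wait 5 minutes between retries
    'email_on_failure': False              # do not send an email when a task fails
}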
Step 3: Creating the DAG Object
After defining the default_args, we create a DAG object by passing a unique identifier called "dag_id". Here, we name it DAG-1.
So, let’s create a DAG Object.
# Creating DAG Object
dag = DAG(dag_id='DAG-1',
          default_args=default_args,
          schedule_interval='@once',
          catchup=False
          )
Here,
- dag_id is the unique identifier for the DAG.
- schedule_interval defines how frequently the DAG is triggered. It can be a preset such as @once, @hourly, @daily, @weekly, @monthly, or @yearly, a cron expression (as in the sketch after this list), or None, which means the DAG is not scheduled and can only be triggered manually.
- catchup – if we only want the DAG to run from the current interval onward, we set catchup to False. By default, catchup is True, which means Airflow will backfill runs for all past intervals between start_date and the current interval.
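For instance, here is a minimal sketch of a DAG object scheduled by a cron expression instead of a preset. The dag_id and the cron string are illustrative assumptions, and it reuses the default_args defined above.
# Illustrative alternative: a daily DAG scheduled by a cron expression
daily_dag = DAG(dag_id='daily-example',
                default_args=default_args,
                schedule_interval='0 6 * * *',  # every day at 06:00
                catchup=True                    # backfill runs since start_date
                )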
Step 4: Creating tasks
A task is an instance of an operator and has a unique identifier called task_id. There are many operators available; here we will use the DummyOperator to create two simple tasks:
# Creating first task
start = DummyOperator(task_id='start', dag=dag)
If you go to the Graph view in the Airflow UI, you can see that the task "start" has been created.
# Creating second task
end = DummyOperator(task_id='end', dag=dag)
Now the two tasks, start and end, have been created.
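DummyOperator tasks do no actual work. In a real pipeline you would typically use an operator that does something, such as the BashOperator; below is a minimal, illustrative sketch of such a task. The import path shown matches the classic module layout used in this tutorial, while newer Airflow releases expose the same operator under airflow.operators.bash.
# Illustrative example of a task that runs a shell command
from airflow.operators.bash_operator import BashOperator  # newer Airflow: from airflow.operators.bash import BashOperator

greet = BashOperator(task_id='print_hello',
                     bash_command='echo "hello from airflow"',
                     dag=dag)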
Step 5: Setting up dependencies for the DAG
Dependencies define the relationships between operators, i.e. the order in which the tasks in a DAG are executed. We set the order of execution using the bitwise shift operators: >> sets a downstream relationship and << sets an upstream one (a fan-out example is sketched after this list).
- a >> b means that first, a will run, and then b will run. It can also be written as a.set_downstream(b).
- a << b means that first, b will run which will be followed by a. It can also be written as a.set_upstream(b).
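The same shift operators also support fan-out and fan-in patterns. As a short illustrative sketch, assume four hypothetical DummyOperator tasks a, b, c, and d attached to the same dag:
# Hypothetical tasks, for illustration only
a = DummyOperator(task_id='a', dag=dag)
b = DummyOperator(task_id='b', dag=dag)
c = DummyOperator(task_id='c', dag=dag)
d = DummyOperator(task_id='d', dag=dag)
a >> [b, c]    # a runs first, then b and c can run in parallel
[b, c] >> d    # d runs only after both b and c have finished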
Now, let's set up the order of execution between the start and end tasks. Suppose we want start to run first and end to run after it.
# Setting up dependencies
start >> end
# We can also write it as start.set_downstream(end)
In the Graph view, the start task now points to the end task.
Putting all our code together:
# Step 1: Importing Modules
# To initiate the DAG Object
from airflow import DAG
# Importing datetime and timedelta modules for scheduling the DAGs
from datetime import timedelta, datetime
# Importing operators
from airflow.operators.dummy_operator import DummyOperator

# Step 2: Initiating the default_args
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2022, 11, 12),
}

# Step 3: Creating DAG Object
dag = DAG(dag_id='DAG-1',
          default_args=default_args,
          schedule_interval='@once',
          catchup=False
          )

# Step 4: Creating tasks
# Creating first task
start = DummyOperator(task_id='start', dag=dag)
# Creating second task
end = DummyOperator(task_id='end', dag=dag)

# Step 5: Setting up dependencies
start >> end
Now, we have successfully created our first DAG. We can open the Airflow webserver to see it in the UI.
You can click on the DAG and explore its different views in the Airflow UI.
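Note that on Airflow 2.3 and later, DummyOperator has been deprecated in favour of EmptyOperator, and DAGs are commonly defined with a context manager. A rough sketch of the same DAG in that style is shown below; exact module paths and parameter names (for example, schedule_interval versus the newer schedule) can vary between Airflow versions.
# Roughly equivalent DAG in the Airflow 2.x style
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='DAG-1',
         start_date=datetime(2022, 11, 12),
         schedule_interval='@once',
         catchup=False) as dag:
    start = EmptyOperator(task_id='start')
    end = EmptyOperator(task_id='end')
    start >> end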