Pre-requisite: Airflow
When working on large projects in a team, we need a workflow management tool that can keep track of many activities without descending into chaos. Airflow is one such workflow management tool.
Airflow is an open-source platform and workflow engine that schedules, monitors, and runs batch-oriented workflows, including complex data pipelines. In other words, it is a workflow orchestration tool commonly used in data transformation pipelines. Workflows are written in Python, which makes scheduling easy to express.
To understand Airflow scheduling, we should first understand the Airflow DAG. Let’s discuss that in detail.
Airflow DAG:
DAG stands for Directed Acyclic Graph. In Airflow, a DAG is a data pipeline defined in Python. In this graph, each node represents a task (which is responsible for completing a unit of work) and each edge represents a dependency between tasks, so a DAG is, overall, a collection of tasks organized so that their relationships and dependencies (as shown in the Airflow UI) are reflected. This model provides a simple way to execute a pipeline by dividing it into discrete, incremental tasks rather than depending on a single unit to perform all the work. Now let’s look at the properties that give a DAG its name; a minimal example follows the list below.
- Directed: If a DAG has multiple tasks, each task must have at least one defined upstream or downstream task.
- Acyclic: Tasks cannot depend on themselves, directly or through a chain of other tasks, so the graph can never produce an infinite loop.
- Graph: All tasks can be visualized in a graph structure, where each node represents a task (a unit of work) and each edge represents a dependency between tasks.
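As a minimal sketch, the Python code below defines a DAG with three tasks (nodes) chained by dependencies (edges). The dag_id, task ids, start date, and commands are illustrative values, not taken from this article.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal, illustrative DAG: three tasks (nodes) chained by dependencies (edges).
with DAG(
    dag_id="example_pipeline",            # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Edges: extract must finish before transform, which must finish before load.
    extract >> transform >> load

The >> operator is Airflow’s shorthand for declaring that the task on the left is upstream of the task on the right.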
Airflow Scheduler: Scheduling Concepts and Terminology
In simple words, scheduling is the process of assigning resources to perform tasks. A scheduler decides when and where each task will run. The Airflow scheduler keeps track of all tasks and DAGs and triggers tasks whose dependencies have been met.
In order to start the Airflow Scheduler service, all we need is one simple command:
airflow scheduler
Let’s understand what this command does. It starts the Airflow scheduler using the scheduler configuration specified in airflow.cfg. Once the scheduler is running, our DAGs start executing automatically based on start_date, schedule_interval, and end_date. All of these parameters and scheduling concepts are discussed in detail below.
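For reference, the scheduler-related settings live under the [scheduler] section of airflow.cfg. The snippet below is only a rough illustration; the values shown are the usual defaults, not recommendations from this article.

[scheduler]
# How often (in seconds) the scheduler records a heartbeat to show it is alive.
scheduler_heartbeat_sec = 5
# Minimum number of seconds between re-parses of the same DAG file.
min_file_process_interval = 30
# How often (in seconds) the dags folder is scanned for new files.
dag_dir_list_interval = 300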
Airflow Scheduler Parameters:
- data_interval_start: A DateTime object that marks the start date and time of the data interval. It is set automatically by Airflow, or by the user when creating a custom timetable, and is returned by the DAG’s timetable for every DAG run.
- data_interval_end: A DateTime object that marks the end date and time of the data interval. Like data_interval_start, it is set automatically by Airflow or by a custom timetable, and is returned by the DAG’s timetable for every DAG run.
- start_date: The date on which our first DAG run is executed; this is a required parameter for the Airflow Scheduler. DAG runs are based on the minimum start_date across all tasks, and from that point the scheduler creates new DAG runs according to the schedule_interval.
- end_date: An optional parameter; the last date on which our DAG will be executed.
- schedule_interval: The repeat frequency of our DAG runs, defined either as a cron expression (a “str”) or as a “datetime.timedelta” object. A sketch showing how these parameters fit together follows this list.
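Here is a hedged sketch of a DAG declaration that combines these parameters. The dag_id, dates, and task are hypothetical, and the template variables shown assume Airflow 2.2 or later, where the data interval is exposed to tasks.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="scheduling_params_demo",           # hypothetical name
    start_date=datetime(2023, 1, 1),           # required: first data interval starts here
    end_date=datetime(2023, 12, 31),           # optional: no runs are created after this date
    schedule_interval=timedelta(hours=1),      # a cron string such as "0 * * * *" also works
) as dag:
    # The timetable returns data_interval_start/end for each run; they are
    # available to tasks as Jinja template variables.
    show_interval = BashOperator(
        task_id="show_interval",
        bash_command="echo interval: {{ data_interval_start }} to {{ data_interval_end }}",
    )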

Airflow Scheduler Terms:
It is important to be aware of the scheduling concepts below before working in depth with the Airflow Scheduler, so the key terms are described here:
- Data Interval: The data interval is a property introduced in Airflow 2.2 that represents the period of data each task should operate on. For example, for a DAG scheduled on an @hourly basis, each data interval begins at the top of the hour (minute 0) and ends at the close of the hour (minute 59).
data_interval_start = Start date of the data interval = execution date.
data_interval_end = End date of the data interval.
- Logical Date: There is no real difference between the logical date and the start of the data interval (data_interval_start). It is essentially a replacement for the original “execution_date”, which could become confusing when multiple DAG runs were executed.
- Run After: Run after is the earliest time at which a DAG run can be scheduled. Depending on the DAG’s timetable, this date can be the same as the end of the data interval shown in the Airflow UI.
- Cron Presets: Every DAG run in the Airflow Scheduler has a repeat frequency, defined as a cron expression (a “str”) or a “datetime.timedelta” object. The interval can also be specified as a cron preset, and either form can be passed to the schedule_interval parameter to schedule our DAG runs. Some of the presets are None, @once, @hourly, @daily, @weekly, @monthly, and @yearly.
- Timetable: The Airflow timetable describes all of the schedule intervals of our DAG, and also defines what happens when the DAG is triggered manually or by the scheduler. Timetables are essentially “plugins” used by both the web server and the Airflow Scheduler.
- Catchup: By default, the Airflow Scheduler examines the lifetime of the DAG (from its start date to its end date or to now, one interval at a time) and executes a DAG run for any interval that has not been run (or has been cleared). This behavior is called catchup.
- Backfill: Backfill is the scheduler feature that lets us manually re-run DAGs on historical schedules. It executes DAG runs for all intervals scheduled between the start_date and end_date of the desired historical time frame, irrespective of the value of the catchup parameter in airflow.cfg. A sketch combining a cron preset, catchup, and a backfill command follows this list.
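Putting a few of these terms together, the sketch below uses a cron preset and disables catchup; the command after it backfills a historical window. The dag_id and dates are hypothetical, and the CLI flags assume the Airflow 2 command-line interface.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="catchup_demo",              # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",         # cron preset; equivalent to "0 0 * * *"
    catchup=False,                      # do not automatically run missed past intervals
) as dag:
    BashOperator(task_id="hello", bash_command="echo hello")

To manually backfill a historical window, regardless of the catchup setting:

airflow dags backfill --start-date 2023-01-01 --end-date 2023-01-07 catchup_demo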
Working of the Airflow Scheduler:
- Whenever we start the Airflow Scheduler service, its first job is to check the “dags” folder and instantiate all DAG objects in the metadata database. The metadata database stores configuration such as variables and connections, user information, roles, and policies, and acts as the source of truth for all metadata (DAGs, schedule intervals, statistics from each run, and tasks).
- After checking the dags folder, the scheduler parses each DAG file and, based on the scheduling parameters, creates the necessary DAG runs.
- A TaskInstance is created for each task in the DAG that has to be completed. In the metadata database, these TaskInstances are set to the “Scheduled” state.
- The primary scheduler then searches the database for all tasks in the “Scheduled” state and passes them to the executor, changing their state to “Queued”.
- Workers pick up the queued tasks and start performing them, depending on the execution configuration. When a task is pulled from the queue and execution begins, its state changes from “Queued” to “Running”.
- Finally, once a task finishes its work, its status changes to a final state (for example, success or failed), and this update is reflected in the Airflow Scheduler.
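To observe these state transitions, the Airflow 2 CLI provides commands for listing runs and inspecting task states. The dag_id below is the hypothetical one from the earlier sketch, and <run_id> is a placeholder for an actual run identifier.

airflow dags list-runs -d catchup_demo
airflow tasks states-for-dag-run catchup_demo <run_id>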