Pre-requisite: Airflow
When working on large projects in a team, we need a workflow management tool that can keep track of many activities without descending into chaos. Airflow is one such workflow management tool.
Airflow is an open-source platform and workflow engine that schedules, monitors, and runs batch-oriented workflows, including complex data pipelines. In other words, it is a workflow orchestration tool commonly used in data transformation pipelines. Workflows are written in Python, which makes scheduling easy to express.
To understand Airflow scheduling, we should first understand the Airflow DAG. Let’s discuss that in detail.
Airflow DAG:
DAG stands for Directed Acyclic Graph. In Airflow, a DAG is a data pipeline defined in Python. In this graph, each node represents a task (which is responsible for completing a unit of work) and each edge represents a dependency between tasks, so a DAG is, overall, a collection of tasks organized so that their relationships and dependencies (as shown in the Airflow UI) are reflected. This model provides a simple way to execute a pipeline by dividing it into discrete, incremental tasks rather than depending on a single unit to perform all the work. Now let’s look at the properties that give a DAG its name; a minimal example follows the list below.
- Directed: If a DAG has multiple tasks, each task must have at least one defined upstream or downstream task.
- Acyclic: Tasks cannot depend on themselves, directly or through a chain of other tasks, so the graph can never produce an infinite loop.
- Graph: All tasks can be visualized in a graph structure, where each node represents a task (a unit of work) and each edge represents a dependency between tasks.
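As a minimal sketch, the Python code below defines a DAG with three tasks (nodes) chained by dependencies (edges). The dag_id, task ids, start date, and commands are illustrative values, not taken from this article.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal, illustrative DAG: three tasks (nodes) chained by dependencies (edges).
with DAG(
    dag_id="example_pipeline",            # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Edges: extract must finish before transform, which must finish before load.
    extract >> transform >> load

The >> operator is Airflow’s shorthand for declaring that the task on the left is upstream of the task on the right.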
Airflow Scheduler: Scheduling Concepts and Terminology
In simple words, scheduling is the process of assigning resources to perform tasks. A scheduler decides when and where each task will run. The Airflow scheduler keeps track of all tasks and DAGs and triggers tasks whose dependencies have been met.
In order to start the Airflow Scheduler service, all we need is one simple command:
airflow scheduler
Let’s understand what this command does. It starts the Airflow scheduler using the scheduler configuration specified in airflow.cfg. Once the scheduler is running, our DAGs start executing automatically based on start_date, schedule_interval, and end_date. All of these parameters and scheduling concepts are discussed in detail below.
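For reference, the scheduler-related settings live under the [scheduler] section of airflow.cfg. The snippet below is only a rough illustration; the values shown are the usual defaults, not recommendations from this article.

[scheduler]
# How often (in seconds) the scheduler records a heartbeat to show it is alive.
scheduler_heartbeat_sec = 5
# Minimum number of seconds between re-parses of the same DAG file.
min_file_process_interval = 30
# How often (in seconds) the dags folder is scanned for new files.
dag_dir_list_interval = 300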
Airflow Scheduler Parameters:
- data_interval_start: A DateTime object that marks the start date and time of the data interval. It is set automatically by Airflow, or by the user when creating a custom timetable, and is returned by the DAG’s timetable for every DAG run.
- data_interval_end: A DateTime object that marks the end date and time of the data interval. Like data_interval_start, it is set automatically by Airflow or by a custom timetable, and is returned by the DAG’s timetable for every DAG run.
- start_date: The date on which our first DAG run is executed; this is a required parameter for the Airflow Scheduler. DAG runs are based on the minimum start_date across all tasks, and from that point the scheduler creates new DAG runs according to the schedule_interval.
- end_date: An optional parameter; the last date on which our DAG will be executed.
- schedule_interval: The repeat frequency of our DAG runs, defined either as a cron expression (a “str”) or as a “datetime.timedelta” object. A sketch showing how these parameters fit together follows this list.
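Here is a hedged sketch of a DAG declaration that combines these parameters. The dag_id, dates, and task are hypothetical, and the template variables shown assume Airflow 2.2 or later, where the data interval is exposed to tasks.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="scheduling_params_demo",           # hypothetical name
    start_date=datetime(2023, 1, 1),           # required: first data interval starts here
    end_date=datetime(2023, 12, 31),           # optional: no runs are created after this date
    schedule_interval=timedelta(hours=1),      # a cron string such as "0 * * * *" also works
) as dag:
    # The timetable returns data_interval_start/end for each run; they are
    # available to tasks as Jinja template variables.
    show_interval = BashOperator(
        task_id="show_interval",
        bash_command="echo interval: {{ data_interval_start }} to {{ data_interval_end }}",
    )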

Airflow Scheduler Terms:
It is important to be aware of the scheduling concepts below before working in depth with the Airflow Scheduler, so the key terms are described here:
- Data Interval: The data interval is a property introduced in Airflow 2.2 that represents the period of data each task should operate on. For example, for a DAG scheduled on an @hourly basis, each data interval begins at the top of the hour (minute 0) and ends at the close of the hour (minute 59).
data_interval_start = Start date of the data interval = execution date.
data_interval_end = End date of the data interval.
- Logical Date: There is no real difference between the logical date and the start of the data interval (data_interval_start). It is essentially a replacement for the original “execution_date”, which could become confusing when multiple DAG runs were executed.
- Run After: Run after is the earliest time at which a DAG run can be scheduled. Depending on the DAG’s timetable, this date can be the same as the end of the data interval shown in the Airflow UI.
- Cron Presets: Every DAG run in the Airflow Scheduler has a repeat frequency, defined as a cron expression (a “str”) or a “datetime.timedelta” object. The interval can also be specified as a cron preset, and either form can be passed to the schedule_interval parameter to schedule our DAG runs. Some of the presets are None, @once, @hourly, @daily, @weekly, @monthly, and @yearly.
- Timetable: The Airflow timetable describes all of the schedule intervals of our DAG, and also defines what happens when the DAG is triggered manually or by the scheduler. Timetables are essentially “plugins” used by both the web server and the Airflow Scheduler.
- Catchup: By default, the Airflow Scheduler examines the lifetime of the DAG (from its start date to its end date or to now, one interval at a time) and executes a DAG run for any interval that has not been run (or has been cleared). This behavior is called catchup.
- Backfill: Backfill is the scheduler feature that lets us manually re-run DAGs on historical schedules. It executes DAG runs for all intervals scheduled between the start_date and end_date of the desired historical time frame, irrespective of the value of the catchup parameter in airflow.cfg. A sketch combining a cron preset, catchup, and a backfill command follows this list.
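Putting a few of these terms together, the sketch below uses a cron preset and disables catchup; the command after it backfills a historical window. The dag_id and dates are hypothetical, and the CLI flags assume the Airflow 2 command-line interface.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="catchup_demo",              # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",         # cron preset; equivalent to "0 0 * * *"
    catchup=False,                      # do not automatically run missed past intervals
) as dag:
    BashOperator(task_id="hello", bash_command="echo hello")

To manually backfill a historical window, regardless of the catchup setting:

airflow dags backfill --start-date 2023-01-01 --end-date 2023-01-07 catchup_demo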
Working of the Airflow Scheduler:
- Whenever we start the Airflow Scheduler service, its first job is to check the “dags” folder and instantiate all DAG objects in the metadata database. The metadata database stores configuration such as variables and connections, user information, roles, and policies, and acts as the source of truth for all metadata (DAGs, schedule intervals, statistics from each run, and tasks).
- After checking the dags folder, the scheduler parses each DAG file and, based on the scheduling parameters, creates the necessary DAG runs.
- A TaskInstance is created for each task in the DAG that has to be completed. In the metadata database, these TaskInstances are set to the “Scheduled” state.
- The primary scheduler then searches the database for all tasks in the “Scheduled” state and passes them to the executor, changing their state to “Queued”.
- Workers pick up the queued tasks and start performing them, depending on the execution configuration. When a task is pulled from the queue and execution begins, its state changes from “Queued” to “Running”.
- Finally, once a task finishes its work, its status changes to a final state (for example, success or failed), and this update is reflected in the Airflow Scheduler.
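To observe these state transitions, the Airflow 2 CLI provides commands for listing runs and inspecting task states. The dag_id below is the hypothetical one from the earlier sketch, and <run_id> is a placeholder for an actual run identifier.

airflow dags list-runs -d catchup_demo
airflow tasks states-for-dag-run catchup_demo <run_id>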