2 - Apache Airflow

This document discusses Apache Airflow, a platform for authoring, scheduling, and monitoring workflows or data pipelines. It introduces data pipelines as directed acyclic graphs (DAGs) of tasks with dependencies. Apache Airflow allows defining and executing such DAGs by handling dependencies and ensuring tasks run in the correct order. The document also discusses setting up an Airflow environment using Docker to manage dependencies and facilitate sharing development and production environments.


Apache Airflow
https://www.youtube.com/watch?v=AHMm1wfGuHE
file:///E:/profWork/dataEngineer_course/file/literation/Bas_P_Harenslak,_Julian_Rutger_de_Ruiter_Data_Pipelines_with_Apache.pdf
A. Getting Started
   I. Meet Apache Airflow
   II. Anatomy of an Airflow DAG

- Objectives
  1. Showing how data pipelines can be represented in workflows as graphs of tasks
  2. Understanding how Airflow fits into the ecosystem of workflow managers
  3. Determining if Airflow is a good fit for you

(1.1) Introducing data pipelines

- Data pipelines generally consist of several tasks or actions that need to be executed to achieve the desired result.
- For example, say we want to build a small weather dashboard that tells us what the weather will be like in the coming week. To implement this live weather dashboard, we need to perform something like the following steps:
  1) Fetch weather forecast data from a weather API
  2) Clean or otherwise transform the fetched data (e.g. converting temperatures from Fahrenheit to Celsius or vice versa), so that the data suits our purpose
  3) Push the transformed data to the weather dashboard

- As you can see, this relatively simple pipeline already consists of three different tasks that each perform part of the work.
- Moreover, these tasks need to be executed in a specific order, as it (for example) doesn't make sense to try transforming the data before fetching it. Similarly, we can't push any new data to the dashboard until it has undergone the required transformations. As such, we need to make sure that this implicit task order is also enforced when running this data process.
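
- As a concrete illustration (not from the book; names and URLs are made up), the three steps above could be written as a plain Python script in which the task order is only implicit in the order of the function calls:

import requests  # assuming the weather API is reached over HTTP


def fetch_forecast():
    # 1) Fetch weather forecast data from a (hypothetical) weather API
    response = requests.get("https://api.example.com/forecast?city=Amsterdam")
    response.raise_for_status()
    return response.json()


def clean_forecast(raw):
    # 2) Transform the fetched data, e.g. convert Fahrenheit to Celsius
    return [
        {"day": item["day"], "temp_c": (item["temp_f"] - 32) * 5 / 9}
        for item in raw["days"]
    ]


def push_to_dashboard(rows):
    # 3) Push the transformed data to the (hypothetical) dashboard service
    requests.post("https://dashboard.example.com/api/weather", json=rows)


# The ordering is only implicit: each call simply happens after the previous one.
raw_data = fetch_forecast()
clean_data = clean_forecast(raw_data)
push_to_dashboard(clean_data)

- Nothing in such a script enforces the ordering beyond the way it happens to be written, which is exactly the implicit dependency that the graph representation below makes explicit.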

(1.1.1) Data pipelines as graphs

- One way to make dependencies between tasks more explicit is to draw the data pipeline as a graph.
- In this graph-based representation, tasks are represented as nodes in the graph, while dependencies between tasks are represented by directed edges between the task nodes.
- The direction of the edge indicates the direction of the dependency, with an edge pointing from task A to task B indicating that task A needs to be completed before task B can start.
- Note that this type of graph is generally called a directed graph, due to the directions in the graph edges.

- This type of graph is typically called a directed acyclic graph (DAG), as the graph contains directed edges and does not contain any loops or cycles (acyclic). This acyclic property is extremely important, as it prevents us from running into circular dependencies between tasks (where task A depends on task B and vice versa). These circular dependencies become problematic when trying to execute the graph, as we run into a situation where task A can only execute once task B has been completed, while task B can only execute once task A has been completed. This logical inconsistency leads to a deadlock type of situation, in which neither task A nor task B can run, preventing us from executing the graph.

- Note that this acyclic representation differs from cyclic graph representations, which can contain cycles to illustrate iterative parts of algorithms (for example), as are common in many machine learning applications. However, the acyclic property of DAGs is used by Airflow (and many other workflow managers) to efficiently resolve and execute these graphs of tasks.
- The DAG will make sure that operators run in the correct order.
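
- To make this concrete, below is a minimal sketch (not one of the book's listings; the DAG name and callables are made up, and Airflow 2.x import paths are assumed) of the weather pipeline as an Airflow DAG, where the fetch -> clean -> push order is written down as explicit dependencies:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


# Placeholder callables; in a real pipeline these would call the weather API,
# transform the data, and update the dashboard (passing data via XCom, files,
# or a database rather than plain return values).
def fetch_forecast():
    print("fetching forecast from the weather API")


def clean_forecast():
    print("converting temperatures to Celsius")


def push_to_dashboard():
    print("pushing cleaned data to the dashboard")


with DAG(
    dag_id="weather_dashboard",      # made-up DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,          # no schedule; triggered manually in this sketch
) as dag:
    fetch = PythonOperator(task_id="fetch_forecast", python_callable=fetch_forecast)
    clean = PythonOperator(task_id="clean_forecast", python_callable=clean_forecast)
    push = PythonOperator(task_id="push_to_dashboard", python_callable=push_to_dashboard)

    # The directed edges of the DAG: fetch -> clean -> push.
    # Airflow resolves these dependencies and only starts a task once
    # everything upstream of it has completed successfully.
    fetch >> clean >> push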

(1.1.2) Executing a pipeline graph

- A nice property of this DAG representation is that it provides a relatively straightforward algorithm that we can use for running the pipeline. Conceptually, this algorithm consists of the following steps: (checkpoint: page 6)
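
- The book's exact wording is not copied into these notes, but conceptually the loop is: find the open tasks whose upstream dependencies have all completed, execute them and mark them as done, and repeat until every task has run. A toy pure-Python sketch of that idea (an illustration only, not Airflow's actual scheduler code):

# Toy DAG: each task maps to the set of tasks it depends on (its upstream tasks).
dag = {
    "fetch": set(),
    "clean": {"fetch"},
    "push": {"clean"},
}


def run_task(name):
    print(f"running {name}")  # stand-in for actually executing the task


def execute_dag(dag):
    done = set()
    while len(done) < len(dag):
        # 1) Find open tasks whose upstream dependencies have all completed.
        runnable = [t for t, deps in dag.items() if t not in done and deps <= done]
        if not runnable:
            # With a cycle (task A depends on B and B on A), no task ever
            # becomes runnable: the deadlock described earlier.
            raise RuntimeError("cycle detected, cannot make progress")
        # 2) Execute the runnable tasks and mark them as completed.
        for task in runnable:
            run_task(task)
            done.add(task)
        # 3) Repeat until all tasks in the graph have completed.


execute_dag(dag)

- In this toy version the tasks run one after another; a workflow manager like Airflow additionally runs independent tasks in parallel across workers.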

(1.1.3) Pipeline graph vs. sequential scripts

(1.1.4) Running pipeline using workflow managers

   Scheduling in Airflow
   Templating tasks using the Airflow context
   Defining dependencies between tasks
B. Beyond The Basics
   Triggering workflows
   Communicating with external systems
   Building custom components
   Testing
   Running tasks in containers
C. Airflow in Practice
   Best practices
   Operating Airflow in production
   Securing Airflow
   Project: Finding the fastest way to get around NYC
D. In the clouds
   Airflow in the clouds
   Airflow on AWS
   Airflow on Azure
   Airflow in GCP

Introduction to Apache Airflow

- Introduction
  a. Airflow is a platform to programmatically author, schedule and monitor workflows or data pipelines
  b. A workflow is a sequence of tasks
  c. Started on a schedule or triggered by an event
  d. Frequently used to handle big data processing pipelines
- A typical workflow
  a. Download data from source
  b. Send data somewhere else to process
  c. Monitor when the process is completed
  d. Get the result and generate the report
  e. Send the report out by email
- A traditional ETL approach
  Database -> CRON scripts -> HDFS
  a. Writing a script to pull data from a database and send it to HDFS (Hadoop Distributed File System) to process
  b. Schedule the script as a cron job
- Problems
  a. Failures
     Retry if failure happens (how many times? how often?)
  b. Monitoring
     Success or failure status, how long does the process run?
  c. Dependencies
     1) Data dependencies: upstream data is missing
     2) Execution dependencies: job 2 runs after job 1 is finished
  d. Scalability
     There is no centralized scheduler between different cron machines
  e. Deployment
     Deploy new changes constantly
  f. Process historic data
     Backfill / rerun historical data
- Apache Airflow
  a. The project joined the Apache Software Foundation's incubation program in 2016
  b. A workflow (data-pipeline) management system developed by Airbnb
  c. Used by more than 200 companies: Airbnb, Yahoo, PayPal, Intel, Stripe
  d. Airflow DAG
     a) A workflow as a Directed Acyclic Graph (DAG) with multiple tasks which can be executed independently
     b) Airflow DAGs are composed of Tasks
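
- The cron-era problems listed above (retries, monitoring, dependencies, backfilling) map directly onto configuration that Airflow exposes on every DAG and task. Below is a hedged sketch of the "typical workflow" as a scheduled DAG with retry and alerting settings; the task names, owner, and email address are invented for illustration, the commands are placeholders, and Airflow 2.x import paths are assumed:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

# Default settings applied to every task; these address the "Problems" list:
# retries on failure, alerting when something goes wrong, etc.
default_args = {
    "owner": "data-team",                 # illustrative owner name
    "retries": 3,                         # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
    "email": ["alerts@example.com"],      # hypothetical alert address
    "email_on_failure": True,
}

with DAG(
    dag_id="typical_report_workflow",     # made-up DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",           # scheduled run instead of a cron entry
    default_args=default_args,
    catchup=False,                        # set to True to backfill historical runs
) as dag:
    # Each step of the "typical workflow" becomes a task; the bash commands
    # are placeholders standing in for real scripts.
    download_data = BashOperator(
        task_id="download_data",
        bash_command="echo 'download data from source'",
    )
    process_data = BashOperator(
        task_id="process_data",
        bash_command="echo 'send data somewhere else to process'",
    )
    generate_report = BashOperator(
        task_id="generate_report",
        bash_command="echo 'get the result and generate the report'",
    )
    send_report = BashOperator(
        task_id="send_report",
        bash_command="echo 'send the report out by email'",
    )

    # Execution dependencies: each task runs only after the previous one finishes.
    download_data >> process_data >> generate_report >> send_report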
Set up airflow environment with docker

- Airflow problem?
  1. Open source software grows at an overwhelming pace
  2. Airflow is built to integrate with all databases, systems, cloud environments
  3. Managing and maintaining all of the dependency changes will be really difficult
  4. Takes lots of time to set up and configure the Airflow environment
  5. How to share development and production environments for all developers
- To see the Apache Airflow documentation: https://airflow.apache.org/
- GitHub repo (Tuan Vu): https://github.com/tuanavu/airflow-tutorial
- Docker overview
  1. Docker is an open platform for developing, shipping, and running applications
  2. Docker provides the ability to package and run an application in a loosely isolated environment called a container
  3. The isolation and security allow you to run many containers simultaneously on a given host, regardless of its operating system: Mac, Windows, PC, cloud, data center
  4. Containers are lightweight because they don't need the extra load of a hypervisor, but run directly within the host machine's kernel. This means you can run more containers on a given hardware combination than if you were using virtual machines
  5. Benefits of using Docker
     a. Docker frees us from the task of managing and maintaining all of the Airflow dependencies and deployment
     b. Easy to share and deploy different versions and environments
     c. Keep track through GitHub with tags and releases
     d. Ease of deployment from the testing to the production environment
- Getting started
  These instructions will get you a copy of the project up and running on your local machine for development and testing purposes
  1. Clone the repo
  2. Install the prerequisites
     a) Install Git
     b) Install Docker
     c) Install Docker Compose
     d) Follow the Airflow release from the Python Package Index
  3. Run the service
  4. Check localhost:8080
  5. Done

- Practice
  1. Clone the repo: https://github.com/tuanavu/airflow-tutorial
