
Airflow Documentation, Release 1.10.2

3.10.2.29 clear

Clear a set of task instances, as if they never ran

airflow clear [-h] [-t TASK_REGEX] [-s START_DATE] [-e END_DATE] [-sd SUBDIR]
[-u] [-d] [-c] [-f] [-r] [-x] [-xp] [-dx]
dag_id

Positional Arguments

dag_id The id of the dag

Named Arguments

-t, --task_regex The regex to filter specific task_ids to clear (optional)


-s, --start_date Override start_date YYYY-MM-DD
-e, --end_date Override end_date YYYY-MM-DD
-sd, --subdir File location or directory from which to look for the DAG. Defaults to '[AIRFLOW_HOME]/dags', where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' in 'airflow.cfg'
Default: "[AIRFLOW_HOME]/dags"
-u, --upstream Include upstream tasks
Default: False
-d, --downstream Include downstream tasks
Default: False
-c, --no_confirm Do not request confirmation
Default: False
-f, --only_failed Only failed jobs
Default: False
-r, --only_running Only running jobs
Default: False
-x, --exclude_subdags Exclude subdags
Default: False
-xp, --exclude_parentdag Exclude ParentDAGs if the task cleared is part of a SubDAG
Default: False
-dx, --dag_regex Search dag_id as regex instead of exact string
Default: False


3.10.2.30 list_users

List accounts for the Web UI

airflow list_users [-h]

3.10.2.31 next_execution

Get the next execution datetime of a DAG.

airflow next_execution [-h] [-sd SUBDIR] dag_id

Positional Arguments

dag_id The id of the dag

Named Arguments

-sd, --subdir File location or directory from which to look for the DAG. Defaults to '[AIRFLOW_HOME]/dags', where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' in 'airflow.cfg'
Default: "[AIRFLOW_HOME]/dags"

3.10.2.32 upgradedb

Upgrade the metadata database to the latest version

airflow upgradedb [-h]

3.10.2.33 delete_dag

Delete all DB records related to the specified DAG

airflow delete_dag [-h] [-y] dag_id

Positional Arguments

dag_id The id of the dag

Named Arguments

-y, --yes Do not prompt to confirm reset. Use with care!


Default: False


3.11 Scheduling & Triggers

The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been
met. Behind the scenes, it spins up a subprocess, which monitors and stays in sync with a folder for all DAG objects
it may contain, and periodically (every minute or so) collects DAG parsing results and inspects active tasks to see
whether they can be triggered.
The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To kick it off,
all you need to do is execute airflow scheduler. It will use the configuration specified in airflow.cfg.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's repeat that: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
The scheduler starts an instance of the executor specified in your airflow.cfg. If it happens to be the LocalExecutor, tasks will be executed as subprocesses; in the case of CeleryExecutor and MesosExecutor, tasks are executed remotely.
To start a scheduler, simply run the command:

airflow scheduler

3.11.1 DAG Runs

A DAG Run is an object representing an instantiation of the DAG in time.


Each DAG may or may not have a schedule, which informs how DAG Runs are created. schedule_interval is defined as a DAG argument, and accepts, preferably, a cron expression as a str or a datetime.timedelta object. Alternatively, you can also use one of these cron "presets":

preset    meaning                                                        cron
None      Don't schedule, use exclusively for "externally triggered" DAGs
@once     Schedule once and only once
@hourly   Run once an hour at the beginning of the hour                  0 * * * *
@daily    Run once a day at midnight                                     0 0 * * *
@weekly   Run once a week at midnight on Sunday morning                  0 0 * * 0
@monthly  Run once a month at midnight of the first day of the month     0 0 1 * *
@yearly   Run once a year at midnight of January 1                       0 0 1 1 *

Note: Use schedule_interval=None and not schedule_interval='None' when you don’t want to
schedule your DAG.
Your DAG will be instantiated for each schedule, and a DAG Run entry will be created for each schedule.
DAG Runs have a state associated with them (running, failed, success), which informs the scheduler which set of schedules should be evaluated for task submissions. Without this metadata at the DAG Run level, the Airflow scheduler would have much more work to do to figure out which tasks should be triggered, and would slow to a crawl. It might also create undesired processing when you change the shape of your DAG, by, say, adding new tasks.
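As an illustration, here is a minimal sketch (the DAG ids and start_date are chosen for the example) showing three equivalent ways to express the schedule_interval described above:

from datetime import datetime, timedelta
from airflow import DAG

# A daily schedule expressed as a preset, a cron string, and a timedelta.
preset_dag = DAG('daily_preset', start_date=datetime(2016, 1, 1),
                 schedule_interval='@daily')
cron_dag = DAG('daily_cron', start_date=datetime(2016, 1, 1),
               schedule_interval='0 0 * * *')
delta_dag = DAG('daily_delta', start_date=datetime(2016, 1, 1),
                schedule_interval=timedelta(days=1))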

3.11.2 Backfill and Catchup

An Airflow DAG with a start_date, possibly an end_date, and a schedule_interval defines a series of intervals which the scheduler turns into individual DAG Runs and executes. A key capability of Airflow is that these DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine the lifetime of the DAG (from start to end/now, one interval at a time) and kick off a DAG Run for any interval that has not been run (or has been cleared). This concept is called Catchup.
If your DAG is written to handle its own catchup (i.e. not limited to the interval, but instead to "now", for instance), then you will want to turn catchup off, either on the DAG itself with dag.catchup = False, or by default at the configuration file level with catchup_by_default = False. This instructs the scheduler to create a DAG Run only for the most current instance of the DAG interval series.

"""
Code that goes along with the Airflow tutorial located at:
https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2015, 12, 1),
'email': ['[email protected]'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'schedule_interval': '@hourly',
}

dag = DAG('tutorial', catchup=False, default_args=default_args)

In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM (or from the command line), a single DAG Run will be created with an execution_date of 2016-01-01, and the next one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02.
If the dag.catchup value had been True instead, the scheduler would have created a DAG Run for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval hasn't completed) and would execute them sequentially. This behavior is great for atomic datasets that can easily be split into periods. Turning catchup off is great if your DAG Runs perform backfill internally.

3.11.3 External Triggers

Note that DAG Runs can also be created manually through the CLI by running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated with the trigger's timestamp, and will be displayed in the UI alongside scheduled DAG Runs.
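For example, a sketch of a manual trigger (the DAG id, run_id, and conf payload are illustrative; the -r/--run_id and -c/--conf flags are listed by airflow trigger_dag -h):

airflow trigger_dag my_dag -r manual__2016_01_01 -c '{"key": "value"}'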

3.11.4 To Keep in Mind

• The first DAG Run is created based on the minimum start_date for the tasks in your DAG.
• Subsequent DAG Runs are created by the scheduler process, based on your DAG’s schedule_interval,
sequentially.
• When clearing a set of tasks’ state in hope of getting them to re-run, it is important to keep in mind the DAG
Run’s state too as it defines whether the scheduler should look into triggering tasks for that run.

3.11. Scheduling & Triggers 115


Airflow Documentation, Release 1.10.2

Here are some of the ways you can unblock tasks:


• From the UI, you can clear (as in delete the status of) individual task instances from the task instances dialog, while defining whether you want to include the past/future and the upstream/downstream dependencies. A confirmation window then lets you review the set you are about to clear. You can also clear all task instances associated with the DAG.
• The CLI command airflow clear -h has lots of options when it comes to clearing task instance states, including specifying date ranges, targeting task_ids with a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (failed or success); see the example after this list.
• Clearing a task instance no longer deletes the task instance record. Instead it updates max_tries and sets the current task instance state to None.
• Marking task instances as failed can be done through the UI. This can be used to stop running task instances.
• Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or, for instance, when the fix has been applied outside of Airflow.
• The airflow backfill CLI subcommand has a --mark_success flag and allows selecting subsections of the DAG as well as specifying date ranges.
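A sketch of such a clear invocation, using flags documented in the clear command reference above (the DAG id, task regex, and dates are illustrative):

airflow clear my_dag -t "transform_.*" -s 2016-01-01 -e 2016-01-07 --only_failed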

3.12 Plugins

Airflow has a simple built-in plugin manager that can integrate external features into its core by simply dropping files in your $AIRFLOW_HOME/plugins folder.
The Python modules in the plugins folder get imported, and hooks, operators, sensors, macros, executors and web views get integrated into Airflow's main collections and become available for use.

3.12.1 What for?

Airflow offers a generic toolbox for working with data. Different organizations have different stacks and different
needs. Using Airflow plugins can be a way for companies to customize their Airflow installation to reflect their
ecosystem.
Plugins can be used as an easy way to write, share and activate new sets of features.
There’s also a need for a set of more complex applications to interact with different flavors of data and metadata.
Examples:
• A set of tools to parse Hive logs and expose Hive metadata (CPU / IO / phases / skew / ...)
• An anomaly detection framework, allowing people to collect metrics, set thresholds and alerts
• An auditing tool, helping understand who accesses what
• A config-driven SLA monitoring tool, allowing you to set monitored tables and at what time they should land,
alert people, and expose visualizations of outages
• ...

3.12.2 Why build on top of Airflow?

Airflow has many components that can be reused when building an application:
• A web server you can use to render your views


• A metadata database to store your models


• Access to your databases, and knowledge of how to connect to them
• An array of workers that your application can push workload to
• Airflow is already deployed; you can piggyback on its deployment logistics
• Basic charting capabilities, underlying libraries and abstractions

3.12.3 Interface

To create a plugin you will need to derive the airflow.plugins_manager.AirflowPlugin class and refer-
ence the objects you want to plug into Airflow. Here’s what the class you need to derive looks like:

class AirflowPlugin(object):
    # The name of your plugin (str)
    name = None
    # A list of class(es) derived from BaseOperator
    operators = []
    # A list of class(es) derived from BaseSensorOperator
    sensors = []
    # A list of class(es) derived from BaseHook
    hooks = []
    # A list of class(es) derived from BaseExecutor
    executors = []
    # A list of references to inject into the macros namespace
    macros = []
    # A list of objects created from a class derived
    # from flask_admin.BaseView
    admin_views = []
    # A list of Blueprint object created from flask.Blueprint.
    # For use with the flask_admin based GUI
    flask_blueprints = []
    # A list of menu links (flask_admin.base.MenuLink).
    # For use with the flask_admin based GUI
    menu_links = []
    # A list of dictionaries containing FlaskAppBuilder BaseView object
    # and some metadata. See example below
    appbuilder_views = []
    # A list of dictionaries containing FlaskAppBuilder BaseView object
    # and some metadata. See example below
    appbuilder_menu_items = []

3.12.4 Example

The code below defines a plugin that injects a set of dummy object definitions in Airflow.

# This is the class you derive to create a plugin
from airflow.plugins_manager import AirflowPlugin

from flask import Blueprint

from flask_admin import BaseView, expose
from flask_admin.base import MenuLink

# Importing base classes that we need to derive
from airflow.hooks.base_hook import BaseHook
from airflow.models import BaseOperator
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.executors.base_executor import BaseExecutor

# Imports needed for the appbuilder view below (implied by the original
# example): FlaskAppBuilder ships its own BaseView and expose decorator.
from flask_appbuilder import BaseView as AppBuilderBaseView
from flask_appbuilder import expose as appbuilder_expose


# Will show up under airflow.hooks.test_plugin.PluginHook
class PluginHook(BaseHook):
    pass


# Will show up under airflow.operators.test_plugin.PluginOperator
class PluginOperator(BaseOperator):
    pass


# Will show up under airflow.sensors.test_plugin.PluginSensorOperator
class PluginSensorOperator(BaseSensorOperator):
    pass


# Will show up under airflow.executors.test_plugin.PluginExecutor
class PluginExecutor(BaseExecutor):
    pass


# Will show up under airflow.macros.test_plugin.plugin_macro
def plugin_macro():
    pass


# Creating a flask admin BaseView
class TestView(BaseView):
    @expose('/')
    def test(self):
        # in this example, put your test_plugin/test.html template at
        # airflow/plugins/templates/test_plugin/test.html
        return self.render("test_plugin/test.html", content="Hello galaxy!")


v = TestView(category="Test Plugin", name="Test View")

# Creating a flask blueprint to integrate the templates and static folder
bp = Blueprint(
    "test_plugin", __name__,
    # registers airflow/plugins/templates as a Jinja template folder
    template_folder='templates',
    static_folder='static',
    static_url_path='/static/test_plugin')

ml = MenuLink(
    category='Test Plugin',
    name='Test Menu Link',
    url='https://airflow.apache.org/')


# Creating a flask appbuilder BaseView
class TestAppBuilderBaseView(AppBuilderBaseView):
    # the landing view for this FAB view (FAB's default is 'list')
    default_view = "test"

    @appbuilder_expose("/")
    def test(self):
        # FAB views render templates via render_template
        return self.render_template("test_plugin/test.html", content="Hello galaxy!")


v_appbuilder_view = TestAppBuilderBaseView()
v_appbuilder_package = {"name": "Test View",
                        "category": "Test Plugin",
                        "view": v_appbuilder_view}

# Creating a flask appbuilder Menu Item
appbuilder_mitem = {"name": "Google",
                    "category": "Search",
                    "category_icon": "fa-th",
                    "href": "https://www.google.com"}


# Defining the plugin class
class AirflowTestPlugin(AirflowPlugin):
    name = "test_plugin"
    operators = [PluginOperator]
    sensors = [PluginSensorOperator]
    hooks = [PluginHook]
    executors = [PluginExecutor]
    macros = [plugin_macro]
    admin_views = [v]
    flask_blueprints = [bp]
    menu_links = [ml]
    appbuilder_views = [v_appbuilder_package]
    appbuilder_menu_items = [appbuilder_mitem]

3.12.5 Note on role based views

Airflow 1.10 introduced role-based views using FlaskAppBuilder. You can configure which UI is used by setting rbac = True in the [webserver] section of your airflow.cfg. To support plugin views and links for both versions of the UI and maintain backwards compatibility, the fields appbuilder_views and appbuilder_menu_items were added to the AirflowTestPlugin class.

3.12.6 Plugins as Python packages

It is possible to load plugins via the setuptools entrypoint mechanism (https://packaging.python.org/guides/creating-and-discovering-plugins/#using-package-metadata). To do this, link your plugin using an entrypoint in your package. If the package is installed, Airflow will automatically load the registered plugins from the entrypoint list.
Note: Neither the entrypoint name (e.g. my_plugin) nor the name of the plugin class contributes to the module and class name of the plugin itself. The structure is determined by airflow.plugins_manager.AirflowPlugin.name and the class name of the plugin component, following the pattern airflow.{component}.{name}.{component_class_name}.

# my_package/my_plugin.py
from airflow.plugins_manager import AirflowPlugin
from airflow.models import BaseOperator
from airflow.hooks.base_hook import BaseHook


class MyOperator(BaseOperator):
    pass


class MyHook(BaseHook):
    pass


class MyAirflowPlugin(AirflowPlugin):
    name = 'my_namespace'
    operators = [MyOperator]
    hooks = [MyHook]


from setuptools import setup

setup(
    name="my-package",
    ...
    entry_points={
        'airflow.plugins': [
            'my_plugin = my_package.my_plugin:MyAirflowPlugin'
        ]
    }
)

This will create a hook and an operator, accessible at:


• airflow.hooks.my_namespace.MyHook
• airflow.operators.my_namespace.MyOperator
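A minimal sketch of importing the plugin-provided operator at the path listed above, once the package is installed:

from airflow.operators.my_namespace import MyOperator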

3.13 Security

By default, all gates are open. An easy way to restrict access to the web application is to do it at the network level, or by using SSH tunnels.
It is however possible to switch on authentication by either using one of the supplied backends or creating your own.
Be sure to check out Experimental Rest API for securing the API.

Note: Airflow uses the config parser of Python. This config parser interpolates '%' signs. Make sure to escape any % signs in your config file (but not environment variables) as %%, otherwise Airflow might leak these passwords in a config parser exception to a log.

3.13.1 Web Authentication

3.13.1.1 Password

Note: This is for flask-admin based web UI only. If you are using FAB-based web UI with RBAC feature, please use
command line interface create_user to create accounts, or do that in the FAB-based UI itself.

One of the simplest mechanisms for authentication is requiring users to specify a password before logging in. Password authentication requires the use of the password subpackage in your requirements file. Passwords are hashed with bcrypt before being stored.

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth

When password auth is enabled, an initial user credential will need to be created before anyone can log in. An initial user was not created in the migrations for this authentication backend, to prevent default Airflow installations from attack. Creating a new user has to be done via a Python REPL on the same machine where Airflow is installed.


# navigate to the airflow installation directory


$ cd ~/airflow
$ python
Python 2.7.9 (default, Feb 10 2015, 03:28:08)
Type "help", "copyright", "credits" or "license" for more information.
>>> import airflow
>>> from airflow import models, settings
>>> from airflow.contrib.auth.backends.password_auth import PasswordUser
>>> user = PasswordUser(models.User())
>>> user.username = 'new_user_name'
>>> user.email = 'new_user_email@example.com'
>>> user.password = 'set_the_password'
>>> session = settings.Session()
>>> session.add(user)
>>> session.commit()
>>> session.close()
>>> exit()

3.13.1.2 LDAP

To turn on LDAP authentication configure your airflow.cfg as follows. Please note that the example uses an encrypted connection to the LDAP server, as we do not want passwords to be readable at the network level.
Additionally, if you are using Active Directory, and are not explicitly specifying an OU that your users are in, you will
need to change search_scope to “SUBTREE”.
Valid search_scope options can be found in the ldap3 Documentation

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.ldap_auth

[ldap]
# set a connection without encryption: uri = ldap://<your.ldap.server>:<port>
uri = ldaps://<your.ldap.server>:<port>
user_filter = objectClass=*
# in case of Active Directory you would use: user_name_attr = sAMAccountName
user_name_attr = uid
# group_member_attr should be set accordingly with *_filter
# eg :
# group_member_attr = groupMembership
# superuser_filter = groupMembership=CN=airflow-super-users...
group_member_attr = memberOf
superuser_filter = memberOf=CN=airflow-super-users,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com
data_profiler_filter = memberOf=CN=airflow-data-profilers,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com

bind_user = cn=Manager,dc=example,dc=com
bind_password = insecure
basedn = dc=example,dc=com
cacert = /etc/ca/ldap_ca.crt
# Set search_scope to one of them: BASE, LEVEL, SUBTREE
# Set search_scope to SUBTREE if using Active Directory, and not specifying an Organizational Unit
search_scope = LEVEL

The superuser_filter and data_profiler_filter are optional. If defined, these configurations allow you to specify LDAP groups that users must belong to in order to have superuser (admin) and data-profiler permissions. If undefined, all users will be superusers and data profilers.

3.13.1.3 Roll your own

Airflow uses flask_login and exposes a set of hooks in the airflow.default_login module. You can alter
the content and make it part of the PYTHONPATH and configure it as a backend in airflow.cfg.

[webserver]
authenticate = True
auth_backend = mypackage.auth

3.13.2 Multi-tenancy

You can filter the list of DAGs in the webserver by owner name when authentication is turned on, by setting webserver:filter_by_owner in your config. With this, a user will see only the DAGs that they own, unless they are a superuser.

[webserver]
filter_by_owner = True

3.13.3 Kerberos

Airflow has initial support for Kerberos. This means that Airflow can renew Kerberos tickets for itself and store them in the ticket cache. Hooks and DAGs can make use of the ticket to authenticate against kerberized services.

3.13.3.1 Limitations

Please note that at this time, not all hooks have been adjusted to make use of this functionality. Also, Kerberos is not integrated into the web interface, so you will have to rely on network-level security for now to make sure your service remains secure.
Celery integration has not been tried and tested yet. However, if you generate a keytab for every host and launch a ticket renewer next to every worker, it will most likely work.

3.13.3.2 Enabling kerberos

Airflow

To enable Kerberos you will need to generate a (service) keytab.

# in the kadmin.local or kadmin shell, create the airflow principal


kadmin: addprinc -randkey airflow/[email protected]

# Create the airflow keytab file that will contain the airflow principal
kadmin: xst -norandkey -k airflow.keytab airflow/fully.qualified.domain.name

Now store this file in a location where the airflow user can read it (chmod 600), and then add the following to your airflow.cfg:


[core]
security = kerberos

[kerberos]
keytab = /etc/airflow/airflow.keytab
reinit_frequency = 3600
principal = airflow

Launch the ticket renewer by running:

# run ticket renewer


airflow kerberos

Hadoop

If you want to use impersonation, this needs to be enabled in the core-site.xml of your Hadoop config.

<property>
<name>hadoop.proxyuser.airflow.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.airflow.users</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.airflow.hosts</name>
<value>*</value>
</property>

Of course, if you need to tighten your security, replace the asterisk with something more appropriate.

3.13.3.3 Using kerberos authentication

The hive hook has been updated to take advantage of kerberos authentication. To allow your DAGs to use it, simply
update the connection details with, for example:

{ "use_beeline": true, "principal": "hive/[email protected]"}

Adjust the principal to your settings. The _HOST part will be replaced by the fully qualified domain name of the
server.
You can specify if you would like to use the dag owner as the user for the connection or the user specified in the login
section of the connection. For the login user, specify the following as extra:

{ "use_beeline": true, "principal": "hive/[email protected]", "proxy_user": "login"}

For the DAG owner use:

{ "use_beeline": true, "principal": "hive/[email protected]", "proxy_user": "owner"}

and in your DAG, when initializing the HiveOperator, specify:


run_as_owner=True
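For illustration, a minimal sketch (the task id, query, connection id, and surrounding dag object are assumptions for the example; HiveOperator lives in airflow.operators.hive_operator in this release):

from airflow.operators.hive_operator import HiveOperator

hive_task = HiveOperator(
    task_id='hive_query',
    hql='SELECT COUNT(*) FROM my_table;',
    hive_cli_conn_id='hive_cli_default',
    run_as_owner=True,  # run the query as the DAG owner, as described above
    dag=dag,
)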

To use kerberos authentication, you must install Airflow with the kerberos extras group:

pip install airflow[kerberos]

3.13.4 OAuth Authentication

3.13.4.1 GitHub Enterprise (GHE) Authentication

The GitHub Enterprise authentication backend can be used to authenticate users against an installation of GitHub
Enterprise using OAuth2. You can optionally specify a team whitelist (composed of slug cased team names) to restrict
login to only members of those teams.

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.github_enterprise_auth

[github_enterprise]
host = github.example.com
client_id = oauth_key_from_github_enterprise
client_secret = oauth_secret_from_github_enterprise
oauth_callback_route = /example/ghe_oauth/callback
allowed_teams = 1, 345, 23

Note: If you do not specify a team whitelist, anyone with a valid account on your GHE installation will be able to log in to Airflow.

To use GHE authentication, you must install Airflow with the github_enterprise extras group:

pip install airflow[github_enterprise]

Setting up GHE Authentication

An application must be set up in GHE before you can use the GHE authentication backend. In order to set up an application:
1. Navigate to your GHE profile
2. Select ‘Applications’ from the left hand nav
3. Select the ‘Developer Applications’ tab
4. Click ‘Register new application’
5. Fill in the required information (the 'Authorization callback URL' must be fully qualified e.g. http://airflow.example.com/example/ghe_oauth/callback)
6. Click ‘Register application’
7. Copy ‘Client ID’, ‘Client Secret’, and your callback route to your airflow.cfg according to the above example


Using GHE Authentication with github.com

It is possible to use GHE authentication with github.com:


1. Create an OAuth App
2. Copy ‘Client ID’, ‘Client Secret’ to your airflow.cfg according to the above example
3. Set host = github.com and oauth_callback_route = /oauth/callback in airflow.cfg

3.13.4.2 Google Authentication

The Google authentication backend can be used to authenticate users against Google using OAuth2. You must specify the domains (comma-separated) to restrict login to members of those domains.

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.google_auth

[google]
client_id = google_client_id
client_secret = google_client_secret
oauth_callback_route = /oauth2callback
domain = "example1.com,example2.com"

To use Google authentication, you must install Airflow with the google_auth extras group:

pip install airflow[google_auth]

Setting up Google Authentication

An application must be set up in the Google API Console before you can use the Google authentication backend. In order to set up an application:
1. Navigate to https://console.developers.google.com/apis/
2. Select ‘Credentials’ from the left hand nav
3. Click ‘Create credentials’ and choose ‘OAuth client ID’
4. Choose ‘Web application’
5. Fill in the required information (the 'Authorized redirect URIs' must be fully qualified e.g. http://airflow.example.com/oauth2callback)
6. Click ‘Create’
7. Copy ‘Client ID’, ‘Client Secret’, and your redirect URI to your airflow.cfg according to the above example

3.13.5 SSL

SSL can be enabled by providing a certificate and key. Once enabled, be sure to use “https://” in your browser.

[webserver]
web_server_ssl_cert = <path to cert>
web_server_ssl_key = <path to key>


Enabling SSL will not automatically change the web server port. If you want to use the standard port 443, you’ll need
to configure that too. Be aware that super user privileges (or cap_net_bind_service on Linux) are required to listen on
port 443.

# Optionally, set the server to listen on the standard SSL port.


web_server_port = 443
base_url = http://<hostname or IP>:443

Enable CeleryExecutor with SSL. Ensure you properly generate client and server certs and keys.

[celery]
ssl_active = True
ssl_key = <path to key>
ssl_cert = <path to cert>
ssl_cacert = <path to cacert>

3.13.6 Impersonation

Airflow has the ability to impersonate a unix user while running task instances based on the task’s run_as_user
parameter, which takes a user’s name.
NOTE: For impersonation to work, Airflow must be run with sudo, as subtasks are run with sudo -u and permissions of files are changed. Furthermore, the unix user needs to exist on the worker. Here is what a simple sudoers file entry could look like to achieve this, assuming Airflow is running as the airflow user. Note that this means that the airflow user must be trusted and treated the same way as the root user.

airflow ALL=(ALL) NOPASSWD: ALL

Subtasks with impersonation will still log to the same folder, except that the files they log to will have permissions changed such that only the unix user can write to them.
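For illustration, a minimal sketch of a task using run_as_user (the user name, task id, command, and surrounding dag object are assumptions for the example):

from airflow.operators.bash_operator import BashOperator

run_as_batch = BashOperator(
    task_id='run_as_batch_user',
    bash_command='whoami',      # prints 'batch_user' when impersonation is active
    run_as_user='batch_user',   # the unix user to impersonate; must exist on the worker
    dag=dag,
)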

3.13.6.1 Default Impersonation

To prevent tasks that don't use impersonation from being run with sudo privileges, you can set the core:default_impersonation config, which sets a default user to impersonate if run_as_user is not set.

[core]
default_impersonation = airflow

3.13.7 Flower Authentication

Basic authentication for Celery Flower is supported.


You can specify the details either as an optional argument in the Flower process launching command, or as a configu-
ration item in your airflow.cfg. For both cases, please provide user:password pairs separated by a comma.

airflow flower --basic_auth=user1:password1,user2:password2

[celery]
flower_basic_auth = user1:password1,user2:password2


3.14 Time zones

Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database. This allows you to run your DAGs with time-zone-dependent schedules. At the moment Airflow does not convert datetimes to the end user's time zone in the user interface; they will always be displayed in UTC. Templates used in Operators are not converted either. Time zone information is exposed, and it is up to the writer of the DAG to decide what to do with it.
This is handy if your users live in more than one time zone and you want to display datetime information according to
each user’s wall clock.
Even if you are running Airflow in only one time zone, it is still good practice to store data in UTC in your database (before Airflow became time zone aware, this was also the recommended or even required setup). The main reason
is Daylight Saving Time (DST). Many countries have a system of DST, where clocks are moved forward in spring
and backward in autumn. If you’re working in local time, you’re likely to encounter errors twice a year, when the
transitions happen. (The pendulum and pytz documentation discusses these issues in greater detail.) This probably
doesn’t matter for a simple DAG, but it’s a problem if you are in, for example, financial services where you have end
of day deadlines to meet.
The time zone is set in airflow.cfg. By default it is set to utc, but you can change it to use the system settings or an arbitrary IANA time zone, e.g. Europe/Amsterdam. It relies on pendulum, which is more accurate than pytz. Pendulum is installed when you install Airflow.
Please note that the Web UI currently only runs in UTC.

3.14.1 Concepts

3.14.1.1 Naïve and aware datetime objects

Python’s datetime.datetime objects have a tzinfo attribute that can be used to store time zone information, represented
as an instance of a subclass of datetime.tzinfo. When this attribute is set and describes an offset, a datetime object is
aware. Otherwise, it’s naive.
You can use timezone.is_aware() and timezone.is_naive() to determine whether datetimes are aware or naive.
Because Airflow uses time-zone-aware datetime objects, if your code creates datetime objects they need to be aware too.

from airflow.utils import timezone

now = timezone.utcnow()
a_date = timezone.datetime(2017,1,1)

3.14.1.2 Interpretation of naive datetime objects

Although Airflow operates fully time zone aware, it still accepts naive datetime objects for start_dates and end_dates in your DAG definitions. This is mostly in order to preserve backwards compatibility. In case a naive start_date or end_date is encountered, the default time zone is applied, in such a way that the naive datetime is assumed to already be in the default time zone. In other words, if you have a default time zone setting of Europe/Amsterdam and create a naive datetime start_date of datetime(2017, 1, 1), it is assumed to be a start_date of Jan 1, 2017 Amsterdam time.


default_args = dict(
    start_date=datetime(2016, 1, 1),
    owner='Airflow'
)

dag = DAG('my_dag', default_args=default_args)

op = DummyOperator(task_id='dummy', dag=dag)
print(op.owner)  # Airflow

Unfortunately, during DST transitions, some datetimes don’t exist or are ambiguous. In such situations, pendulum
raises an exception. That’s why you should always create aware datetime objects when time zone support is enabled.
In practice, this is rarely an issue. Airflow gives you aware datetime objects in the models and DAGs, and most often,
new datetime objects are created from existing ones through timedelta arithmetic. The only datetime that’s often
created in application code is the current time, and timezone.utcnow() automatically does the right thing.

3.14.1.3 Default time zone

The default time zone is the time zone defined by the default_timezone setting under [core]. If you just installed Airflow it will be set to utc, which is recommended. You can also set it to system or an IANA time zone (e.g. Europe/Amsterdam). DAGs are also evaluated on Airflow workers; it is therefore important to make sure this setting is equal on all Airflow nodes.

[core]
default_timezone = utc

3.14.2 Time zone aware DAGs

Creating a time zone aware DAG is quite simple. Just make sure to supply a time zone aware start_date. It is recommended to use pendulum for this, but pytz (to be installed manually) can also be used.

import pendulum

local_tz = pendulum.timezone("Europe/Amsterdam")

default_args = dict(
    start_date=datetime(2016, 1, 1, tzinfo=local_tz),
    owner='Airflow'
)

dag = DAG('my_tz_dag', default_args=default_args)

op = DummyOperator(task_id='dummy', dag=dag)
print(dag.timezone)  # <Timezone [Europe/Amsterdam]>

Please note that while it is possible to set a start_date and end_date for Tasks, the DAG timezone or global timezone (in that order) will always be used to calculate the next execution date. Upon first encounter, the start date or end date will be converted to UTC using the timezone associated with start_date or end_date; after that, this timezone information is disregarded for calculations.

3.14.2.1 Templates

Airflow returns time zone aware datetimes in templates, but does not convert them to local time so they remain in
UTC. It is left up to the DAG to handle this.


import pendulum

local_tz = pendulum.timezone("Europe/Amsterdam")
local_tz.convert(execution_date)

3.14.2.2 Cron schedules

In case you set a cron schedule, Airflow assumes you will always want to run at the exact same time. It will then ignore daylight saving time. Thus, if you have a schedule that says run at the end of the interval every day at 08:00 GMT+1, it will always run at the end of the interval at 08:00 GMT+1, regardless of whether daylight saving time is in effect.

3.14.2.3 Time deltas

For schedules with time deltas, Airflow assumes you always want to run with the specified interval. So if you specify timedelta(hours=2) you will always want to run two hours later. In this case daylight saving time will be taken into account.
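As an illustration, a minimal sketch of a timedelta schedule (the DAG id and start_date are chosen for the example); the interval stays fixed at two wall-clock hours across DST transitions:

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'every_two_hours',
    start_date=datetime(2016, 1, 1),
    schedule_interval=timedelta(hours=2),
)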

3.15 Experimental Rest API

Airflow exposes an experimental Rest API. It is available through the webserver. Endpoints are available at
/api/experimental/. Please note that we expect the endpoint definitions to change.

3.15.1 Endpoints

POST /api/experimental/dags/<DAG_ID>/dag_runs
Creates a dag_run for a given dag id.
To trigger the DAG with a config, for example:

curl -X POST \
http://localhost:8080/api/experimental/dags/<DAG_ID>/dag_runs \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/json' \
-d '{"conf":"{\"key\":\"value\"}"}'

GET /api/experimental/dags/<DAG_ID>/dag_runs
Returns a list of Dag Runs for a specific DAG ID.
GET /api/experimental/dags/<string:dag_id>/dag_runs/<string:execution_date>
Returns a JSON with a dag_run’s public instance variables. The format for the <string:execution_date> is
expected to be “YYYY-mm-DDTHH:MM:SS”, for example: “2016-11-16T11:34:15”.
GET /api/experimental/test
Checks that the REST API server is working correctly. Returns status 'OK'.
GET /api/experimental/dags/<DAG_ID>/tasks/<TASK_ID>
Returns info for a task.
GET /api/experimental/dags/<DAG_ID>/dag_runs/<string:execution_date>/tasks/<TASK_ID>
Returns a JSON with a task instance’s public instance variables. The format for the <string:execution_date> is
expected to be “YYYY-mm-DDTHH:MM:SS”, for example: “2016-11-16T11:34:15”.

3.15. Experimental Rest API 129


Airflow Documentation, Release 1.10.2

GET /api/experimental/dags/<DAG_ID>/paused/<string:paused>
'<string:paused>' must be 'true' to pause a DAG and 'false' to unpause it.
GET /api/experimental/latest_runs
Returns the latest DagRun for each DAG formatted for the UI.
GET /api/experimental/pools
Get all pools.
GET /api/experimental/pools/<string:name>
Get pool by a given name.
POST /api/experimental/pools
Create a pool.
DELETE /api/experimental/pools/<string:name>
Delete pool.
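For example, sketches of two of the GET endpoints above (the DAG id and webserver address are illustrative):

# Pause a DAG
curl -X GET http://localhost:8080/api/experimental/dags/example_dag/paused/true

# List all pools
curl -X GET http://localhost:8080/api/experimental/pools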

3.15.2 CLI

For some functions the CLI can use the API. To configure the CLI to use the API when available, configure it as follows:

[cli]
api_client = airflow.api.client.json_client
endpoint_url = http://<WEBSERVER>:<PORT>

3.15.3 Authentication

Authentication for the API is handled separately from the Web Authentication. The default is to not require any authentication on the API, i.e. it is wide open by default. This is not recommended if your Airflow webserver is publicly accessible, and you should probably use the deny all backend:

[api]
auth_backend = airflow.api.auth.backend.deny_all

Two "real" methods for authentication are currently supported for the API.
To enable Password authentication, set the following in the configuration:

[api]
auth_backend = airflow.contrib.auth.backends.password_auth

Its usage is similar to the Password Authentication used for the Web interface.
To enable Kerberos authentication, set the following in the configuration:

[api]
auth_backend = airflow.api.auth.backend.kerberos_auth

[kerberos]
keytab = <KEYTAB>

The Kerberos service is configured as airflow/fully.qualified.domainname@REALM. Make sure this principal exists in the keytab file.


3.16 Integration
• Reverse Proxy
• Azure: Microsoft Azure
• AWS: Amazon Web Services
• Databricks
• GCP: Google Cloud Platform
• Qubole

3.16.1 Reverse Proxy

Airflow can be set up behind a reverse proxy, with the ability to set its endpoint with great flexibility.
For example, you can configure your reverse proxy to expose Airflow at:

https://lab.mycompany.com/myorg/airflow/

To do so, you need to set the following setting in your airflow.cfg:

base_url = http://my_host/myorg/airflow

Additionally, if you use the CeleryExecutor, you can get Flower at /myorg/flower with:

flower_url_prefix = /myorg/flower

Your reverse proxy (e.g. nginx) should be configured as follows:

• pass the URL and HTTP headers as-is to the Airflow webserver, without any rewrite, for example:

server {
listen 80;
server_name lab.mycompany.com;

location /myorg/airflow/ {
proxy_pass http://localhost:8080;
proxy_set_header Host $host;
proxy_redirect off;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}

• rewrite the url for the flower endpoint:

server {
listen 80;
server_name lab.mycompany.com;

location /myorg/flower/ {
rewrite ^/myorg/flower/(.*)$ /$1 break; # remove prefix from http header
proxy_pass http://localhost:5555;
proxy_set_header Host $host;
proxy_redirect off;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}

To ensure that Airflow generates URLs with the correct scheme when running behind a TLS-terminating proxy, you should configure the proxy to set the X-Forwarded-Proto header, and enable the ProxyFix middleware in your airflow.cfg:

enable_proxy_fix = True

Note: you should only enable the ProxyFix middleware when running Airflow behind a trusted proxy (AWS ELB,
nginx, etc.).

3.16.2 Azure: Microsoft Azure

Airflow has limited support for Microsoft Azure: interfaces exist only for Azure Blob Storage and Azure Data Lake. The Hook, Sensor and Operator for Blob Storage and the Azure Data Lake Hook are in the contrib section.

3.16.2.1 Azure Blob Storage

All classes communicate via the Windows Azure Storage Blob protocol. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=KEY), or a login and SAS token in the extra field (see connection wasb_default for an example). A short usage sketch follows the list below.
• WasbBlobSensor: Checks if a blob is present on Azure Blob storage.
• WasbPrefixSensor: Checks if blobs matching a prefix are present on Azure Blob storage.
• FileToWasbOperator: Uploads a local file to a container as a blob.
• WasbHook: Interface with Azure Blob Storage.
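As a sketch of how one of these classes is wired into a DAG (the container, blob name, task id, and surrounding dag object are assumptions for the example; the sensor lives in airflow.contrib.sensors.wasb_sensor in this release):

from airflow.contrib.sensors.wasb_sensor import WasbBlobSensor

wait_for_blob = WasbBlobSensor(
    task_id='wait_for_blob',
    container_name='my-container',
    blob_name='data/2018-01-01.csv',
    wasb_conn_id='wasb_default',
    dag=dag,
)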

WasbBlobSensor

WasbPrefixSensor

FileToWasbOperator

WasbHook

3.16.2.2 Azure File Share

Cloud variant of an SMB file share. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=Storage account key), or a login and SAS token in the extra field (see connection wasb_default for an example).


AzureFileShareHook

3.16.2.3 Logging

Airflow can be configured to read and write task logs in Azure Blob Storage. See Writing Logs to Azure Blob Storage.

3.16.2.4 Azure CosmosDB

AzureCosmosDBHook communicates via the Azure Cosmos library. Make sure that an Airflow connection of type azure_cosmos exists. Authorization can be done by supplying a login (=Endpoint uri), password (=secret key), and the extra fields database_name and collection_name to specify the default database and collection to use (see connection azure_cosmos_default for an example).
• AzureCosmosDBHook: Interface with Azure CosmosDB.
• AzureCosmosInsertDocumentOperator: Simple operator to insert document into CosmosDB.
• AzureCosmosDocumentSensor: Simple sensor to detect document existence in CosmosDB.

AzureCosmosDBHook

AzureCosmosInsertDocumentOperator

AzureCosmosDocumentSensor

3.16.2.5 Azure Data Lake

AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), password (=Client Secret), and the extra fields tenant (Tenant) and account_name (Account Name) (see connection azure_data_lake_default for an example).
• AzureDataLakeHook: Interface with Azure Data Lake.
• AzureDataLakeStorageListOperator: Lists the files located in a specified Azure Data Lake path.
• AdlsToGoogleCloudStorageOperator: Copies files from an Azure Data Lake path to a Google Cloud Storage
bucket.

AzureDataLakeHook

AzureDataLakeStorageListOperator

AdlsToGoogleCloudStorageOperator

3.16.3 AWS: Amazon Web Services

Airflow has extensive support for Amazon Web Services. But note that the Hooks, Sensors and Operators are in the
contrib section.


3.16.3.1 AWS EMR

• EmrAddStepsOperator : Adds steps to an existing EMR JobFlow.


• EmrCreateJobFlowOperator : Creates an EMR JobFlow, reading the config from the EMR connection.
• EmrTerminateJobFlowOperator : Terminates an EMR JobFlow.
• EmrHook : Interact with AWS EMR.

EmrAddStepsOperator

class airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator(**kwargs)
Bases: airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow.
Parameters
• job_flow_id (str) – id of the JobFlow to add steps to. (templated)
• aws_conn_id (str) – aws connection to use
• steps (list) – boto3 style steps to be added to the jobflow. (templated)

EmrCreateJobFlowOperator

class airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can
be passed that override the config from the connection.
Parameters
• aws_conn_id (str) – aws connection to use
• emr_conn_id (str) – emr connection to use
• job_flow_overrides (dict) – boto3 style arguments to override emr_connection ex-
tra. (templated)

EmrTerminateJobFlowOperator

class airflow.contrib.operators.emr_terminate_job_flow_operator.EmrTerminateJobFlowOperator(**kwargs)
Bases: airflow.models.BaseOperator
Operator to terminate EMR JobFlows.
Parameters
• job_flow_id (str) – id of the JobFlow to terminate. (templated)
• aws_conn_id (str) – aws connection to use


EmrHook

class airflow.contrib.hooks.emr_hook.EmrHook(emr_conn_id=None, region_name=None, *args, **kwargs)
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS EMR. emr_conn_id is only necessary for using the create_job_flow method.
create_job_flow(job_flow_overrides)
Creates a job flow using the config from the EMR connection. Keys of the json extra hash may have
the arguments of the boto3 run_job_flow method. Overrides for this config may be passed as the
job_flow_overrides.
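To show how these pieces fit together, here is a sketch of a DAG fragment chaining the three EMR operators (the connection ids, the job flow overrides, the SPARK_STEPS list, and the surrounding dag object are assumptions for the example; the xcom_pull template reads the job flow id returned by the create step):

from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator

create_job_flow = EmrCreateJobFlowOperator(
    task_id='create_job_flow',
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    job_flow_overrides={'Name': 'airflow-job-flow'},
    dag=dag,
)

add_steps = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=SPARK_STEPS,  # boto3-style step definitions, assumed to be defined elsewhere
    dag=dag,
)

terminate_job_flow = EmrTerminateJobFlowOperator(
    task_id='terminate_job_flow',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag,
)

create_job_flow >> add_steps >> terminate_job_flow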

3.16.3.2 AWS S3

• S3Hook : Interact with AWS S3.


• S3FileTransformOperator : Copies data from a source S3 location to a temporary location on the local filesystem.
• S3ListOperator : Lists the files matching a key prefix from a S3 location.
• S3ToGoogleCloudStorageOperator : Syncs an S3 location with a Google Cloud Storage bucket.
• S3ToGoogleCloudStorageTransferOperator : Syncs an S3 bucket with a Google Cloud Storage bucket using the
GCP Storage Transfer Service.
• S3ToHiveTransfer : Moves data from S3 to Hive. The operator downloads a file from S3, stores the file locally
before loading it into a Hive table.

S3Hook

class airflow.hooks.S3_hook.S3Hook(aws_conn_id='aws_default', verify=None)
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS S3, using the boto3 library.
check_for_bucket(bucket_name)
Check if bucket_name exists.
Parameters bucket_name (str) – the name of the bucket
check_for_key(key, bucket_name=None)
Checks if a key exists in a bucket
Parameters
• key (str) – S3 key that will point to the file
• bucket_name (str) – Name of the bucket in which the file is stored
check_for_prefix(bucket_name, prefix, delimiter)
Checks that a prefix exists in a bucket
Parameters
• bucket_name (str) – the name of the bucket
• prefix (str) – a key prefix
• delimiter (str) – the delimiter marks key hierarchy.


check_for_wildcard_key(wildcard_key, bucket_name=None, delimiter='')
Checks that a key matching a wildcard expression exists in a bucket
Parameters
• wildcard_key (str) – the path to the key
• bucket_name (str) – the name of the bucket
• delimiter (str) – the delimiter marks key hierarchy
copy_object(source_bucket_key, dest_bucket_key, source_bucket_name=None,
dest_bucket_name=None, source_version_id=None)
Creates a copy of an object that is already stored in S3.
Note: the S3 connection used here needs to have access to both source and destination bucket/key.
Parameters
• source_bucket_key (str) – The key of the source object.
It can be either full s3:// style url or relative path from root level.
When it’s specified as a full s3:// url, please omit source_bucket_name.
• dest_bucket_key (str) – The key of the object to copy to.
The convention to specify dest_bucket_key is the same as source_bucket_key.
• source_bucket_name (str) – Name of the S3 bucket where the source object is in.
It should be omitted when source_bucket_key is provided as a full s3:// url.
• dest_bucket_name (str) – Name of the S3 bucket to where the object is copied.
It should be omitted when dest_bucket_key is provided as a full s3:// url.
• source_version_id (str) – Version ID of the source object (OPTIONAL)
create_bucket(bucket_name, region_name=None)
Creates an Amazon S3 bucket.
Parameters
• bucket_name (str) – The name of the bucket
• region_name (str) – The name of the aws region in which to create the bucket.
delete_objects(bucket, keys)
Parameters
• bucket (str) – Name of the bucket in which you are going to delete object(s)
• keys (str or list) – The key(s) to delete from S3 bucket.
When keys is a string, it’s supposed to be the key name of the single object to delete.
When keys is a list, it’s supposed to be the list of the keys to delete.
get_bucket(bucket_name)
Returns a boto3.S3.Bucket object
Parameters bucket_name (str) – the name of the bucket
get_key(key, bucket_name=None)
Returns a boto3.s3.Object
Parameters


• key (str) – the path to the key


• bucket_name (str) – the name of the bucket
get_wildcard_key(wildcard_key, bucket_name=None, delimiter='')
Returns a boto3.s3.Object object matching the wildcard expression
Parameters
• wildcard_key (str) – the path to the key
• bucket_name (str) – the name of the bucket
• delimiter (str) – the delimiter marks key hierarchy
list_keys(bucket_name, prefix='', delimiter='', page_size=None, max_items=None)
Lists keys in a bucket under prefix and not containing delimiter
Parameters
• bucket_name (str) – the name of the bucket
• prefix (str) – a key prefix
• delimiter (str) – the delimiter marks key hierarchy.
• page_size (int) – pagination size
• max_items (int) – maximum items to return
list_prefixes(bucket_name, prefix='', delimiter='', page_size=None, max_items=None)
Lists prefixes in a bucket under prefix
Parameters
• bucket_name (str) – the name of the bucket
• prefix (str) – a key prefix
• delimiter (str) – the delimiter marks key hierarchy.
• page_size (int) – pagination size
• max_items (int) – maximum items to return
load_bytes(bytes_data, key, bucket_name=None, replace=False, encrypt=False)
Loads bytes to S3
This is provided as a convenience to drop a string in S3. It uses the boto infrastructure to ship a file to s3.
Parameters
• bytes_data (bytes) – bytes to set as content for the key.
• key (str) – S3 key that will point to the file
• bucket_name (str) – Name of the bucket in which to store the file
• replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
• encrypt (bool) – If True, the file will be encrypted on the server-side by S3 and will
be stored in an encrypted form while at rest in S3.
load_file(filename, key, bucket_name=None, replace=False, encrypt=False)
Loads a local file to S3
Parameters
• filename (str) – name of the file to load.


• key (str) – S3 key that will point to the file


• bucket_name (str) – Name of the bucket in which to store the file
• replace (bool) – A flag to decide whether or not to overwrite the key if it already
exists. If replace is False and the key exists, an error will be raised.
• encrypt (bool) – If True, the file will be encrypted on the server-side by S3 and will
be stored in an encrypted form while at rest in S3.
load_file_obj(file_obj, key, bucket_name=None, replace=False, encrypt=False)
Loads a file object to S3
Parameters
• file_obj (file-like object) – The file-like object to set as the content for the
S3 key.
• key (str) – S3 key that will point to the file
• bucket_name (str) – Name of the bucket in which to store the file
• replace (bool) – A flag that indicates whether to overwrite the key if it already exists.
• encrypt (bool) – If True, S3 encrypts the file on the server, and the file is stored in
encrypted form at rest in S3.
load_string(string_data, key, bucket_name=None, replace=False, encrypt=False, encoding='utf-8')
Loads a string to S3
This is provided as a convenience to drop a string in S3. It uses the boto infrastructure to ship a file to s3.
Parameters
• string_data (str) – string to set as content for the key.
• key (str) – S3 key that will point to the file
• bucket_name (str) – Name of the bucket in which to store the file
• replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
• encrypt (bool) – If True, the file will be encrypted on the server-side by S3 and will
be stored in an encrypted form while at rest in S3.
read_key(key, bucket_name=None)
Reads a key from S3
Parameters
• key (str) – S3 key that will point to the file
• bucket_name (str) – Name of the bucket in which the file is stored
select_key(key, bucket_name=None, expression='SELECT * FROM S3Object', expression_type='SQL', input_serialization=None, output_serialization=None)
Reads a key with S3 Select.
Parameters
• key (str) – S3 key that will point to the file
• bucket_name (str) – Name of the bucket in which the file is stored
• expression (str) – S3 Select expression
• expression_type (str) – S3 Select expression type


• input_serialization (dict) – S3 Select input data serialization format


• output_serialization (dict) – S3 Select output data serialization format
Returns retrieved subset of original data by S3 Select
Return type str
See also:
For more details about S3 Select parameters: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.select_object_content
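A short usage sketch of the hook, using methods documented above (the connection id, bucket, and key names are illustrative):

from airflow.hooks.S3_hook import S3Hook

hook = S3Hook(aws_conn_id='aws_default')

# Upload a small payload, then read it back if it exists
hook.load_string('hello', key='greetings/hello.txt',
                 bucket_name='my-bucket', replace=True)
if hook.check_for_key('greetings/hello.txt', bucket_name='my-bucket'):
    print(hook.read_key('greetings/hello.txt', bucket_name='my-bucket'))

# List keys under a prefix
print(hook.list_keys(bucket_name='my-bucket', prefix='greetings/'))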

S3FileTransformOperator

class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies data from a source S3 location to a temporary location on the local filesystem. Runs a transformation on
this file as specified by the transformation script and uploads the output to a destination S3 location.
The locations of the source and the destination files in the local filesystem are provided as the first and second arguments to the transformation script. The transformation script is expected to read the data from source, transform it and write the output to the local destination file. The operator then takes over control and uploads the local destination file to S3.
S3 Select is also available to filter the source contents. Users can omit the transformation script if an S3 Select expression is specified. A usage sketch follows the parameter list below.
Parameters
• source_s3_key (str) – The key to be retrieved from S3. (templated)
• source_aws_conn_id (str) – source s3 connection
• source_verify (bool or str) – Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
This is also applicable to dest_verify.
• dest_s3_key (str) – The key to be written to S3. (templated)
• dest_aws_conn_id (str) – destination s3 connection
• replace (bool) – Replace dest S3 key if it already exists
• transform_script (str) – location of the executable transformation script
• select_expression (str) – S3 Select expression
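
Example (illustrative sketch): the bucket keys, the script path and the dag object below are placeholders; the operator and its parameters are as documented above.

transform = S3FileTransformOperator(
    task_id='transform_events',
    source_s3_key='s3://source-bucket/raw/{{ ds }}/events.log',
    dest_s3_key='s3://dest-bucket/clean/{{ ds }}/events.csv',
    transform_script='/usr/local/airflow/scripts/transform.py',
    replace=True,
    dag=dag)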

S3ListOperator

class airflow.contrib.operators.s3_list_operator.S3ListOperator(**kwargs)
Bases: airflow.models.BaseOperator


List all objects from the bucket with the given string prefix in name.
This operator returns a Python list with the names of the objects, which can be used by XCom in the downstream task.
Parameters
• bucket (string) – The S3 bucket where to find the objects. (templated)
• prefix (string) – Prefix string to filter the objects whose names begin with this prefix. (templated)
• delimiter (string) – the delimiter marks key hierarchy. (templated)
• aws_conn_id (string) – The connection ID to use when connecting to S3 storage.
• verify (bool or str) – Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.

Example: The following operator would list all the files (excluding subfolders) from the S3 customers/
2018/04/ key in the data bucket.

s3_file = S3ListOperator(
task_id='list_3s_files',
bucket='data',
prefix='customers/2018/04/',
delimiter='/',
aws_conn_id='aws_customers_conn'
)

S3ToGoogleCloudStorageOperator

class airflow.contrib.operators.s3_to_gcs_operator.S3ToGoogleCloudStorageOperator(**kwargs)
Bases: airflow.contrib.operators.s3_list_operator.S3ListOperator
Synchronizes an S3 key, possibly a prefix, with a Google Cloud Storage destination path.
Parameters
• bucket (string) – The S3 bucket where to find the objects. (templated)
• prefix (string) – Prefix string which filters objects whose name begin with such prefix.
(templated)
• delimiter (string) – the delimiter marks key hierarchy. (templated)
• aws_conn_id (string) – The source S3 connection
• dest_gcs_conn_id (string) – The destination connection ID to use when connecting
to Google Cloud Storage.
• dest_gcs (string) – The destination Google Cloud Storage bucket and prefix where
you want to store the files. (templated)
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.


• replace (bool) – Whether you want to replace existing destination files or not.
• verify (bool or str) – Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.

Example:

s3_to_gcs_op = S3ToGoogleCloudStorageOperator(
task_id='s3_to_gcs_example',
bucket='my-s3-bucket',
prefix='data/customers-201804',
dest_gcs_conn_id='google_cloud_default',
dest_gcs='gs://my.gcs.bucket/some/customers/',
replace=False,
dag=my_dag)

Note that bucket, prefix, delimiter and dest_gcs are templated, so you can use variables in them if
you wish.

S3ToGoogleCloudStorageTransferOperator

S3ToHiveTransfer

class airflow.operators.s3_to_hive_operator.S3ToHiveTransfer(**kwargs)
Bases: airflow.models.BaseOperator
Moves data from S3 to Hive. The operator downloads a file from S3 and stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata.
Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
Parameters
• s3_key (str) – The key to be retrieved from S3. (templated)
• field_dict (dict) – A dictionary with the field names in the file as keys and their Hive types as values
• hive_table (str) – target Hive table, use dot notation to target a specific database.
(templated)
• create (bool) – whether to create the table if it doesn’t exist
• recreate (bool) – whether to drop and recreate the table at every execution
• partition (dict) – target partition as a dict of partition columns and values. (templated)
• headers (bool) – whether the file contains column names on the first line


• check_headers (bool) – whether the column names on the first line should be checked
against the keys of field_dict
• wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard
pattern
• delimiter (str) – field delimiter in the file
• aws_conn_id (str) – source s3 connection
• hive_cli_conn_id (str) – destination hive connection
• input_compressed (bool) – Boolean to determine if file decompression is required to
process headers
• tblproperties (dict) – TBLPROPERTIES of the hive table being created
• select_expression (str) – S3 Select expression
• verify (bool or str) – Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
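
Example (illustrative sketch): the bucket, table and column names below are placeholders, and the connections 'aws_default' and 'hive_cli_default' plus the dag object are assumed to exist.

s3_to_hive = S3ToHiveTransfer(
    task_id='stage_orders',
    s3_key='s3://my-bucket/exports/orders/{{ ds }}.csv',
    field_dict={'order_id': 'BIGINT', 'amount': 'DOUBLE'},
    hive_table='staging.orders',
    create=True,
    recreate=True,
    partition={'ds': '{{ ds }}'},
    headers=True,
    check_headers=True,
    delimiter=',',
    aws_conn_id='aws_default',
    hive_cli_conn_id='hive_cli_default',
    dag=dag)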

3.16.3.3 AWS EC2 Container Service

• ECSOperator : Execute a task on AWS EC2 Container Service.

ECSOperator

class airflow.contrib.operators.ecs_operator.ECSOperator(**kwargs)
Bases: airflow.models.BaseOperator
Execute a task on AWS EC2 Container Service
Parameters
• task_definition (str) – the task definition name on EC2 Container Service
• cluster (str) – the cluster name on EC2 Container Service
• overrides (dict) – the same parameter that boto3 will receive (templated): http://boto3.
readthedocs.org/en/latest/reference/services/ecs.html#ECS.Client.run_task
• aws_conn_id (str) – connection id of AWS credentials / region name. If None, the default boto3 credential strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html).
• region_name (str) – region name to use in the AWS Hook. Overrides the region_name in the connection (if provided)
• launch_type (str) – the launch type on which to run your task (‘EC2’ or ‘FARGATE’)
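
Example (illustrative sketch): the task definition, cluster and container names are placeholders and must match resources that already exist in your AWS account.

run_container = ECSOperator(
    task_id='run_container',
    task_definition='my-task-definition',
    cluster='my-ecs-cluster',
    overrides={
        'containerOverrides': [
            {'name': 'my-container', 'command': ['python', 'job.py']},
        ],
    },
    aws_conn_id='aws_default',
    region_name='us-east-1',
    launch_type='FARGATE',
    dag=dag)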

3.16.3.4 AWS Batch Service

• AWSBatchOperator : Execute a task on AWS Batch Service.


AWSBatchOperator

class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(**kwargs)
Bases: airflow.models.BaseOperator
Execute a job on AWS Batch Service
Parameters
• job_name (str) – the name for the job that will run on AWS Batch
• job_definition (str) – the job definition name on AWS Batch
• job_queue (str) – the queue name on AWS Batch
• overrides (dict) – the same parameter that boto3 will receive on con-
tainerOverrides (templated): http://boto3.readthedocs.io/en/latest/reference/services/batch.
html#submit_job
• max_retries (int) – number of exponential backoff retries used while a boto3 waiter is not available; the default of 4200 corresponds to roughly 48 hours
• aws_conn_id (str) – connection id of AWS credentials / region name. If None, the default boto3 credential strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html).
• region_name (str) – region name to use in the AWS Hook. Overrides the region_name in the connection (if provided)
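
Example (illustrative sketch): the job definition and queue names are placeholders that must already exist in AWS Batch.

batch_job = AWSBatchOperator(
    task_id='submit_batch_job',
    job_name='nightly_aggregation',
    job_definition='my-job-definition',
    job_queue='my-job-queue',
    overrides={'command': ['python', 'aggregate.py', '--date', '{{ ds }}']},
    aws_conn_id='aws_default',
    region_name='us-east-1',
    dag=dag)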

3.16.3.5 AWS RedShift

• AwsRedshiftClusterSensor : Waits for a Redshift cluster to reach a specific status.


• RedshiftHook : Interact with AWS Redshift, using the boto3 library.
• RedshiftToS3Transfer : Executes an unload command to S3 as CSV with or without headers.
• S3ToRedshiftTransfer : Executes a copy command from S3 as CSV with or without headers.

AwsRedshiftClusterSensor

class airflow.contrib.sensors.aws_redshift_cluster_sensor.AwsRedshiftClusterSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a Redshift cluster to reach a specific status.
Parameters
• cluster_identifier (str) – The identifier for the cluster being pinged.
• target_status (str) – The cluster status desired.
poke(context)
Function that sensors derived from this class should override.
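
Example (illustrative sketch): 'my-redshift-cluster' is a placeholder identifier; poke_interval and timeout come from the BaseSensorOperator.

wait_for_cluster = AwsRedshiftClusterSensor(
    task_id='wait_for_redshift',
    cluster_identifier='my-redshift-cluster',
    target_status='available',
    poke_interval=60,
    timeout=60 * 30,
    dag=dag)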

RedshiftHook

class airflow.contrib.hooks.redshift_hook.RedshiftHook(aws_conn_id=’aws_default’,
verify=None)
Bases: airflow.contrib.hooks.aws_hook.AwsHook


Interact with AWS Redshift, using the boto3 library


cluster_status(cluster_identifier)
Return status of a cluster
Parameters cluster_identifier (str) – unique identifier of a cluster
create_cluster_snapshot(snapshot_identifier, cluster_identifier)
Creates a snapshot of a cluster
Parameters
• snapshot_identifier (str) – unique identifier for a snapshot of a cluster
• cluster_identifier (str) – unique identifier of a cluster
delete_cluster(cluster_identifier, skip_final_cluster_snapshot=True, final_cluster_snapshot_identifier='')
Delete a cluster and optionally create a snapshot
Parameters
• cluster_identifier (str) – unique identifier of a cluster
• skip_final_cluster_snapshot (bool) – determines cluster snapshot creation
• final_cluster_snapshot_identifier (str) – name of final cluster snapshot
describe_cluster_snapshots(cluster_identifier)
Gets a list of snapshots for a cluster
Parameters cluster_identifier (str) – unique identifier of a cluster
restore_from_cluster_snapshot(cluster_identifier, snapshot_identifier)
Restores a cluster from its snapshot
Parameters
• cluster_identifier (str) – unique identifier of a cluster
• snapshot_identifier (str) – unique identifier for a snapshot of a cluster
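
Example (illustrative sketch): calling the hook methods above directly, assuming an 'aws_default' connection; the cluster and snapshot identifiers are placeholders.

from airflow.contrib.hooks.redshift_hook import RedshiftHook

redshift = RedshiftHook(aws_conn_id='aws_default')

# Only snapshot the cluster when it is available.
if redshift.cluster_status('my-cluster') == 'available':
    redshift.create_cluster_snapshot('my-cluster-backup', 'my-cluster')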

RedshiftToS3Transfer

S3ToRedshiftTransfer

3.16.3.6 Amazon SageMaker

For more instructions on using Amazon SageMaker in Airflow, please see the SageMaker Python SDK README.
• SageMakerHook : Interact with Amazon SageMaker.
• SageMakerTrainingOperator : Create a SageMaker training job.
• SageMakerTuningOperator : Create a SageMaker tuning job.
• SageMakerModelOperator : Create a SageMaker model.
• SageMakerTransformOperator : Create a SageMaker transform job.
• SageMakerEndpointConfigOperator : Create a SageMaker endpoint config.
• SageMakerEndpointOperator : Create a SageMaker endpoint.


SageMakerHook

class airflow.contrib.hooks.sagemaker_hook.SageMakerHook(*args, **kwargs)


Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with Amazon SageMaker.
check_s3_url(s3url)
Check if an S3 URL exists
Parameters s3url (str) – S3 url
Return type bool
check_status(job_name, key, describe_function, check_interval, max_ingestion_time,
non_terminal_states=None)
Check status of a SageMaker job
Parameters
• job_name (str) – name of the job to check status
• key (str) – the key of the response dict that points to the state
• describe_function (python callable) – the function used to retrieve the status
• args – the arguments for the function
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any Sage-
Maker jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
• non_terminal_states (set) – the set of nonterminal states
Returns response of describe call after job is done
check_training_config(training_config)
Check if a training configuration is valid
Parameters training_config (dict) – training_config
Returns None
check_training_status_with_log(job_name, non_terminal_states, failed_states,
wait_for_completion, check_interval,
max_ingestion_time)
Display the logs for a given training job, optionally tailing them until the job is complete.
Parameters
• job_name (str) – name of the training job to check status and display logs for
• non_terminal_states (set) – the set of non_terminal states
• failed_states (set) – the set of failed states
• wait_for_completion (bool) – Whether to keep looking for new log entries until
the job completes
• check_interval (int) – The interval in seconds between polling for new log entries
and job completion


• max_ingestion_time (int) – the maximum ingestion time in seconds. Any Sage-


Maker jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns None
check_tuning_config(tuning_config)
Check if a tuning configuration is valid
Parameters tuning_config (dict) – tuning_config
Returns None
configure_s3_resources(config)
Extract the S3 operations from the configuration and execute them.
Parameters config (dict) – config of SageMaker operation
Return type dict
create_endpoint(config, wait_for_completion=True, check_interval=30,
max_ingestion_time=None)
Create an endpoint
Parameters
• config (dict) – the config for endpoint
• wait_for_completion (bool) – if the program should keep running until job fin-
ishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any Sage-
Maker jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to endpoint creation
create_endpoint_config(config)
Create an endpoint config
Parameters config (dict) – the config for endpoint-config
Returns A response to endpoint config creation
create_model(config)
Create a model job
Parameters config (dict) – the config for model
Returns A response to model creation
create_training_job(config, wait_for_completion=True, print_log=True, check_interval=30,
max_ingestion_time=None)
Create a training job
Parameters
• config (dict) – the config for training
• wait_for_completion (bool) – if the program should keep running until job fin-
ishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job


• max_ingestion_time (int) – the maximum ingestion time in seconds. Any Sage-


Maker jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to training job creation
create_transform_job(config, wait_for_completion=True, check_interval=30,
max_ingestion_time=None)
Create a transform job
Parameters
• config (dict) – the config for transform job
• wait_for_completion (bool) – if the program should keep running until job fin-
ishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any Sage-
Maker jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to transform job creation
create_tuning_job(config, wait_for_completion=True, check_interval=30,
max_ingestion_time=None)
Create a tuning job
Parameters
• config (dict) – the config for tuning
• wait_for_completion (bool) – if the program should keep running until the job finishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any Sage-
Maker jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to tuning job creation
describe_endpoint(name)
Parameters name (string) – the name of the endpoint
Returns A dict containing all the endpoint info
describe_endpoint_config(name)
Return the endpoint config info associated with the name
Parameters name (string) – the name of the endpoint config
Returns A dict containing all the endpoint config info
describe_model(name)
Return the SageMaker model info associated with the name
Parameters name (string) – the name of the SageMaker model
Returns A dict containing all the model info


describe_training_job(name)
Return the training job info associated with the name
Parameters name (str) – the name of the training job
Returns A dict containing all the training job info
describe_training_job_with_log(job_name, positions, stream_names, instance_count, state,
last_description, last_describe_job_call)
Return the training job info associated with job_name and print CloudWatch logs
describe_transform_job(name)
Return the transform job info associated with the name
Parameters name (string) – the name of the transform job
Returns A dict containing all the transform job info
describe_tuning_job(name)
Return the tuning job info associated with the name
Parameters name (string) – the name of the tuning job
Returns A dict containing all the tuning job info
get_conn()
Establish an AWS connection for SageMaker
Return type SageMaker.Client
get_log_conn()
Establish an AWS connection for retrieving logs during training
Return type CloudWatchLog.Client
log_stream(log_group, stream_name, start_time=0, skip=0)
A generator for log items in a single stream. This will yield all the items that are available at the current
moment.
Parameters
• log_group (str) – The name of the log group.
• stream_name (str) – The name of the specific stream.
• start_time (int) – The time stamp value to start reading the logs from (default: 0).
• skip (int) – The number of log entries to skip at the start (default: 0). This is for when
there are multiple entries at the same timestamp.
Return type dict
Returns

A CloudWatch log event with the following key-value pairs:


’timestamp’ (int): The time in milliseconds of the event.
’message’ (str): The log event data.
’ingestionTime’ (int): The time in milliseconds the event was ingested.

multi_stream_iter(log_group, streams, positions=None)


Iterate over the available events coming from a set of log streams in a single log group interleaving the
events from each stream so they’re yielded in timestamp order.


Parameters
• log_group (str) – The name of the log group.
• streams (list) – A list of the log stream names. The position of the stream in this list
is the stream number.
• positions (list) – A list of pairs of (timestamp, skip) which represents the last record
read from each stream.
Returns A tuple of (stream number, cloudwatch log event).
tar_and_s3_upload(path, key, bucket)
Tar the local file or directory and upload to s3
Parameters
• path (str) – local file or directory
• key (str) – s3 key
• bucket (str) – s3 bucket
Returns None
update_endpoint(config, wait_for_completion=True, check_interval=30,
max_ingestion_time=None)
Update an endpoint
Parameters
• config (dict) – the config for endpoint
• wait_for_completion (bool) – if the program should keep running until job fin-
ishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any Sage-
Maker jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to endpoint update
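
Example (illustrative sketch): querying a finished job through the hook, assuming an 'aws_default' connection; 'my-training-job' is a placeholder name.

from airflow.contrib.hooks.sagemaker_hook import SageMakerHook

sm_hook = SageMakerHook(aws_conn_id='aws_default')
info = sm_hook.describe_training_job('my-training-job')
print(info['TrainingJobStatus'])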

SageMakerTrainingOperator

class airflow.contrib.operators.sagemaker_training_operator.SageMakerTrainingOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Initiate a SageMaker training job.
This operator returns the ARN of the training job created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a training job (templated).
For details of the configuration parameter see SageMaker.Client.
create_training_job()
• aws_conn_id (str) – The AWS connection ID to use.


• wait_for_completion (bool) – If set to True, the operator waits until the training job finishes.
• print_log (bool) – if the operator should print the cloudwatch log during training
• check_interval (int) – if wait is set to be true, this is the time interval in seconds
which the operator will check the status of the training job
• max_ingestion_time (int) – If wait is set to True, the operation fails if the training
job doesn’t finish within max_ingestion_time seconds. If you set this parameter to None,
the operation does not timeout.
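
Example (illustrative sketch): the config must be a full request body for SageMaker.Client.create_training_job(); every value below (image, role, bucket, instance type) is a placeholder, as is the dag object.

training_config = {
    'TrainingJobName': 'example-training-{{ ds_nodash }}',
    'AlgorithmSpecification': {
        'TrainingImage': '123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest',
        'TrainingInputMode': 'File',
    },
    'RoleArn': 'arn:aws:iam::123456789012:role/my-sagemaker-role',
    'OutputDataConfig': {'S3OutputPath': 's3://my-bucket/training-output/'},
    'ResourceConfig': {
        'InstanceCount': 1,
        'InstanceType': 'ml.m5.large',
        'VolumeSizeInGB': 10,
    },
    'StoppingCondition': {'MaxRuntimeInSeconds': 3600},
}

train_model = SageMakerTrainingOperator(
    task_id='sagemaker_training',
    config=training_config,
    wait_for_completion=True,
    check_interval=60,
    aws_conn_id='aws_default',
    dag=dag)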

SageMakerTuningOperator

class airflow.contrib.operators.sagemaker_tuning_operator.SageMakerTuningOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Initiate a SageMaker hyperparameter tuning job.
This operator returns the ARN of the tuning job created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a tuning job (templated).
For details of the configuration parameter see SageMaker.Client.
create_hyper_parameter_tuning_job()
• aws_conn_id (str) – The AWS connection ID to use.
• wait_for_completion (bool) – Set to True to wait until the tuning job finishes.
• check_interval (int) – If wait is set to True, the time interval, in seconds, that this
operation waits to check the status of the tuning job.
• max_ingestion_time (int) – If wait is set to True, the operation fails if the tuning job
doesn’t finish within max_ingestion_time seconds. If you set this parameter to None, the
operation does not timeout.

SageMakerModelOperator

class airflow.contrib.operators.sagemaker_model_operator.SageMakerModelOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Create a SageMaker model.
This operator returns the ARN of the model created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to create a model.
For details of the configuration parameter see SageMaker.Client.
create_model()
• aws_conn_id (str) – The AWS connection ID to use.


SageMakerTransformOperator

class airflow.contrib.operators.sagemaker_transform_operator.SageMakerTransformOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Initiate a SageMaker transform job.
This operator returns the ARN of the model created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a transform job (templated).
If you need to create a SageMaker transform job based on an existing SageMaker model:

config = transform_config

If you need to create both a SageMaker model and a SageMaker transform job:

config = {
'Model': model_config,
'Transform': transform_config
}

For details of the configuration parameter of transform_config see SageMaker.Client.create_transform_job()
For details of the configuration parameter of model_config see SageMaker.Client.create_model()
• aws_conn_id (string) – The AWS connection ID to use.
• wait_for_completion (bool) – Set to True to wait until the transform job finishes.
• check_interval (int) – If wait is set to True, the time interval, in seconds, that this
operation waits to check the status of the transform job.
• max_ingestion_time (int) – If wait is set to True, the operation fails if the transform
job doesn’t finish within max_ingestion_time seconds. If you set this parameter to None,
the operation does not timeout.

SageMakerEndpointConfigOperator

class airflow.contrib.operators.sagemaker_endpoint_config_operator.SageMakerEndpointConfigOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Create a SageMaker endpoint config.
This operator returns the ARN of the endpoint config created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to create an endpoint config.
For details of the configuration parameter see SageMaker.Client.
create_endpoint_config()
• aws_conn_id (str) – The AWS connection ID to use.


SageMakerEndpointOperator

class airflow.contrib.operators.sagemaker_endpoint_operator.SageMakerEndpointOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Create a SageMaker endpoint.
This operator returns the ARN of the endpoint created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to create an endpoint.
If you need to create a SageMaker endpoint based on an existing SageMaker model and an existing SageMaker endpoint config:

config = endpoint_configuration

If you need to create the SageMaker model, the SageMaker endpoint config and the SageMaker endpoint all at once:

config = {
'Model': model_configuration,
'EndpointConfig': endpoint_config_configuration,
'Endpoint': endpoint_configuration
}

For details of the configuration parameter of model_configuration see SageMaker.Client.create_model()
For details of the configuration parameter of endpoint_config_configuration see SageMaker.Client.create_endpoint_config()
For details of the configuration parameter of endpoint_configuration see SageMaker.Client.create_endpoint()
• aws_conn_id (str) – The AWS connection ID to use.
• wait_for_completion (bool) – Whether the operator should wait until the endpoint
creation finishes.
• check_interval (int) – If wait is set to True, this is the time interval, in seconds, that
this operation waits before polling the status of the endpoint creation.
• max_ingestion_time (int) – If wait is set to True, this operation fails if the endpoint
creation doesn’t finish within max_ingestion_time seconds. If you set this parameter to
None it never times out.
• operation (str) – Whether to create an endpoint or update an endpoint. Must be either 'create' or 'update'.


3.16.4 Databricks

Databricks has contributed an Airflow operator which enables submitting runs to the Databricks platform. Internally
the operator talks to the api/2.0/jobs/runs/submit endpoint.


3.16.4.1 DatabricksSubmitRunOperator

class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(**kwargs)
Bases: airflow.models.BaseOperator
Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint.
There are two ways to instantiate this operator.
In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/
submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json param-
eter. For example

json = {
'new_cluster': {
'spark_version': '2.1.0-db3-scala2.11',
'num_workers': 2
},
'notebook_task': {
'notebook_path': '/Users/[email protected]/PrepareData',
},
}
notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)

Another way to accomplish the same thing is to use the named parameters of the
DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for
each top level parameter in the runs/submit endpoint. In this method, your code would look like this:

new_cluster = {
'spark_version': '2.1.0-db3-scala2.11',
'num_workers': 2
}
notebook_task = {
'notebook_path': '/Users/[email protected]/PrepareData',
}
notebook_run = DatabricksSubmitRunOperator(
task_id='notebook_run',
new_cluster=new_cluster,
notebook_task=notebook_task)

In the case where both the json parameter AND the named parameters are provided, they will be merged together.
If there are conflicts during the merge, the named parameters will take precedence and override the top level
json keys.
Currently the named parameters that DatabricksSubmitRunOperator supports are
• spark_jar_task
• notebook_task
• new_cluster
• existing_cluster_id
• libraries
• run_name
• timeout_seconds

Parameters


• json (dict) – A JSON object containing API parameters which will be passed directly
to the api/2.0/jobs/runs/submit endpoint. The other named parameters (i.e.
spark_jar_task, notebook_task..) to this operator will be merged with this json
dictionary if they are provided. If there are conflicts during the merge, the named parameters
will take precedence and override the top level json keys. (templated)
See also:
For more information about templating see Jinja Templating. https://docs.databricks.com/
api/latest/jobs.html#runs-submit
• spark_jar_task (dict) – The main class and parameters for the JAR task. Note
that the actual JAR is specified in the libraries. EITHER spark_jar_task OR
notebook_task should be specified. This field will be templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobssparkjartask
• notebook_task (dict) – The notebook path and parameters for the notebook task.
EITHER spark_jar_task OR notebook_task should be specified. This field will
be templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobsnotebooktask
• new_cluster (dict) – Specs for a new cluster on which this task will be run. EITHER
new_cluster OR existing_cluster_id should be specified. This field will be
templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobsclusterspecnewcluster
• existing_cluster_id (string) – ID for existing cluster on which to run this task.
EITHER new_cluster OR existing_cluster_id should be specified. This field
will be templated.
• libraries (list of dicts) – Libraries which this run will use. This field will be
templated.
See also:
https://docs.databricks.com/api/latest/libraries.html#managedlibrarieslibrary
• run_name (string) – The run name used for this task. By default this will be set
to the Airflow task_id. This task_id is a required parameter of the superclass
BaseOperator. This field will be templated.
• timeout_seconds (int32) – The timeout for this run. By default a value of 0 is used
which means to have no timeout. This field will be templated.
• databricks_conn_id (string) – The name of the Airflow connection to use. By
default and in the common case this will be databricks_default. To use token based
authentication, provide the key token in the extra field for the connection.
• polling_period_seconds (int) – Controls the rate which we poll for the result of
this run. By default the operator will poll every 30 seconds.
• databricks_retry_limit (int) – Number of times to retry if the Databricks backend is unreachable. Its value must be greater than or equal to 1.


• databricks_retry_delay (float) – Number of seconds to wait between retries (it


might be a floating point number).
• do_xcom_push (boolean) – Whether we should push run_id and run_page_url to xcom.

3.16.5 GCP: Google Cloud Platform

Airflow has extensive support for the Google Cloud Platform. Note, however, that most Hooks and Operators are in the contrib section, which means they have beta status and may have breaking changes between minor releases.
See the GCP connection type documentation to configure connections to GCP.

3.16.5.1 Logging

Airflow can be configured to read and write task logs in Google Cloud Storage. See Writing Logs to Google Cloud
Storage.

3.16.5.2 GoogleCloudBaseHook

class airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook(gcp_conn_id='google_cloud_default', delegate_to=None)
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.
LoggingMixin
A base hook for Google cloud-related hooks. Google cloud has a shared REST API client that is built in the
same way no matter which service you use. This class helps construct and authorize the credentials needed to
then call googleapiclient.discovery.build() to actually discover and build a client for a Google cloud service.
The class also contains some miscellaneous helper functions.
All hooks derived from this base hook use the 'Google Cloud Platform' connection type. Three ways of authentication are supported:
Default credentials: Only the 'Project Id' is required. You'll need to have set up default credentials, such as by the GOOGLE_APPLICATION_CREDENTIALS environment variable or from the metadata server on Google Compute Engine.
JSON key file: Specify ‘Project Id’, ‘Keyfile Path’ and ‘Scope’.
Legacy P12 key files are not supported.
JSON data provided in the UI: Specify ‘Keyfile JSON’.
static fallback_to_default_project_id(func)
Decorator that provides fallback for Google Cloud Platform project id. If the project is None it will be
replaced with the project_id from the service account the Hook is authenticated with. Project id can be
specified either via project_id kwarg or via first parameter in positional args.
Parameters func – function to wrap
Returns result of the function call
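
Example (illustrative sketch): the derived hook and its method below are hypothetical; they only illustrate how the decorator fills in project_id when the caller omits it.

from airflow.contrib.hooks.gcp_api_base_hook import GoogleCloudBaseHook

class MyGcpHook(GoogleCloudBaseHook):

    @GoogleCloudBaseHook.fallback_to_default_project_id
    def get_project(self, project_id=None):
        # When called without a project_id, the decorator injects the
        # project id of the service account the hook is authenticated with.
        return project_id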


3.16.5.3 BigQuery

BigQuery Operators

• BigQueryCheckOperator : Performs checks against a SQL query that will return a single row with different
values.
• BigQueryValueCheckOperator : Performs a simple value check using SQL code.
• BigQueryIntervalCheckOperator : Checks that the values of metrics given as SQL expressions are within a
certain tolerance of the ones from days_back before.
• BigQueryGetDataOperator : Fetches the data from a BigQuery table and returns data in a python list
• BigQueryCreateEmptyDatasetOperator : Creates an empty BigQuery dataset.
• BigQueryCreateEmptyTableOperator : Creates a new, empty table in the specified BigQuery dataset optionally
with schema.
• BigQueryCreateExternalTableOperator : Creates a new, external table in the dataset with the data in Google
Cloud Storage.
• BigQueryDeleteDatasetOperator : Deletes an existing BigQuery dataset.
• BigQueryTableDeleteOperator : Deletes an existing BigQuery table.
• BigQueryOperator : Executes BigQuery SQL queries in a specific BigQuery database.
• BigQueryToBigQueryOperator : Copy a BigQuery table to another BigQuery table.
• BigQueryToCloudStorageOperator : Transfers a BigQuery table to a Google Cloud Storage bucket

BigQueryCheckOperator

class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against BigQuery. The BigQueryCheckOperator expects a SQL query that will return a single row. Each value on that first row is evaluated using Python bool casting. If any of the values return False, the check fails and errors out.
Note that Python bool casting evals the following as False:
• False
• 0
• Empty string ("")
• Empty list ([])
• Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex queries that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today's partition is greater than yesterday's partition, or that a set of metrics are less than 3 standard deviations from the 7-day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your DAG, you have the choice to stop the critical path, preventing the publication of dubious data, or to place it on the side and receive email alerts without stopping the progress of the DAG.
Parameters


• sql (string) – the sql to be executed


• bigquery_conn_id (string) – reference to the BigQuery database
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
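
Example (illustrative sketch): a simple data quality gate; the dataset/table names, connection id and dag object are placeholders.

check_partition = BigQueryCheckOperator(
    task_id='check_partition_not_empty',
    sql="SELECT COUNT(*) FROM my_dataset.events WHERE ds = '{{ ds }}'",
    use_legacy_sql=False,
    bigquery_conn_id='bigquery_default',
    dag=dag)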

BigQueryValueCheckOperator

class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using sql code.
Parameters
• sql (string) – the sql to be executed
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
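
Example (illustrative sketch): fails the task unless the single value returned by the query matches pass_value within the optional tolerance; the names below are placeholders.

value_check = BigQueryValueCheckOperator(
    task_id='check_row_count',
    sql="SELECT COUNT(*) FROM my_dataset.events WHERE ds = '{{ ds }}'",
    pass_value=1000,
    tolerance=0.1,
    use_legacy_sql=False,
    dag=dag)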

BigQueryIntervalCheckOperator

class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from
days_back before.
This method constructs a query like so

SELECT {metrics_threshold_dict_key} FROM {table}
WHERE {date_filter_column}=<date>

Parameters
• table (str) – the table name
• days_back (int) – number of days between ds and the ds we want to check against.
Defaults to 7 days
• metrics_threshold (dict) – a dictionary of ratios indexed by metrics, for example
‘COUNT(*)’: 1.5 would require a 50 percent or less difference between the current day, and
the prior days_back.
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).

BigQueryGetDataOperator

class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(**kwargs)
Bases: airflow.models.BaseOperator
Fetches the data from a BigQuery table (alternatively fetching data for selected columns) and returns the data in a Python list. The number of elements in the returned list will be equal to the number of rows fetched. Each element in the list will again be a list, where each element represents the column values for that row.
Example Result: [['Tony', '10'], ['Mike', '20'], ['Steve', '15']]


Note: If you pass fields to selected_fields in a different order than the order of the columns already in the BQ table, the data will still be returned in the order of the BQ table. For example, if the BQ table has 3 columns [A,B,C] and you pass 'B,A' in selected_fields, the data will still be of the form 'A,B'.

Example:

get_data = BigQueryGetDataOperator(
task_id='get_data_from_bq',
dataset_id='test_dataset',
table_id='Transaction_partitions',
max_results='100',
selected_fields='DATE',
bigquery_conn_id='airflow-service-account'
)

Parameters
• dataset_id (string) – The dataset ID of the requested table. (templated)
• table_id (string) – The table ID of the requested table. (templated)
• max_results (string) – The maximum number of records (rows) to be fetched from
the table. (templated)
• selected_fields (string) – List of fields to return (comma-separated). If unspeci-
fied, all fields are returned.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.

BigQueryCreateEmptyTableOperator

class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new, empty table in the specified BigQuery dataset, optionally with schema.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass
the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google
cloud storage must be a JSON file with the schema fields in it. You can also create a table without schema.
Parameters
• project_id (string) – The project to create the table into. (templated)
• dataset_id (string) – The dataset to create the table into. (templated)
• table_id (string) – The Name of the table to be created. (templated)
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:


schema_fields=[{"name": "emp_name", "type": "STRING", "mode":


˓→"REQUIRED"},

{"name": "salary", "type": "INTEGER", "mode":


˓→"NULLABLE"}]

• gcs_schema_object (string) – Full path to the JSON file containing schema (templated). For example: gs://test-bucket/dir1/dir2/employee_schema.json
• time_partitioning (dict) – configure optional time partitioning fields i.e. partition
by field, type and expiration as per API specifications.
See also:
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#timePartitioning
• bigquery_conn_id (string) – Reference to a specific BigQuery hook.
• google_cloud_storage_conn_id (string) – Reference to a specific Google
cloud storage hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• labels (dict) – a dictionary containing labels for the table, passed to BigQuery
Example (with schema JSON in GCS):
CreateTable = BigQueryCreateEmptyTableOperator(
task_id='BigQueryCreateEmptyTableOperator_task',
dataset_id='ODS',
table_id='Employees',
project_id='internal-gcp-project',
gcs_schema_object='gs://schema-bucket/employee_schema.json',
bigquery_conn_id='airflow-service-account',
google_cloud_storage_conn_id='airflow-service-account'
)

Corresponding Schema file (employee_schema.json):


[
{
"mode": "NULLABLE",
"name": "emp_name",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "salary",
"type": "INTEGER"
}
]

Example (with schema in the DAG):


CreateTable = BigQueryCreateEmptyTableOperator(
task_id='BigQueryCreateEmptyTableOperator_task',
dataset_id='ODS',
table_id='Employees',
project_id='internal-gcp-project',
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}],

bigquery_conn_id='airflow-service-account',
google_cloud_storage_conn_id='airflow-service-account'
)

BigQueryCreateExternalTableOperator

class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new external table in the dataset with the data in Google Cloud Storage.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly
pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in
Google cloud storage must be a JSON file with the schema fields in it.
Parameters
• bucket (string) – The bucket to point the external table to. (templated)
• source_objects (list) – List of Google cloud storage URIs to point table to. (tem-
plated) If source_format is ‘DATASTORE_BACKUP’, the list must only contain a single
URI.
• destination_project_dataset_table (string) – The dotted
(<project>.)<dataset>.<table> BigQuery table to load data into (templated). If <project> is
not included, project will be the project defined in the connection json.
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:

schema_fields=[{"name": "emp_name", "type": "STRING", "mode":


˓→"REQUIRED"},

{"name": "salary", "type": "INTEGER", "mode":


˓→"NULLABLE"}]

Should not be set when source_format is ‘DATASTORE_BACKUP’.


• schema_object (string) – If set, a GCS object path pointing to a .json file that con-
tains the schema for the table. (templated)
• source_format (string) – File format of the data.
• compression (string) – [Optional] The compression type of the data source. Possible
values include GZIP and NONE. The default value is NONE. This setting is ignored for
Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
• skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
• field_delimiter (string) – The delimiter to use for the CSV.
• max_bad_records (int) – The maximum number of bad records that BigQuery can
ignore when running the job.


• quote_character (string) – The value that is used to quote data sections in a CSV
file.
• allow_quoted_newlines (boolean) – Whether to allow quoted newlines (true) or
not (false).
• allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns.
The missing values are treated as nulls. If false, records with missing trailing columns are
treated as bad records, and if there are too many bad records, an invalid error is returned in
the job result. Only applicable to CSV, ignored for other formats.
• bigquery_conn_id (string) – Reference to a specific BigQuery hook.
• google_cloud_storage_conn_id (string) – Reference to a specific Google
cloud storage hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• src_fmt_configs (dict) – configure optional fields specific to the source format
• labels (dict) – a dictionary containing labels for the table, passed to BigQuery
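Example (an illustrative sketch; bucket, object and table names are placeholders):

create_external_table = BigQueryCreateExternalTableOperator(
    task_id='create_external_table',
    bucket='my-source-bucket',
    source_objects=['sales/2019/*.csv'],
    destination_project_dataset_table='my-project.my_dataset.external_sales',
    source_format='CSV',
    skip_leading_rows=1,
    bigquery_conn_id='bigquery_default',
    google_cloud_storage_conn_id='google_cloud_default',
    dag=dag)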

BigQueryCreateEmptyDatasetOperator

class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator(**kwargs)
Bases: airflow.models.BaseOperator
This operator is used to create a new dataset for your project in BigQuery. https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
Parameters
• project_id (str) – The name of the project where we want to create the dataset. Don’t
need to provide, if projectId in dataset_reference.
• dataset_id (str) – The id of dataset. Don’t need to provide, if datasetId in
dataset_reference.
• dataset_reference – Dataset reference that could be provided with request body.
More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
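Example (an illustrative sketch; the project and dataset IDs are placeholders):

create_staging_dataset = BigQueryCreateEmptyDatasetOperator(
    task_id='create_staging_dataset',
    project_id='my-gcp-project',
    dataset_id='staging',
    dag=dag)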

BigQueryDeleteDatasetOperator

class airflow.contrib.operators.bigquery_operator.BigQueryDeleteDatasetOperator(**kwargs)
Bases: airflow.models.BaseOperator
This operator deletes an existing dataset from your project in BigQuery. https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/delete
Parameters
• project_id (string) – The project id of the dataset.
• dataset_id (string) – The dataset to be deleted.
Example:

delete_temp_data = BigQueryDeleteDatasetOperator(
    dataset_id='temp-dataset',
    project_id='temp-project',
    bigquery_conn_id='_my_gcp_conn_',
    task_id='Deletetemp',
    dag=dag)


BigQueryTableDeleteOperator

class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Deletes BigQuery tables
Parameters
• deletion_dataset_table (string) – A dotted
(<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted.
(templated)
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• ignore_if_missing (boolean) – if True, then return success even if the requested
table does not exist.
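Example (an illustrative sketch; the table reference and connection ID are placeholders):

delete_staging_table = BigQueryTableDeleteOperator(
    task_id='delete_staging_table',
    deletion_dataset_table='my-project.staging.tmp_events',
    ignore_if_missing=True,
    bigquery_conn_id='bigquery_default',
    dag=dag)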

BigQueryOperator

class airflow.contrib.operators.bigquery_operator.BigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes BigQuery SQL queries in a specific BigQuery database
Parameters
• bql – (Deprecated. Use the sql parameter instead.) The sql code to be executed (templated). Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file; template references are recognized by a str ending in '.sql'.
• sql – The sql code to be executed (templated). Accepts the same forms as bql: a str, a list of str, or a reference to a '.sql' template file.
• destination_dataset_table (string) – A dotted
(<project>.|<project>:)<dataset>.<table> that, if set, will store the results of the query.
(templated)
• write_disposition (string) – Specifies the action that occurs if the destination
table already exists. (default: ‘WRITE_EMPTY’)
• create_disposition (string) – Specifies whether the job is allowed to create new
tables. (default: ‘CREATE_IF_NEEDED’)
• allow_large_results (boolean) – Whether to allow large results.
• flatten_results (boolean) – If true and query uses legacy SQL dialect, flattens all
nested and repeated fields in the query results. allow_large_results must be true
if this is set to false. For standard SQL queries, this flag is ignored and results are never
flattened.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.


• udf_config (list) – The User Defined Function configuration for the query. See https:
//cloud.google.com/bigquery/user-defined-functions for details.
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
• maximum_billing_tier (integer) – Positive integer that serves as a multiplier of
the basic price. Defaults to None, in which case it uses the value set in the project.
• maximum_bytes_billed (float) – Limits the bytes billed for this job. Queries that
will have bytes billed beyond this limit will fail (without incurring a charge). If unspecified,
this will be set to your project default.
• api_resource_configs (dict) – a dictionary that contain params ‘configuration’
applied for Google BigQuery Jobs API: https://cloud.google.com/bigquery/docs/reference/
rest/v2/jobs for example, {‘query’: {‘useQueryCache’: False}}. You could use it if you
need to provide some params that are not supported by BigQueryOperator like args.
• schema_update_options (tuple) – Allows the schema of the destination table to be
updated as a side effect of the load job.
• query_params (dict) – a dictionary containing query parameter types and values,
passed to BigQuery.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
• priority (string) – Specifies a priority for the query. Possible values include INTER-
ACTIVE and BATCH. The default value is INTERACTIVE.
• time_partitioning (dict) – configure optional time partitioning fields i.e. partition
by field, type and expiration as per API specifications.
• cluster_fields (list of str) – Request that the result of this query be stored
sorted by one or more columns. This is only available in conjunction with time_partitioning.
The order of columns given determines the sort order.
• location (str) – The geographic location of the job. Required except for US and EU.
See details at https://cloud.google.com/bigquery/docs/locations#specifying_your_location
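Example (an illustrative sketch; the SQL file path, destination table and connection ID are placeholders, and a value ending in '.sql' is rendered as a template file):

aggregate_sales = BigQueryOperator(
    task_id='aggregate_sales',
    sql='sql/aggregate_sales.sql',
    destination_dataset_table='my-project.reporting.daily_sales',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,
    bigquery_conn_id='bigquery_default',
    dag=dag)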

BigQueryToBigQueryOperator

class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies data from one BigQuery table to another.
See also:
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.copy

Parameters
• source_project_dataset_tables (list|string) – One or more dotted
(project:|project.)<dataset>.<table> BigQuery tables to use as the source data. If <project>
is not included, project will be the project defined in the connection json. Use a list if there
are multiple source tables. (templated)
• destination_project_dataset_table (string) – The destination BigQuery
table. Format is: (project:|project.)<dataset>.<table> (templated)


• write_disposition (string) – The write disposition if the table already exists.


• create_disposition (string) – The create disposition if the table doesn’t exist.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
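Example (an illustrative sketch; the source and destination tables are placeholders):

copy_to_backup = BigQueryToBigQueryOperator(
    task_id='copy_to_backup',
    source_project_dataset_tables='my-project.my_dataset.events',
    destination_project_dataset_table='my-project.backup.events',
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='bigquery_default',
    dag=dag)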

BigQueryToCloudStorageOperator

class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(**kwargs)
Bases: airflow.models.BaseOperator
Transfers a BigQuery table to a Google Cloud Storage bucket.
See also:
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs

Parameters
• source_project_dataset_table (string) – The dotted (<project>.
|<project>:)<dataset>.<table> BigQuery table to use as the source data. If
<project> is not included, project will be the project defined in the connection json. (tem-
plated)
• destination_cloud_storage_uris (list) – The destination Google Cloud Stor-
age URI (e.g. gs://some-bucket/some-file.txt). (templated) Follows convention defined here:
https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
• compression (string) – Type of compression to use.
• export_format (string) – File format to export.
• field_delimiter (string) – The delimiter to use when extracting to a CSV.
• print_header (boolean) – Whether to print a header for a CSV file extract.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
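Example (an illustrative sketch; the table and bucket names are placeholders):

export_events = BigQueryToCloudStorageOperator(
    task_id='export_events',
    source_project_dataset_table='my-project.my_dataset.events',
    destination_cloud_storage_uris=['gs://my-export-bucket/events/export-*.csv'],
    export_format='CSV',
    print_header=True,
    bigquery_conn_id='bigquery_default',
    dag=dag)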

BigQueryHook

class airflow.contrib.hooks.bigquery_hook.BigQueryHook(bigquery_conn_id=’bigquery_default’,
delegate_to=None,
use_legacy_sql=True,
location=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook, airflow.
hooks.dbapi_hook.DbApiHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with BigQuery. This hook uses the Google Cloud Platform connection.
get_conn()
Returns a BigQuery PEP 249 connection object.


get_pandas_df(sql, parameters=None, dialect=None)


Returns a Pandas DataFrame for the results produced by a BigQuery query. The DbApiHook method must
be overridden because Pandas doesn’t support PEP 249 connections, except for SQLite. See:
https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447
https://github.com/pydata/pandas/issues/6900
Parameters
• sql (string) – The BigQuery SQL to execute.
• parameters (mapping or iterable) – The parameters to render the SQL query
with (not used, leave to override superclass method)
• dialect (string in {'legacy', 'standard'}) – Dialect of BigQuery SQL
– legacy SQL or standard SQL defaults to use self.use_legacy_sql if not specified
get_service()
Returns a BigQuery service object.
insert_rows(table, rows, target_fields=None, commit_every=1000)
Insertion is currently unsupported. Theoretically, you could use BigQuery’s streaming API to insert rows
into a table, but this hasn’t been implemented.
table_exists(project_id, dataset_id, table_id)
Checks for the existence of a table in Google BigQuery.
Parameters
• project_id (string) – The Google cloud project in which to look for the table. The
connection supplied to the hook must provide access to the specified project.
• dataset_id (string) – The name of the dataset in which to look for the table.
• table_id (string) – The name of the table to check the existence of.
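Example of using the hook directly, for instance inside a PythonOperator callable (an illustrative sketch; project, dataset and connection names are placeholders):

from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def summarize_events(**context):
    hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False)
    if hook.table_exists(project_id='my-project',
                         dataset_id='my_dataset',
                         table_id='events'):
        df = hook.get_pandas_df(
            'SELECT event_type, COUNT(*) AS cnt '
            'FROM `my-project.my_dataset.events` GROUP BY event_type')
        return df.to_dict()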

3.16.5.4 Cloud Spanner

Cloud Spanner Operators

• CloudSpannerInstanceDatabaseDeleteOperator : deletes an existing database from a Google Cloud Spanner instance or returns success if the database is missing.
• CloudSpannerInstanceDatabaseDeployOperator : creates a new database in a Google Cloud Spanner instance or returns success if the database already exists.
• CloudSpannerInstanceDatabaseUpdateOperator : updates the structure of a Google Cloud Spanner database.
• CloudSpannerInstanceDatabaseQueryOperator : executes an arbitrary DML query (INSERT, UPDATE,
DELETE).
• CloudSpannerInstanceDeployOperator : creates a new Google Cloud Spanner instance, or if an instance with
the same name exists, updates the instance.
• CloudSpannerInstanceDeleteOperator : deletes a Google Cloud Spanner instance.


CloudSpannerInstanceDatabaseDeleteOperator

CloudSpannerInstanceDatabaseDeployOperator

CloudSpannerInstanceDatabaseUpdateOperator

CloudSpannerInstanceDatabaseQueryOperator

CloudSpannerInstanceDeployOperator

CloudSpannerInstanceDeleteOperator

CloudSpannerHook

3.16.5.5 Cloud SQL

Cloud SQL Operators

• CloudSqlInstanceDatabaseDeleteOperator : deletes a database from a Cloud SQL instance.


• CloudSqlInstanceDatabaseCreateOperator : creates a new database inside a Cloud SQL instance.
• CloudSqlInstanceDatabasePatchOperator : updates a database inside a Cloud SQL instance.
• CloudSqlInstanceDeleteOperator : deletes a Cloud SQL instance.
• CloudSqlInstanceExportOperator : exports data from a Cloud SQL instance.
• CloudSqlInstanceImportOperator : imports data into a Cloud SQL instance.
• CloudSqlInstanceCreateOperator : creates a new Cloud SQL instance.
• CloudSqlInstancePatchOperator : patches a Cloud SQL instance.
• CloudSqlQueryOperator : runs a query in a Cloud SQL instance.


CloudSqlInstanceDatabaseDeleteOperator

CloudSqlInstanceDatabaseCreateOperator

CloudSqlInstanceDatabasePatchOperator

CloudSqlInstanceDeleteOperator

CloudSqlInstanceExportOperator

CloudSqlInstanceImportOperator

CloudSqlInstanceCreateOperator

CloudSqlInstancePatchOperator

CloudSqlQueryOperator

Cloud SQL Hooks

3.16.5.6 Cloud Bigtable

Cloud Bigtable Operators

• BigtableInstanceCreateOperator : creates a Cloud Bigtable instance.


• BigtableInstanceDeleteOperator : deletes a Google Cloud Bigtable instance.
• BigtableClusterUpdateOperator : updates the number of nodes in a Google Cloud Bigtable cluster.
• BigtableTableCreateOperator : creates a table in a Google Cloud Bigtable instance.
• BigtableTableDeleteOperator : deletes a table in a Google Cloud Bigtable instance.
• BigtableTableWaitForReplicationSensor : (sensor) waits for a table to be fully replicated.


BigtableInstanceCreateOperator

BigtableInstanceDeleteOperator

BigtableClusterUpdateOperator

BigtableTableCreateOperator

BigtableTableDeleteOperator

BigtableTableWaitForReplicationSensor

Cloud Bigtable Hook

3.16.5.7 Compute Engine

Compute Engine Operators

• GceInstanceStartOperator : start an existing Google Compute Engine instance.


• GceInstanceStopOperator : stop an existing Google Compute Engine instance.
• GceSetMachineTypeOperator : change the machine type for a stopped instance.
• GceInstanceTemplateCopyOperator : copy the Instance Template, applying specified changes.
• GceInstanceGroupManagerUpdateTemplateOperator : patch the Instance Group Manager, replacing source
Instance Template URL with the destination one.
All of these operators derive from a common base operator:
class airflow.contrib.operators.gcp_compute_operator.GceBaseOperator(**kwargs)
Bases: airflow.models.BaseOperator
Abstract base operator for Google Compute Engine operators to inherit from.
They also use Compute Engine Hook to communicate with Google Cloud Platform.

GceInstanceStartOperator

class airflow.contrib.operators.gcp_compute_operator.GceInstanceStartOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Starts an instance in Google Compute Engine.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists.
• resource_id (str) – Name of the Compute Engine instance resource.
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.


• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body – Optional, If set to False, body validation is not performed. Defaults to
False.
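Example (an illustrative sketch; project, zone and instance names are placeholders):

gce_instance_start = GceInstanceStartOperator(
    task_id='gce_instance_start',
    project_id='my-gcp-project',
    zone='europe-west1-b',
    resource_id='my-instance',
    gcp_conn_id='google_cloud_default',
    dag=dag)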

GceInstanceStopOperator

class airflow.contrib.operators.gcp_compute_operator.GceInstanceStopOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Stops an instance in Google Compute Engine.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists.
• resource_id (str) – Name of the Compute Engine instance resource.
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body – Optional, If set to False, body validation is not performed. Defaults to
False.

GceSetMachineTypeOperator

class airflow.contrib.operators.gcp_compute_operator.GceSetMachineTypeOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Changes the machine type for a stopped instance to the machine type specified in the request.

Parameters
• zone (str) – Google Cloud Platform zone where the instance exists.
• resource_id (str) – Name of the Compute Engine instance resource.
• body (dict) – Body required by the Compute Engine setMachineType API, as described
in https://cloud.google.com/compute/docs/reference/rest/v1/instances/setMachineType#
request-body
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body (bool) – Optional, If set to False, body validation is not performed.
Defaults to False.
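Example (an illustrative sketch; the body shape follows the setMachineType request body linked above, and all names are placeholders):

gce_set_machine_type = GceSetMachineTypeOperator(
    task_id='gce_set_machine_type',
    project_id='my-gcp-project',
    zone='europe-west1-b',
    resource_id='my-instance',
    body={'machineType': 'zones/europe-west1-b/machineTypes/n1-standard-4'},
    dag=dag)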


GceInstanceTemplateCopyOperator

class airflow.contrib.operators.gcp_compute_operator.GceInstanceTemplateCopyOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Copies the instance template, applying specified changes.
Parameters
• resource_id (str) – Name of the Instance Template
• body_patch (dict) – Patch to the body of instanceTemplates object following rfc7386
PATCH semantics. The body_patch content follows https://cloud.google.com/compute/
docs/reference/rest/v1/instanceTemplates Name field is required as we need to rename the
template, all the other fields are optional. It is important to follow PATCH semantics - ar-
rays are replaced fully, so if you need to update an array you should provide the whole target
array as patch element.
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• request_id (str) – Optional, unique request_id that you might add to achieve full idem-
potence (for example when client call times out repeating the request with the same request
id will not create a new instance template again). It should be in UUID format as defined in
RFC 4122.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body (bool) – Optional, If set to False, body validation is not performed.
Defaults to False.

GceInstanceGroupManagerUpdateTemplateOperator

class airflow.contrib.operators.gcp_compute_operator.GceInstanceGroupManagerUpdateTemplateOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Patches the Instance Group Manager, replacing source template URL with the destination one. API V1 does not
have update/patch operations for Instance Group Manager, so you must use beta or newer API version. Beta is
the default.
Parameters
• resource_id (str) – Name of the Instance Group Manager
• zone (str) – Google Cloud Platform zone where the Instance Group Manager exists.
• source_template (str) – URL of the template to replace.
• destination_template (str) – URL of the target template.
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• request_id (str) – Optional, unique request_id that you might add to achieve full idem-
potence (for example when client call times out repeating the request with the same request


id will not create a new instance template again). It should be in UUID format as defined in
RFC 4122.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body (bool) – Optional, If set to False, body validation is not performed.
Defaults to False.

Compute Engine Hook

class airflow.contrib.hooks.gcp_compute_hook.GceHook(api_version=’v1’,
gcp_conn_id=’google_cloud_default’,
delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for Google Compute Engine APIs.
All the methods in the hook where project_id is used must be called with keyword arguments rather than posi-
tional.
get_conn()
Retrieves connection to Google Compute Engine.
Returns Google Compute Engine services object
Return type dict
get_instance_group_manager(*args, **kwargs)
Retrieves Instance Group Manager by project_id, zone and resource_id. Must be called with keyword
arguments rather than positional.
Parameters
• zone (str) – Google Cloud Platform zone where the Instance Group Manager exists
• resource_id (str) – Name of the Instance Group Manager
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns Instance group manager representation as object according to https://cloud.google.com/
compute/docs/reference/rest/beta/instanceGroupManagers
Return type dict
get_instance_template(*args, **kwargs)
Retrieves instance template by project_id and resource_id. Must be called with keyword arguments rather
than positional.
Parameters
• resource_id (str) – Name of the instance template
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.


Returns Instance template representation as object according to https://cloud.google.com/compute/docs/reference/rest/v1/instanceTemplates
Return type dict
insert_instance_template(*args, **kwargs)
Inserts instance template using body specified Must be called with keyword arguments rather than posi-
tional.
Parameters
• body (dict) – Instance template representation as object according to https://cloud.
google.com/compute/docs/reference/rest/v1/instanceTemplates
• request_id (str) – Optional, unique request_id that you might add to achieve full
idempotence (for example when client call times out repeating the request with the same
request id will not create a new instance template again) It should be in UUID format as
defined in RFC 4122
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns None
patch_instance_group_manager(*args, **kwargs)
Patches Instance Group Manager with the specified body. Must be called with keyword arguments rather
than positional.
Parameters
• zone (str) – Google Cloud Platform zone where the Instance Group Manager exists
• resource_id (str) – Name of the Instance Group Manager
• body (dict) – Instance Group Manager representation as json-merge-patch object ac-
cording to https://cloud.google.com/compute/docs/reference/rest/beta/instanceTemplates/
patch
• request_id (str) – Optional, unique request_id that you might add to achieve full
idempotence (for example when client call times out repeating the request with the same
request id will not create a new instance template again). It should be in UUID format as
defined in RFC 4122
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns None
set_machine_type(*args, **kwargs)
Sets machine type of an instance defined by project_id, zone and resource_id. Must be called with keyword
arguments rather than positional.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists.
• resource_id (str) – Name of the Compute Engine instance resource
• body (dict) – Body required by the Compute Engine setMachineType API, as described
in https://cloud.google.com/compute/docs/reference/rest/v1/instances/setMachineType


• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns None
start_instance(*args, **kwargs)
Starts an existing instance defined by project_id, zone and resource_id. Must be called with keyword
arguments rather than positional.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists
• resource_id (str) – Name of the Compute Engine instance resource
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns None
stop_instance(*args, **kwargs)
Stops an instance defined by project_id, zone and resource_id Must be called with keyword arguments
rather than positional.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists
• resource_id (str) – Name of the Compute Engine instance resource
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns None

3.16.5.8 Cloud Functions

Cloud Functions Operators

• GcfFunctionDeployOperator : deploy Google Cloud Function to Google Cloud Platform


• GcfFunctionDeleteOperator : delete Google Cloud Function in Google Cloud Platform
They also use Cloud Functions Hook to communicate with Google Cloud Platform.

GcfFunctionDeployOperator

class airflow.contrib.operators.gcp_function_operator.GcfFunctionDeployOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a function in Google Cloud Functions. If a function with this name already exists, it will be updated.
Parameters
• location (str) – Google Cloud Platform region where the function should be created.


• body (dict or google.cloud.functions.v1.CloudFunction) – Body of


the Cloud Functions definition. The body must be a Cloud Functions dictionary as described
in: https://cloud.google.com/functions/docs/reference/rest/v1/projects.locations.functions .
Different API versions require different variants of the Cloud Functions dictionary.
• project_id (str) – (Optional) Google Cloud Platform project ID where the function
should be created.
• gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud
Platform - default ‘google_cloud_default’.
• api_version (str) – (Optional) API version used (for example v1 - default - or
v1beta1).
• zip_path (str) – Path to zip file containing source code of the function. If the path is
set, the sourceUploadUrl should not be specified in the body or it should be empty. Then
the zip file will be uploaded using the upload URL generated via generateUploadUrl from
the Cloud Functions API.
• validate_body (bool) – If set to False, body validation is not performed.
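Example (an illustrative sketch; the body fields follow the CloudFunction resource linked above, and the function name, runtime and zip path are placeholders):

deploy_function = GcfFunctionDeployOperator(
    task_id='gcf_deploy',
    location='europe-west1',
    project_id='my-gcp-project',
    body={
        'name': 'projects/my-gcp-project/locations/europe-west1/functions/hello_world',
        'entryPoint': 'hello_world',
        'runtime': 'python37',
        'httpsTrigger': {},
    },
    zip_path='/files/hello_world.zip',   # sourceUploadUrl is then left out of the body
    dag=dag)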

GcfFunctionDeleteOperator

class airflow.contrib.operators.gcp_function_operator.GcfFunctionDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Deletes the specified function from Google Cloud Functions.
Parameters
• name (str) – A fully-qualified function name, matching the pattern:
^projects/[^/]+/locations/[^/]+/functions/[^/]+$
• gcp_conn_id (str) – The connection ID to use to connect to Google Cloud Platform.
• api_version (str) – API version used (for example v1 or v1beta1).

Cloud Functions Hook

class airflow.contrib.hooks.gcp_function_hook.GcfHook(api_version,
gcp_conn_id=’google_cloud_default’,
delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for the Google Cloud Functions APIs.
create_new_function(*args, **kwargs)
Creates a new function in Cloud Function in the location specified in the body.
Parameters
• location (str) – The location of the function.
• body (dict) – The body required by the Cloud Functions insert API.
• project_id (str) – Optional, Google Cloud Project project_id where the function
belongs. If set to None or missing, the default project_id from the GCP connection is
used.
Returns None


delete_function(name)
Deletes the specified Cloud Function.
Parameters name (str) – The name of the function.
Returns None
get_conn()
Retrieves the connection to Cloud Functions.
Returns Google Cloud Function services object.
Return type dict
get_function(name)
Returns the Cloud Function with the given name.
Parameters name (str) – Name of the function.
Returns A Cloud Functions object representing the function.
Return type dict
update_function(name, body, update_mask)
Updates Cloud Functions according to the specified update mask.
Parameters
• name (str) – The name of the function.
• body (dict) – The body required by the cloud function patch API.
• update_mask ([str]) – The update mask - array of fields that should be patched.
Returns None
upload_function_zip(*args, **kwargs)
Uploads zip file with sources.
Parameters
• location (str) – The location where the function is created.
• zip_path (str) – The path of the valid .zip file to upload.
• project_id (str) – Optional, Google Cloud Project project_id where the function
belongs. If set to None or missing, the default project_id from the GCP connection is
used.
Returns The upload URL that was returned by generateUploadUrl method.

3.16.5.9 Cloud DataFlow

DataFlow Operators

• DataFlowJavaOperator : launching Cloud Dataflow jobs written in Java.


• DataflowTemplateOperator : launching a templated Cloud DataFlow batch job.
• DataFlowPythonOperator : launching Cloud Dataflow jobs written in python.


DataFlowJavaOperator

class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Java Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
See also:
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params

Parameters
• jar (string) – The reference to a self executing DataFlow jar.
• dataflow_default_options (dict) – Map of default job options.
• options (dict) – Map of job specific options.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
• job_class (string) – The name of the dataflow job class to be executed; it is often not the main class configured in the dataflow jar file.

Both jar and options are templated so you can use variables in them.
Note that both dataflow_default_options and options will be merged to specify pipeline execution parameters, and dataflow_default_options is expected to hold high-level options, for instance, project and zone information, which apply to all dataflow operators in the DAG.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and
staging location.

default_args = {
'dataflow_default_options': {
'project': 'my-gcp-project',
'zone': 'europe-west1-d',
'stagingLocation': 'gs://my-staging-bucket/staging/'
}
}

You need to pass the path to your dataflow jar as a file reference with the jar parameter; the jar needs to be a self-executing jar (see the documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use options to pass on options to your job.

t1 = DataFlowJavaOperator(
task_id='datapflow_example',
jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
options={
'autoscalingAlgorithm': 'BASIC',
'maxNumWorkers': '50',
'start': '{{ds}}',
'partitionType': 'DAY',
'labels': {'foo' : 'bar'}
},
gcp_conn_id='gcp-airflow-service-account',
dag=my_dag)

default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2016, 8, 1),
'email': ['[email protected]'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=30),
'dataflow_default_options': {
'project': 'my-gcp-project',
'zone': 'us-central1-f',
'stagingLocation': 'gs://bucket/tmp/dataflow/staging/',
}
}

dag = DAG('test-dag', default_args=default_args)

task = DataFlowJavaOperator(
gcp_conn_id='gcp_default',
task_id='normalize-cal',
jar='{{var.value.gcp_dataflow_base}}pipeline-ingress-cal-normalize-1.0.jar',
options={
'autoscalingAlgorithm': 'BASIC',
'maxNumWorkers': '50',
'start': '{{ds}}',
'partitionType': 'DAY'

},
dag=dag)

DataflowTemplateOperator

class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Templated Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
Parameters
• template (string) – The reference to the DataFlow template.
• dataflow_default_options (dict) – Map of default job environment options.
• parameters (dict) – Map of job specific parameters for the template.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.


• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and
staging location.
See also:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment

default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'tempLocation': 'gs://my-staging-bucket/staging/'
    }
}

You need to pass the path to your dataflow template as a file reference with the template parameter. Use
parameters to pass on parameters to your job. Use environment to pass on runtime environment variables
to your job.

t1 = DataflowTemplateOperator(
task_id='datapflow_example',
template='{{var.value.gcp_dataflow_base}}',
parameters={
'inputFile': "gs://bucket/input/my_input.txt",
'outputFile': "gs://bucket/output/my_output.txt"
},
gcp_conn_id='gcp-airflow-service-account',
dag=my_dag)

template, dataflow_default_options and parameters are templated so you can use variables in
them.
Note that dataflow_default_options is expected to save high-level options for project information,
which apply to all dataflow operators in the DAG.
See also:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/templates/executing-templates

DataFlowPythonOperator

class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(**kwargs)
Bases: airflow.models.BaseOperator
Launches Cloud Dataflow jobs written in Python. Note that both dataflow_default_options and options will be merged to specify pipeline execution parameters, and dataflow_default_options is expected to hold high-level options, for instance, project and zone information, which apply to all dataflow operators in the DAG.


See also:
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params

Parameters
• py_file (string) – Reference to the python dataflow pipeline file.py, e.g., /some/local/file/path/to/your/python/pipeline/file.
• py_options – Additional python options.
• dataflow_default_options (dict) – Map of default job options.
• options (dict) – Map of job specific options.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.

execute(context)
Execute the python dataflow job.
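Example (an illustrative sketch mirroring the Java example above; the pipeline file, buckets and project are placeholders, and the option keys are passed on to the pipeline as --key=value arguments):

dataflow_python = DataFlowPythonOperator(
    task_id='dataflow_python_example',
    py_file='/opt/airflow/dags/pipelines/wordcount.py',
    options={
        'output': 'gs://my-bucket/wordcount/output',
    },
    dataflow_default_options={
        'project': 'my-gcp-project',
        'staging_location': 'gs://my-staging-bucket/staging/',
        'temp_location': 'gs://my-staging-bucket/temp/',
    },
    gcp_conn_id='google_cloud_default',
    dag=dag)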

DataFlowHook

class airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook(gcp_conn_id=’google_cloud_default’,
delegate_to=None,
poll_sleep=10)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
get_conn()
Returns a Google Cloud Dataflow service object.

3.16.5.10 Cloud DataProc

DataProc Operators

• DataprocClusterCreateOperator : Create a new cluster on Google Cloud Dataproc.


• DataprocClusterDeleteOperator : Delete a cluster on Google Cloud Dataproc.
• DataprocClusterScaleOperator : Scale up or down a cluster on Google Cloud Dataproc.
• DataProcPigOperator : Start a Pig query Job on a Cloud DataProc cluster.
• DataProcHiveOperator : Start a Hive query Job on a Cloud DataProc cluster.
• DataProcSparkSqlOperator : Start a Spark SQL query Job on a Cloud DataProc cluster.
• DataProcSparkOperator : Start a Spark Job on a Cloud DataProc cluster.
• DataProcHadoopOperator : Start a Hadoop Job on a Cloud DataProc cluster.
• DataProcPySparkOperator : Start a PySpark Job on a Cloud DataProc cluster.
• DataprocWorkflowTemplateInstantiateOperator : Instantiate a WorkflowTemplate on Google Cloud Dataproc.


• DataprocWorkflowTemplateInstantiateInlineOperator : Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc.

DataprocClusterCreateOperator

class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(**kwargs)
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an
error occurs in the creation process.
The parameters allow you to configure the cluster. Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
for a detailed explanation on the different parameters. Most of the configuration parameters detailed in the link
are available as a parameter to this operator.
Parameters
• cluster_name (string) – The name of the DataProc cluster to create. (templated)
• project_id (str) – The ID of the google cloud project in which to create the cluster.
(templated)
• num_workers (int) – The # of workers to spin up. If set to zero will spin up cluster in a
single node mode
• storage_bucket (string) – The storage bucket to use, setting to None lets dataproc
generate a custom one for you
• init_actions_uris (list[string]) – List of GCS uri’s containing dataproc ini-
tialization scripts
• init_action_timeout (string) – Amount of time executable scripts in
init_actions_uris has to complete
• metadata (dict) – dict of key-value google compute engine metadata entries to add to
all instances
• image_version (string) – the version of software inside the Dataproc cluster
• custom_image (string) – custom Dataproc image, for more info see https://cloud.google.com/dataproc/docs/guides/dataproc-images
• properties (dict) – dict of properties to set on config files (e.g. spark-defaults.conf),
see https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#
SoftwareConfig
• master_machine_type (string) – Compute engine machine type to use for the mas-
ter node
• master_disk_type (string) – Type of the boot disk for the master node (de-
fault is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or
pd-standard (Persistent Disk Hard Disk Drive).
• master_disk_size (int) – Disk size for the master node
• worker_machine_type (string) – Compute engine machine type to use for the
worker nodes


• worker_disk_type (string) – Type of the boot disk for the worker node (de-
fault is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or
pd-standard (Persistent Disk Hard Disk Drive).
• worker_disk_size (int) – Disk size for the worker nodes
• num_preemptible_workers (int) – The # of preemptible worker nodes to spin up
• labels (dict) – dict of labels to add to the cluster
• zone (string) – The zone where the cluster will be located. (templated)
• network_uri (string) – The network uri to be used for machine communication, can-
not be specified with subnetwork_uri
• subnetwork_uri (string) – The subnetwork uri to be used for machine communica-
tion, cannot be specified with network_uri
• internal_ip_only (bool) – If true, all instances in the cluster will only have internal
IP addresses. This can only be enabled for subnetwork enabled networks
• tags (list[string]) – The GCE tags to add to all instances
• region – leave as ‘global’, might become relevant in the future. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• service_account (string) – The service account of the dataproc instances.
• service_account_scopes (list[string]) – The URIs of service account scopes
to be included.
• idle_delete_ttl (int) – The longest duration that cluster would keep alive while
staying idle. Passing this threshold will cause cluster to be auto-deleted. A duration in
seconds.
• auto_delete_time (datetime.datetime) – The time when cluster will be auto-
deleted.
• auto_delete_ttl (int) – The life duration of cluster, the cluster will be auto-deleted
at the end of this duration. A duration in seconds. (If auto_delete_time is set this parameter
will be ignored)
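Example (an illustrative sketch; project, zone, machine types and the templated cluster name are placeholders):

create_cluster = DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    project_id='my-gcp-project',
    cluster_name='analytics-cluster-{{ ds_nodash }}',
    num_workers=2,
    zone='europe-west1-b',
    master_machine_type='n1-standard-2',
    worker_machine_type='n1-standard-2',
    region='global',
    dag=dag)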

DataprocClusterScaleOperator

class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(**kwargs)
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
Example:

t1 = DataprocClusterScaleOperator(
task_id='dataproc_scale',
project_id='my-project',
cluster_name='cluster-1',
num_workers=10,
num_preemptible_workers=10,
graceful_decommission_timeout='1h',
dag=dag)

See also:
For more detail about scaling clusters have a look at the reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters

Parameters
• cluster_name (string) – The name of the cluster to scale. (templated)
• project_id (string) – The ID of the google cloud project in which the cluster runs.
(templated)
• region (string) – The region for the dataproc cluster. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• num_workers (int) – The new number of workers
• num_preemptible_workers (int) – The new number of preemptible workers
• graceful_decommission_timeout (string) – Timeout for graceful YARN decommissioning. Maximum value is 1d
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.

DataprocClusterDeleteOperator

class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
Parameters
• cluster_name (string) – The name of the cluster to delete. (templated)
• project_id (string) – The ID of the google cloud project in which the cluster runs.
(templated)
• region (string) – leave as ‘global’, might become relevant in the future. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
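Example (an illustrative sketch; the names are placeholders, and trigger_rule is the standard BaseOperator argument, used here so the cluster is removed even when upstream jobs fail):

delete_cluster = DataprocClusterDeleteOperator(
    task_id='delete_dataproc_cluster',
    project_id='my-gcp-project',
    cluster_name='analytics-cluster-{{ ds_nodash }}',
    region='global',
    trigger_rule='all_done',   # tear the cluster down even if upstream tasks failed
    dag=dag)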

DataProcPigOperator

class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.


It’s a good practice to define dataproc_* parameters in the default_args of the dag like the cluster name and
UDFs.

default_args = {
'cluster_name': 'cluster-1',
'dataproc_pig_jars': [
'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
'gs://example/udf/jar/gpig/1.2/gpig.jar'
]
}

You can pass a pig script as string or file reference. Use variables to pass on variables for the pig script to be
resolved on the cluster or use the parameters to be resolved in the script as template parameters.
Example:

t1 = DataProcPigOperator(
task_id='dataproc_pig',
query='a_pig_script.pig',
variables={'out': 'gs://example/output/{{ds}}'},
dag=dag)

See also:
For more detail about job submission have a look at the reference: https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs

Parameters
• query (string) – The query or reference to the query file (pg or pig extension). (tem-
plated)
• query_uri (string) – The uri of a pig script on Cloud Storage.
• variables (dict) – Map of named parameters for the query. (templated)
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_pig_properties (dict) – Map for the Pig properties. Ideal to put in
default arguments
• dataproc_pig_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].


Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.

DataProcHiveOperator

class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
Parameters
• query (string) – The query or reference to the query file (q extension).
• query_uri (string) – The uri of a hive script on Cloud Storage.
• variables (dict) – Map of named parameters for the query.
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes.
• cluster_name (string) – The name of the DataProc cluster.
• dataproc_hive_properties (dict) – Map for the Hive properties. Ideal to put in default arguments
• dataproc_hive_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
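Example: a minimal usage sketch (not taken from the upstream documentation). It assumes a dag object and an existing Dataproc cluster; the script path, cluster name and region below are hypothetical placeholders.

from airflow.contrib.operators.dataproc_operator import DataProcHiveOperator

hive_task = DataProcHiveOperator(
    task_id='dataproc_hive',
    query='a_hive_script.q',
    variables={'out': 'gs://example/output/{{ds}}'},
    cluster_name='cluster-1',
    region='us-central1',
    dag=dag)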

DataProcSparkSqlOperator

class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
Parameters
• query (string) – The query or reference to the query file (q extension). (templated)
• query_uri (string) – The uri of a spark sql script on Cloud Storage.

• variables (dict) – Map of named parameters for the query. (templated)
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_spark_properties (dict) – Map for the Spark properties. Ideal to put in
default arguments
• dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
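Example: a minimal usage sketch, analogous to the Pig example above and not taken from the upstream documentation; the dag object, cluster name and script path are assumed placeholders.

from airflow.contrib.operators.dataproc_operator import DataProcSparkSqlOperator

spark_sql_task = DataProcSparkSqlOperator(
    task_id='dataproc_spark_sql',
    query='a_spark_sql_script.q',
    variables={'out': 'gs://example/output/{{ds}}'},
    cluster_name='cluster-1',
    dag=dag)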

DataProcSparkOperator

class airflow.contrib.operators.dataproc_operator.DataProcSparkOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Spark Job on a Cloud DataProc cluster.
Parameters
• main_jar (string) – URI of the job jar provisioned on Cloud Storage. (use this or the
main_class, not both together).
• main_class (string) – Name of the job class. (use this or the main_jar, not both
together).
• arguments (list) – Arguments for the job. (templated)
• archives (list) – List of archived files that will be unpacked in the work directory.
Should be stored in Cloud Storage.
• files (list) – List of files to be copied to the working directory
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_spark_properties (dict) – Map for the Spark properties. Ideal to put in
default arguments

• dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
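Example: a minimal usage sketch (illustrative only). It assumes a dag object, an existing cluster, and a job jar already uploaded to Cloud Storage; the class name and jar URI are hypothetical.

from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

spark_task = DataProcSparkOperator(
    task_id='dataproc_spark',
    main_class='org.example.SparkJob',  # hypothetical job class
    dataproc_spark_jars=['gs://example/jars/spark-job.jar'],  # hypothetical jar URI
    arguments=['--date={{ds}}'],
    cluster_name='cluster-1',
    dag=dag)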

DataProcHadoopOperator

class airflow.contrib.operators.dataproc_operator.DataProcHadoopOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Hadoop Job on a Cloud DataProc cluster.
Parameters
• main_jar (string) – URI of the job jar provisioned on Cloud Storage. (use this or the
main_class, not both together).
• main_class (string) – Name of the job class. (use this or the main_jar, not both
together).
• arguments (list) – Arguments for the job. (templated)
• archives (list) – List of archived files that will be unpacked in the work directory.
Should be stored in Cloud Storage.
• files (list) – List of files to be copied to the working directory
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_hadoop_properties (dict) – Map for the Hadoop properties. Ideal to put in
default arguments
• dataproc_hadoop_jars (list) – URIs to jars provisioned in Cloud Storage (exam-
ple: for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.

• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
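Example: a minimal usage sketch (illustrative only); the jar URI, input and output paths, and cluster name are hypothetical placeholders, and a dag object is assumed.

from airflow.contrib.operators.dataproc_operator import DataProcHadoopOperator

hadoop_task = DataProcHadoopOperator(
    task_id='dataproc_hadoop',
    main_jar='gs://example/jars/wordcount.jar',  # hypothetical job jar
    arguments=['gs://example/input/', 'gs://example/output/{{ds}}'],
    cluster_name='cluster-1',
    dag=dag)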

DataProcPySparkOperator

class airflow.contrib.operators.dataproc_operator.DataProcPySparkOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a PySpark Job on a Cloud DataProc cluster.
Parameters
• main (string) – [Required] The Hadoop Compatible Filesystem (HCFS) URI of the
main Python file to use as the driver. Must be a .py file.
• arguments (list) – Arguments for the job. (templated)
• archives (list) – List of archived files that will be unpacked in the work directory.
Should be stored in Cloud Storage.
• files (list) – List of files to be copied to the working directory
• pyfiles (list) – List of Python files to pass to the PySpark framework. Supported file
types: .py, .egg, and .zip
• job_name (string) – The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution date, but can be templated. The name will always
be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster.
• dataproc_pyspark_properties (dict) – Map for the PySpark properties. Ideal to put
in default arguments
• dataproc_pyspark_jars (list) – URIs to jars provisioned in Cloud Storage (exam-
ple: for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
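Example: a minimal usage sketch (illustrative only); the driver file URI and cluster name are hypothetical, and a dag object is assumed.

from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

pyspark_task = DataProcPySparkOperator(
    task_id='dataproc_pyspark',
    main='gs://example/pyspark/job.py',  # hypothetical .py driver file
    arguments=['--date={{ds}}'],
    cluster_name='cluster-1',
    dag=dag)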

DataprocWorkflowTemplateInstantiateOperator

class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateOperator(**kwargs)
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate
is finished executing.
See also:
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiate

Parameters
• template_id (string) – The id of the template. (templated)
• project_id (string) – The ID of the google cloud project in which the template runs
• region (string) – leave as ‘global’, might become relevant in the future
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
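Example: a minimal usage sketch (illustrative only); the template id and project id are hypothetical, and a dag object is assumed.

from airflow.contrib.operators.dataproc_operator import (
    DataprocWorkflowTemplateInstantiateOperator)

instantiate_template = DataprocWorkflowTemplateInstantiateOperator(
    task_id='instantiate_workflow',
    template_id='my-workflow-template',  # hypothetical template id
    project_id='my-gcp-project',
    region='global',
    dag=dag)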

DataprocWorkflowTemplateInstantiateInlineOperator

class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateInlineOperator(**kwargs)
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. The operator will wait until the Work-
flowTemplate is finished executing.
See also:
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiateInline

Parameters
• template (map) – The template contents. (templated)
• project_id (string) – The ID of the google cloud project in which the template runs
• region (string) – leave as ‘global’, might become relevant in the future
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.

3.16.5.11 Cloud Datastore

Datastore Operators

• DatastoreExportOperator : Export entities from Google Cloud Datastore to Cloud Storage.

• DatastoreImportOperator : Import entities from Cloud Storage to Google Cloud Datastore.

DatastoreExportOperator

class airflow.contrib.operators.datastore_export_operator.DatastoreExportOperator(**kwargs)
Bases: airflow.models.BaseOperator
Export entities from Google Cloud Datastore to Cloud Storage
Parameters
• bucket (string) – name of the cloud storage bucket to backup data
• namespace (str) – optional namespace path in the specified Cloud Storage bucket to
backup data. If this namespace does not exist in GCS, it will be created.
• datastore_conn_id (string) – the name of the Datastore connection id to use
• cloud_storage_conn_id (string) – the name of the cloud storage connection id to
force-write backup
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• entity_filter (dict) – description of what data from the project is included in
the export, refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
• labels (dict) – client-assigned labels for cloud storage
• polling_interval_in_seconds (int) – number of seconds to wait before polling
for execution status again
• overwrite_existing (bool) – if the storage bucket + namespace is not empty, it will
be emptied prior to exports. This enables overwriting existing backups.
• xcom_push (bool) – push operation name to xcom for reference
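Example: a minimal usage sketch (illustrative only); the bucket and namespace are hypothetical, and a dag object is assumed.

from airflow.contrib.operators.datastore_export_operator import DatastoreExportOperator

export_entities = DatastoreExportOperator(
    task_id='datastore_export',
    bucket='my-backup-bucket',  # hypothetical bucket
    namespace='backups/{{ds}}',
    overwrite_existing=True,
    dag=dag)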

DatastoreImportOperator

class airflow.contrib.operators.datastore_import_operator.DatastoreImportOperator(**kwargs)
Bases: airflow.models.BaseOperator
Import entities from Cloud Storage to Google Cloud Datastore
Parameters
• bucket (string) – container in Cloud Storage to store data
• file (string) – path of the backup metadata file in the specified Cloud Storage bucket.
It should have the extension .overall_export_metadata
• namespace (str) – optional namespace of the backup metadata file in the specified Cloud
Storage bucket.
• entity_filter (dict) – description of what data from the project is included in
the export, refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
• labels (dict) – client-assigned labels for cloud storage
• datastore_conn_id (string) – the name of the connection id to use

• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• polling_interval_in_seconds (int) – number of seconds to wait before polling
for execution status again
• xcom_push (bool) – push operation name to xcom for reference
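Example: a minimal usage sketch (illustrative only); the bucket and backup metadata file path are hypothetical, and a dag object is assumed.

from airflow.contrib.operators.datastore_import_operator import DatastoreImportOperator

import_entities = DatastoreImportOperator(
    task_id='datastore_import',
    bucket='my-backup-bucket',  # hypothetical bucket
    file='backups/2019-01-01/export.overall_export_metadata',  # hypothetical metadata file
    dag=dag)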

DatastoreHook

class airflow.contrib.hooks.datastore_hook.DatastoreHook(datastore_conn_id='google_cloud_datastore_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Datastore. This hook uses the Google Cloud Platform connection.
This object is not thread safe. If you want to make multiple requests simultaneously, you will need to create a
hook per thread.
allocate_ids(partialKeys)
Allocate IDs for incomplete keys. See https://cloud.google.com/datastore/docs/reference/rest/v1/projects/allocateIds
Parameters partialKeys – a list of partial keys
Returns a list of full keys.
begin_transaction()
Get a new transaction handle
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/beginTransaction

Returns a transaction handle

commit(body)
Commit a transaction, optionally creating, deleting or modifying some entities.
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/commit

Parameters body – the body of the commit request


Returns the response body of the commit request

delete_operation(name)
Deletes the long-running operation
Parameters name – the name of the operation resource
export_to_storage_bucket(bucket, namespace=None, entity_filter=None, labels=None)
Export entities from Cloud Datastore to Cloud Storage for backup
get_conn(version=’v1’)
Returns a Google Cloud Datastore service object.
get_operation(name)
Gets the latest state of a long-running operation
Parameters name – the name of the operation resource

import_from_storage_bucket(bucket, file, namespace=None, entity_filter=None, labels=None)
Import a backup from Cloud Storage to Cloud Datastore
lookup(keys, read_consistency=None, transaction=None)
Lookup some entities by key
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/lookup

Parameters
• keys – the keys to lookup
• read_consistency – the read consistency to use. default, strong or eventual. Cannot
be used with a transaction.
• transaction – the transaction to use, if any.
Returns the response body of the lookup request.

poll_operation_until_done(name, polling_interval_in_seconds)
Poll backup operation state until it’s completed
rollback(transaction)
Roll back a transaction
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/rollback

Parameters transaction – the transaction to roll back

run_query(body)
Run a query for entities.
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/runQuery

Parameters body – the body of the query request


Returns the batch of query results.

3.16.5.12 Cloud ML Engine

Cloud ML Engine Operators

• MLEngineBatchPredictionOperator : Start a Cloud ML Engine batch prediction job.
• MLEngineModelOperator : Manages a Cloud ML Engine model.
• MLEngineTrainingOperator : Start a Cloud ML Engine training job.
• MLEngineVersionOperator : Manages a Cloud ML Engine model version.

MLEngineBatchPredictionOperator

class airflow.contrib.operators.mlengine_operator.MLEngineBatchPredictionOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Google Cloud ML Engine prediction job.
NOTE: For the model origin, users should choose exactly one of the three options below:
1. Populate the ‘uri’ field only, which should be a GCS location that points to a tensorflow SavedModel directory.
2. Populate the ‘model_name’ field only, which refers to an existing model; the default version of the model will be used.
3. Populate both ‘model_name’ and ‘version_name’ fields, which refers to a specific version of a specific model.
In options 2 and 3, both model and version name should contain the minimal identifier. For instance, call

MLEngineBatchPredictionOperator(
...,
model_name='my_model',
version_name='my_version',
...)

if the desired model version is “projects/my_project/models/my_model/versions/my_version”.


See https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs for further documentation on the parameters.
Parameters
• project_id (string) – The Google Cloud project name where the prediction job is
submitted. (templated)
• job_id (string) – A unique id for the prediction job on Google Cloud ML Engine.
(templated)
• data_format (string) – The format of the input data. It will default to
‘DATA_FORMAT_UNSPECIFIED’ if it is not provided or is not one of [“TEXT”,
“TF_RECORD”, “TF_RECORD_GZIP”].
• input_paths (list of string) – A list of GCS paths of input data for batch pre-
diction. The wildcard operator * is accepted, but only at the end of a path. (templated)
• output_path (string) – The GCS path where the prediction results are written to.
(templated)
• region (string) – The Google Compute Engine region to run the prediction job in.
(templated)
• model_name (string) – The Google Cloud ML Engine model to use for prediction. If
version_name is not provided, the default version of this model will be used. Should not be
None if version_name is provided. Should be None if uri is provided. (templated)
• version_name (string) – The Google Cloud ML Engine model version to use for
prediction. Should be None if uri is provided. (templated)
• uri (string) – The GCS path of the saved model to use for prediction. Should be None
if model_name is provided. It should be a GCS path pointing to a tensorflow SavedModel.
(templated)
• max_worker_count (int) – The maximum number of workers to be used for parallel
processing. Defaults to 10 if not specified.
• runtime_version (string) – The Google Cloud ML Engine runtime version to use
for batch prediction.

• gcp_conn_id (string) – The connection ID used for connection to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.

Raises: ValueError: if a unique model/version origin cannot be determined.
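Example: a minimal usage sketch (illustrative only) using option 2 above (model_name only, so the default version is used); the project, paths and model name are hypothetical, and a dag object is assumed.

from airflow.contrib.operators.mlengine_operator import MLEngineBatchPredictionOperator

batch_prediction = MLEngineBatchPredictionOperator(
    task_id='batch_prediction',
    project_id='my-gcp-project',
    job_id='prediction_{{ds_nodash}}',
    data_format='TEXT',
    input_paths=['gs://example/prediction/input/*'],
    output_path='gs://example/prediction/output/{{ds}}',
    region='us-central1',
    model_name='my_model',  # option 2: default version of an existing model
    dag=dag)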

MLEngineModelOperator

class airflow.contrib.operators.mlengine_operator.MLEngineModelOperator(**kwargs)
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine model.
Parameters
• project_id (string) – The Google Cloud project name to which MLEngine model
belongs. (templated)
• model (dict) – A dictionary containing the information about the model. If the operation
is create, then the model parameter should contain all the information about this model such
as name.
If the operation is get, the model parameter should contain the name of the model.
• operation (string) – The operation to perform. Available operations are:
– create: Creates a new model as provided by the model parameter.
– get: Gets a particular model where the name is specified in model.
• gcp_conn_id (string) – The connection ID to use when fetching connection info.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.

MLEngineTrainingOperator

class airflow.contrib.operators.mlengine_operator.MLEngineTrainingOperator(**kwargs)
Bases: airflow.models.BaseOperator
Operator for launching a MLEngine training job.
Parameters
• project_id (string) – The Google Cloud project name within which MLEngine train-
ing job should run (templated).
• job_id (string) – A unique templated id for the submitted Google MLEngine training
job. (templated)
• package_uris (string) – A list of package locations for MLEngine training job,
which should include the main training program + any additional dependencies. (templated)
• training_python_module (string) – The Python module name to run within
MLEngine training job after installing ‘package_uris’ packages. (templated)
• training_args (string) – A list of templated command line arguments to pass to the
MLEngine training program. (templated)

• region (string) – The Google Compute Engine region to run the MLEngine training
job in (templated).
• scale_tier (string) – Resource tier for MLEngine training job. (templated)
• runtime_version (string) – The Google Cloud ML runtime version to use for train-
ing. (templated)
• python_version (string) – The version of Python used in training. (templated)
• job_dir (string) – A Google Cloud Storage path in which to store training outputs and
other data needed for training. (templated)
• gcp_conn_id (string) – The connection ID to use when fetching connection info.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• mode (string) – Can be one of ‘DRY_RUN’/’CLOUD’. In ‘DRY_RUN’ mode, no real
training job will be launched, but the MLEngine training job request will be printed out. In
‘CLOUD’ mode, a real MLEngine training job creation request will be issued.
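Example: a minimal usage sketch (illustrative only); the package URI, module name and other values are hypothetical placeholders, and a dag object is assumed.

from airflow.contrib.operators.mlengine_operator import MLEngineTrainingOperator

training = MLEngineTrainingOperator(
    task_id='ml_engine_training',
    project_id='my-gcp-project',
    job_id='training_{{ds_nodash}}',
    package_uris=['gs://example/packages/trainer-0.1.tar.gz'],  # hypothetical package
    training_python_module='trainer.task',  # hypothetical module
    training_args=['--epochs=10'],
    region='us-central1',
    scale_tier='STANDARD_1',
    dag=dag)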

MLEngineVersionOperator

class airflow.contrib.operators.mlengine_operator.MLEngineVersionOperator(**kwargs)
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine version.
Parameters
• project_id (string) – The Google Cloud project name to which MLEngine model
belongs.
• model_name (string) – The name of the Google Cloud ML Engine model that the
version belongs to. (templated)
• version_name (string) – A name to use for the version being operated upon. If not
None and the version argument is None or does not have a value for the name key, then this
will be populated in the payload for the name key. (templated)
• version (dict) – A dictionary containing the information about the version. If the oper-
ation is create, version should contain all the information about this version such as name,
and deploymentUrl. If the operation is get or delete, the version parameter should contain
the name of the version. If it is None, the only operation possible would be list. (templated)
• operation (string) – The operation to perform. Available operations are:
– create: Creates a new version in the model specified by model_name, in which case
the version parameter should contain all the information to create that version (e.g. name,
deploymentUrl).
– get: Gets full information of a particular version in the model specified by model_name.
The name of the version should be specified in the version parameter.
– list: Lists all available versions of the model specified by model_name.
– delete: Deletes the version specified in version parameter from the model specified by
model_name). The name of the version should be specified in the version parameter.
• gcp_conn_id (string) – The connection ID to use when fetching connection info.

• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
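Example: a minimal usage sketch (illustrative only) for the create operation; the model name, version body and export location are hypothetical, and a dag object is assumed (field names in the version body follow the ML Engine versions API).

from airflow.contrib.operators.mlengine_operator import MLEngineVersionOperator

create_version = MLEngineVersionOperator(
    task_id='create_version',
    project_id='my-gcp-project',
    model_name='my_model',
    version={
        'name': 'v1',
        'deploymentUri': 'gs://example/exported_model/',  # hypothetical export location
    },
    operation='create',
    dag=dag)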

Cloud ML Engine Hook

MLEngineHook

class airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook(gcp_conn_id=’google_cloud_default’,
delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
create_job(project_id, job, use_existing_job_fn=None)
Launches a MLEngine job and wait for it to reach a terminal state.
Parameters
• project_id (string) – The Google Cloud project id within which MLEngine job
will be launched.
• job (dict) – MLEngine Job object that should be provided to the MLEngine API, such
as:

{
    'jobId': 'my_job_id',
    'trainingInput': {
        'scaleTier': 'STANDARD_1',
        ...
    }
}

• use_existing_job_fn (function) – In case an MLEngine job with the same
job_id already exists, this method (if provided) decides whether we should reuse the
existing job, continue waiting for it to finish, and return the job object. It should accept
an MLEngine job object and return a boolean value indicating whether it is OK to reuse
the existing job. If ‘use_existing_job_fn’ is not provided, we by default reuse the existing
MLEngine job.
Returns The MLEngine job object if the job successfully reaches a terminal state (which might
be a FAILED or CANCELLED state).
Return type dict
create_model(project_id, model)
Create a Model. Blocks until finished.
create_version(project_id, model_name, version_spec)
Creates the Version on Google Cloud ML Engine.
Returns the operation if the version was created successfully and raises an error otherwise.
delete_version(project_id, model_name, version_name)
Deletes the given version of a model. Blocks until finished.
get_conn()
Returns a Google MLEngine service object.
get_model(project_id, model_name)
Gets a Model. Blocks until finished.

list_versions(project_id, model_name)
Lists all available versions of a model. Blocks until finished.
set_default_version(project_id, model_name, version_name)
Sets a version to be the default. Blocks until finished.

3.16.5.13 Cloud Storage

Storage Operators

• FileToGoogleCloudStorageOperator : Uploads a file to Google Cloud Storage.
• GoogleCloudStorageCreateBucketOperator : Creates a new cloud storage bucket.
• GoogleCloudStorageBucketCreateAclEntryOperator : Creates a new ACL entry on the specified bucket.
• GoogleCloudStorageDownloadOperator : Downloads a file from Google Cloud Storage.
• GoogleCloudStorageListOperator : List all objects from the bucket with the given string prefix and delimiter in name.
• GoogleCloudStorageToBigQueryOperator : Loads files from Google cloud storage into BigQuery.
• GoogleCloudStorageObjectCreateAclEntryOperator : Creates a new ACL entry on the specified object.
• GoogleCloudStorageToGoogleCloudStorageOperator : Copies objects from a bucket to another, with renaming
if requested.
• GoogleCloudStorageToGoogleCloudStorageTransferOperator : Copies objects from a bucket to another using
the Google Cloud Storage Transfer Service.
• MySqlToGoogleCloudStorageOperator: Copy data from any MySQL Database to Google cloud storage in
JSON format.

FileToGoogleCloudStorageOperator

class airflow.contrib.operators.file_to_gcs.FileToGoogleCloudStorageOperator(**kwargs)
Bases: airflow.models.BaseOperator
Uploads a file to Google Cloud Storage. Optionally can compress the file for upload.
Parameters
• src (string) – Path to the local file. (templated)
• dst (string) – Destination path within the specified bucket. (templated)
• bucket (string) – The bucket to upload to. (templated)
• google_cloud_storage_conn_id (string) – The Airflow connection ID to up-
load with
• mime_type (string) – The mime-type string
• delegate_to (str) – The account to impersonate, if any
• gzip (bool) – Allows for file to be compressed and uploaded as gzip
execute(context)
Uploads the file to Google cloud storage
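Example: a minimal usage sketch (illustrative only); the local path, destination path and bucket are hypothetical, and a dag object is assumed.

from airflow.contrib.operators.file_to_gcs import FileToGoogleCloudStorageOperator

upload_file = FileToGoogleCloudStorageOperator(
    task_id='upload_file',
    src='/tmp/report-{{ds}}.csv',  # hypothetical local file
    dst='reports/report-{{ds}}.csv',
    bucket='my-bucket',
    gzip=True,
    dag=dag)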

GoogleCloudStorageBucketCreateAclEntryOperator

class airflow.contrib.operators.gcs_acl_operator.GoogleCloudStorageBucketCreateAclEntryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new ACL entry on the specified bucket.
Parameters
• bucket (str) – Name of a bucket.
• entity (str) – The entity holding the permission, in one of the following forms: user-
userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId,
allUsers, allAuthenticatedUsers
• role (str) – The access permission for the entity. Acceptable values are: “OWNER”,
“READER”, “WRITER”.
• user_project (str) – (Optional) The project to be billed for this request. Required for
Requester Pays buckets.
• google_cloud_storage_conn_id (str) – The connection ID to use when connect-
ing to Google Cloud Storage.

GoogleCloudStorageCreateBucketOperator

class airflow.contrib.operators.gcs_operator.GoogleCloudStorageCreateBucketOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name
that is already in use.
See also:
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements

Parameters
• bucket_name (string) – The name of the bucket. (templated)
• storage_class (string) – This defines how objects in the bucket are stored and de-
termines the SLA and the cost of storage (templated). Values include
– MULTI_REGIONAL
– REGIONAL
– STANDARD
– NEARLINE
– COLDLINE.
If this value is not specified when the bucket is created, it will default to STANDARD.
• location (string) – The location of the bucket. (templated) Object data for objects in
the bucket resides in physical storage within this region. Defaults to US.
See also:
https://developers.google.com/storage/docs/bucket-locations
• project_id (string) – The ID of the GCP Project. (templated)

• labels (dict) – User-provided labels, in key/value pairs.
• google_cloud_storage_conn_id (string) – The connection ID to use when con-
necting to Google cloud storage.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.

Example: The following Operator would create a new bucket test-bucket with MULTI_REGIONAL stor-
age class in EU region

CreateBucket = GoogleCloudStorageCreateBucketOperator(
    task_id='CreateNewBucket',
    bucket_name='test-bucket',
    storage_class='MULTI_REGIONAL',
    location='EU',
    labels={'env': 'dev', 'team': 'airflow'},
    google_cloud_storage_conn_id='airflow-service-account'
)

GoogleCloudStorageDownloadOperator

class airflow.contrib.operators.gcs_download_operator.GoogleCloudStorageDownloadOperator(**kwargs)
Bases: airflow.models.BaseOperator
Downloads a file from Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is. (templated)
• object (string) – The name of the object to download in the Google cloud storage
bucket. (templated)
• filename (string) – The file path on the local file system (where the operator is being
executed) that the file should be downloaded to. (templated) If no filename passed, the
downloaded data will not be stored on the local file system.
• store_to_xcom_key (string) – If this param is set, the operator will push the con-
tents of the downloaded file to XCom with the key set in this parameter. If not set, the
downloaded data will not be pushed to XCom. (templated)
• google_cloud_storage_conn_id (string) – The connection ID to use when con-
necting to Google cloud storage.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
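Example: a minimal usage sketch (illustrative only); the bucket, object and local filename are hypothetical, and a dag object is assumed.

from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator

download_file = GoogleCloudStorageDownloadOperator(
    task_id='download_file',
    bucket='my-bucket',
    object='reports/report-{{ds}}.csv',  # hypothetical object name
    filename='/tmp/report.csv',
    dag=dag)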

GoogleCloudStorageListOperator

class airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator(**kwargs)
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix and delimiter in name.
This operator returns a python list with the names of objects, which can be used by XCom in the down-
stream task.

Parameters
• bucket (string) – The Google cloud storage bucket to find the objects. (templated)
• prefix (string) – Prefix string which filters objects whose name begin with this prefix.
(templated)
• delimiter (string) – The delimiter by which you want to filter the objects. (templated)
For example, to list the CSV files in a directory in GCS you would use delimiter=’.csv’.
• google_cloud_storage_conn_id (string) – The connection ID to use when con-
necting to Google cloud storage.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.

Example: The following Operator would list all the Avro files from sales/sales-2017 folder in data
bucket.

GCS_Files = GoogleCloudStorageListOperator(
    task_id='GCS_Files',
    bucket='data',
    prefix='sales/sales-2017/',
    delimiter='.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)

GoogleCloudStorageObjectCreateAclEntryOperator

class airflow.contrib.operators.gcs_acl_operator.GoogleCloudStorageObjectCreateAclEntryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new ACL entry on the specified object.
Parameters
• bucket (str) – Name of a bucket.
• object_name (str) – Name of the object. For information about how to URL encode ob-
ject names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding
• entity (str) – The entity holding the permission, in one of the following forms: user-
userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId,
allUsers, allAuthenticatedUsers
• role (str) – The access permission for the entity. Acceptable values are: “OWNER”,
“READER”.
• generation (str) – (Optional) If present, selects a specific revision of this object (as
opposed to the latest version, the default).
• user_project (str) – (Optional) The project to be billed for this request. Required for
Requester Pays buckets.
• google_cloud_storage_conn_id (str) – The connection ID to use when connect-
ing to Google Cloud Storage.

GoogleCloudStorageToBigQueryOperator

class airflow.contrib.operators.gcs_to_bq.GoogleCloudStorageToBigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Loads files from Google cloud storage into BigQuery.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly
pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in
Google cloud storage must be a JSON file with the schema fields in it.
Parameters
• bucket (string) – The bucket to load from. (templated)
• source_objects (list of str) – List of Google cloud storage URIs to load from.
(templated) If source_format is ‘DATASTORE_BACKUP’, the list must only contain a sin-
gle URI.
• destination_project_dataset_table (string) – The dotted
(<project>.)<dataset>.<table> BigQuery table to load data into. If <project> is not
included, project will be the project defined in the connection json. (templated)
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load
Should not be set when source_format is ‘DATASTORE_BACKUP’.
• schema_object (string) – If set, a GCS object path pointing to a .json file that con-
tains the schema for the table. (templated)
• source_format (string) – File format to export.
• compression (string) – [Optional] The compression type of the data source. Possible
values include GZIP and NONE. The default value is NONE. This setting is ignored for
Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
• create_disposition (string) – The create disposition if the table doesn’t exist.
• skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
• write_disposition (string) – The write disposition if the table already exists.
• field_delimiter (string) – The delimiter to use when loading from a CSV.
• max_bad_records (int) – The maximum number of bad records that BigQuery can
ignore when running the job.
• quote_character (string) – The value that is used to quote data sections in a CSV
file.
• ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow
extra values that are not represented in the table schema. If true, the extra values are ignored.
If false, records with extra columns are treated as bad records, and if there are too many bad
records, an invalid error is returned in the job result.
• allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not
(false).
• allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns.
The missing values are treated as nulls. If false, records with missing trailing columns are
treated as bad records, and if there are too many bad records, an invalid error is returned in
the job result. Only applicable to CSV, ignored for other formats.

• max_id_key (string) – If set, the name of a column in the BigQuery table that’s to be
loaded. This will be used to select the MAX value from BigQuery after the load occurs. The
results will be returned by the execute() command, which in turn gets stored in XCom for
future operators to use. This can be helpful with incremental loads, since during future
executions you can pick up from the max ID.
• bigquery_conn_id (string) – Reference to a specific BigQuery hook.
• google_cloud_storage_conn_id (string) – Reference to a specific Google
cloud storage hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• schema_update_options (list) – Allows the schema of the destination table to be
updated as a side effect of the load job.
• src_fmt_configs (dict) – configure optional fields specific to the source format
• external_table (bool) – Flag to specify if the destination table should be a BigQuery
external table. Default Value is False.
• time_partitioning (dict) – configure optional time partitioning fields i.e. partition
by field, type and expiration as per API specifications. Note that ‘field’ is not available in
concurrency with dataset.table$partition.
• cluster_fields (list of str) – Request that the result of this load be stored sorted
by one or more columns. This is only available in conjunction with time_partitioning. The
order of columns given determines the sort order. Not applicable for external tables.
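Example: a minimal usage sketch (illustrative only); the bucket, source objects and destination table are hypothetical, and a dag object is assumed.

from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq',
    bucket='my-bucket',
    source_objects=['sales/sales-2017/*.csv'],
    destination_project_dataset_table='my_dataset.sales',  # hypothetical table
    source_format='CSV',
    skip_leading_rows=1,
    write_disposition='WRITE_TRUNCATE',
    dag=dag)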

GoogleCloudStorageToGoogleCloudStorageOperator

class airflow.contrib.operators.gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another, with renaming if requested.
Parameters
• source_bucket (string) – The source Google cloud storage bucket where the object
is. (templated)
• source_object (string) – The source name of the object to copy in the Google cloud
storage bucket. (templated) If wildcards are used in this argument:
You can use only one wildcard for objects (filenames) within your bucket. The wildcard
can appear inside the object name or at the end of the object name. Appending a
wildcard to the bucket name is unsupported.
• destination_bucket (string) – The destination Google cloud storage bucket where
the object should be. (templated)
• destination_object (string) – The destination name of the object in the destina-
tion Google cloud storage bucket. (templated) If a wildcard is supplied in the source_object
argument, this is the prefix that will be prepended to the final destination objects’ paths.
Note that the source path’s part before the wildcard will be removed; if it needs to be re-
tained it should be appended to destination_object. For example, with prefix foo/* and
destination_object blah/, the file foo/baz will be copied to blah/baz; to retain the
prefix write the destination_object as e.g. blah/foo, in which case the copied file will be
named blah/foo/baz.

• move_object (bool) – When move object is True, the object is moved instead of copied
to the new location. This is the equivalent of a mv command as opposed to a cp command.
• google_cloud_storage_conn_id (string) – The connection ID to use when con-
necting to Google cloud storage.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.

Examples: The following Operator would copy a single file named sales/sales-2017/january.avro
in the data bucket to the file named copied_sales/2017/january-backup.avro in the
data_backup bucket

copy_single_file = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_object='sales/sales-2017/january.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/january-backup.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)

The following Operator would copy all the Avro files from sales/sales-2017 folder (i.e. with names
starting with that prefix) in data bucket to the copied_sales/2017 folder in the data_backup
bucket.

copy_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/',
    google_cloud_storage_conn_id=google_cloud_conn_id
)

The following Operator would move all the Avro files from sales/sales-2017 folder (i.e. with names
starting with that prefix) in data bucket to the same folder in the data_backup bucket, deleting the
original files in the process.

move_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='move_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    move_object=True,
    google_cloud_storage_conn_id=google_cloud_conn_id
)

GoogleCloudStorageToGoogleCloudStorageTransferOperator

class airflow.contrib.operators.gcs_to_gcs_transfer_operator.GoogleCloudStorageToGoogleCloudStorageTransferOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another using the GCP Storage Transfer Service.
Parameters

• source_bucket (str) – The source Google cloud storage bucket where the object is.
(templated)
• destination_bucket (str) – The destination Google cloud storage bucket where the
object should be. (templated)
• project_id (str) – The ID of the Google Cloud Platform Console project that owns the
job
• gcp_conn_id (str) – Optional connection ID to use when connecting to Google Cloud
Storage.
• delegate_to (str) – The account to impersonate, if any. For this to work, the service
account making the request must have domain-wide delegation enabled.
• description (str) – Optional transfer service job description
• schedule (dict) – Optional transfer service schedule; see https://cloud.google.com/storage-transfer/docs/reference/rest/v1/transferJobs.
If not set, run transfer job once as soon as the operator runs
• object_conditions (dict) – Optional transfer service object conditions; see
https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#ObjectConditions
• transfer_options (dict) – Optional transfer service transfer options; see
https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#TransferOptions
• wait (bool) – Wait for transfer to finish; defaults to True
Example:
gcs_to_gcs_transfer_op = GoogleCloudStorageToGoogleCloudStorageTransferOperator(
    task_id='gcs_to_gcs_transfer_example',
    source_bucket='my-source-bucket',
    destination_bucket='my-destination-bucket',
    project_id='my-gcp-project',
    dag=my_dag)

MySqlToGoogleCloudStorageOperator

GoogleCloudStorageHook

class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Storage. This hook uses the Google Cloud Platform connection.
copy(source_bucket, source_object, destination_bucket=None, destination_object=None)
Copies an object from a bucket to another, with renaming if requested.
destination_bucket or destination_object can be omitted, in which case source bucket/object is used, but
not both.
Parameters
• source_bucket (string) – The bucket of the object to copy from.
• source_object (string) – The object to copy.
• destination_bucket (string) – The destination of the object to be copied to. Can
be omitted; then the same bucket is used.

• destination_object (string) – The (renamed) path of the object if given. Can
be omitted; then the same name is used.
create_bucket(bucket_name, storage_class=’MULTI_REGIONAL’, location=’US’,
project_id=None, labels=None)
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a
name that is already in use.
See also:
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements

Parameters
• bucket_name (string) – The name of the bucket.
• storage_class (string) – This defines how objects in the bucket are stored and
determines the SLA and the cost of storage. Values include
– MULTI_REGIONAL
– REGIONAL
– STANDARD
– NEARLINE
– COLDLINE.
If this value is not specified when the bucket is created, it will default to STANDARD.
• location (string) – The location of the bucket. Object data for objects in the bucket
resides in physical storage within this region. Defaults to US.
See also:
https://developers.google.com/storage/docs/bucket-locations
• project_id (string) – The ID of the GCP Project.
• labels (dict) – User-provided labels, in key/value pairs.
Returns If successful, it returns the id of the bucket.

delete(bucket, object, generation=None)
Delete an object if versioning is not enabled for the bucket, or if the generation parameter is used.
Parameters
• bucket (string) – name of the bucket, where the object resides
• object (string) – name of the object to delete
• generation (string) – if present, permanently delete the object of this generation
Returns True if succeeded
download(bucket, object, filename=None)
Get a file from Google Cloud Storage.
Parameters
• bucket (string) – The bucket to fetch from.
• object (string) – The object to fetch.
• filename (string) – If set, a local file path where the file should be written to.
