Apache Airflow Documentation
3.10.2.29 clear
airflow clear [-h] [-t TASK_REGEX] [-s START_DATE] [-e END_DATE] [-sd SUBDIR]
              [-u] [-d] [-c] [-f] [-r] [-x] [-xp] [-dx]
              dag_id
Positional Arguments
Named Arguments
3.10.2.30 list_users
3.10.2.31 next_execution
Positional Arguments
Named Arguments
-sd, --subdir File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags', where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' in 'airflow.cfg'
Default: "[AIRFLOW_HOME]/dags"
3.10.2.32 upgradedb
3.10.2.33 delete_dag
Positional Arguments
Named Arguments
The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been
met. Behind the scenes, it spins up a subprocess that monitors and stays in sync with the folder containing all your DAG
objects, and periodically (every minute or so) collects DAG parsing results and inspects active tasks to see
whether they can be triggered.
The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To kick it off,
all you need to do is execute airflow scheduler. It will use the configuration specified in airflow.cfg.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered
soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let’s Repeat That The scheduler runs your job one schedule_interval AFTER the start date, at the END of
the period.
The scheduler starts an instance of the executor specified in your airflow.cfg. If it happens to
be the LocalExecutor, tasks will be executed as subprocesses; in the case of CeleryExecutor and
MesosExecutor, tasks are executed remotely.
To start a scheduler, simply run the command:
airflow scheduler
Note: Use schedule_interval=None and not schedule_interval='None' when you don’t want to
schedule your DAG.
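For example, a minimal sketch (the DAG id and start date here are illustrative):

from datetime import datetime
from airflow import DAG

# This DAG is never scheduled automatically; it only runs when triggered manually.
dag = DAG(
    dag_id='manual_only_example',
    schedule_interval=None,          # the Python value None, not the string 'None'
    start_date=datetime(2019, 1, 1),
)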
Your DAG will be instantiated for each schedule, and a DAG Run entry is created for each schedule.
DAG runs have a state associated with them (running, failed, success), which informs the scheduler which set of schedules
should be evaluated for task submissions. Without the metadata at the DAG run level, the Airflow scheduler would
have much more work to do in order to figure out which tasks should be triggered, and would slow to a crawl. It might also
create undesired processing when changing the shape of your DAG, by say adding in new tasks.
An Airflow DAG with a start_date, possibly an end_date, and a schedule_interval defines a series of
intervals which the scheduler turns into individual DAG Runs and executes. A key capability of Airflow is that these
DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine the lifetime of the DAG (from
start to end/now, one interval at a time) and kick off a DAG Run for any interval that has not been run (or has been
cleared). This concept is called Catchup.
If your DAG is written to handle its own catchup (i.e. not limited to the interval, but instead to "now", for instance),
then you will want to turn catchup off, either on the DAG itself with dag.catchup = False, or by default at the
configuration file level with catchup_by_default = False. This instructs the scheduler to create a DAG Run
only for the most current instance of the DAG interval series.
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/apache/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@hourly',
}
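The example then instantiates the DAG itself; a minimal sketch, with catchup disabled as assumed by the discussion that follows (the DAG id is illustrative):

dag = DAG(
    'tutorial', catchup=False, default_args=default_args)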
In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM (or when it is triggered from the command
line), a single DAG Run will be created with an execution_date of 2016-01-01, and the next one will be created
just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02.
If the dag.catchup value had been True instead, the scheduler would have created a DAG Run for each completed
interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval hasn't completed) and
would execute them sequentially. This behavior is great for atomic datasets that can easily be split into
periods. Turning catchup off is great if your DAG Runs perform backfill internally.
Note that DAG Runs can also be created manually through the CLI by running the airflow trigger_dag
command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated
with the trigger's timestamp, and will be displayed in the UI alongside scheduled DAG runs.
• The first DAG Run is created based on the minimum start_date for the tasks in your DAG.
• Subsequent DAG Runs are created by the scheduler process, based on your DAG’s schedule_interval,
sequentially.
• When clearing a set of tasks' state in the hope of getting them to re-run, it is important to keep the DAG
Run's state in mind too, as it defines whether the scheduler should look into triggering tasks for that run.
3.12 Plugins
Airflow has a simple plugin manager built-in that can integrate external features to its core by simply dropping files in
your $AIRFLOW_HOME/plugins folder.
The python modules in the plugins folder get imported, and hooks, operators, sensors, macros, executors and
web views get integrated to Airflow’s main collections and become available for use.
Airflow offers a generic toolbox for working with data. Different organizations have different stacks and different
needs. Using Airflow plugins can be a way for companies to customize their Airflow installation to reflect their
ecosystem.
Plugins can be used as an easy way to write, share and activate new sets of features.
There’s also a need for a set of more complex applications to interact with different flavors of data and metadata.
Examples:
• A set of tools to parse Hive logs and expose Hive metadata (CPU / IO / phases / skew / ...)
• An anomaly detection framework, allowing people to collect metrics, set thresholds and alerts
• An auditing tool, helping understand who accesses what
• A config-driven SLA monitoring tool, allowing you to set monitored tables and at what time they should land,
alert people, and expose visualizations of outages
• ...
Airflow has many components that can be reused when building an application:
• A web server you can use to render your views
3.12.3 Interface
To create a plugin you will need to derive the airflow.plugins_manager.AirflowPlugin class and reference
the objects you want to plug into Airflow. Here's what the class you need to derive looks like:
class AirflowPlugin(object):
    # The name of your plugin (str)
    name = None
    # A list of class(es) derived from BaseOperator
    operators = []
    # A list of class(es) derived from BaseSensorOperator
    sensors = []
    # A list of class(es) derived from BaseHook
    hooks = []
    # A list of class(es) derived from BaseExecutor
    executors = []
    # A list of references to inject into the macros namespace
    macros = []
    # A list of objects created from a class derived
    # from flask_admin.BaseView
    admin_views = []
    # A list of Blueprint objects created from flask.Blueprint. For use with the flask_admin based GUI
    flask_blueprints = []
    # A list of menu links (flask_admin.base.MenuLink). For use with the flask_admin based GUI
    menu_links = []
    # A list of dictionaries containing FlaskAppBuilder BaseView object and some metadata. See example below
    appbuilder_views = []
    # A list of dictionaries containing FlaskAppBuilder BaseView object and some metadata. See example below
    appbuilder_menu_items = []
3.12.4 Example
The code below defines a plugin that injects a set of dummy object definitions in Airflow.
bp = Blueprint(
    "test_plugin", __name__,
    static_folder='static',
    static_url_path='/static/test_plugin')

ml = MenuLink(
    category='Test Plugin',
    name='Test Menu Link',
    url='https://airflow.apache.org/')
Airflow 1.10 introduced role based views using FlaskAppBuilder. You can configure which UI is used by setting rbac
= True in your airflow.cfg. To support plugin views and links for both versions of the UI and maintain backwards compatibility, the fields
appbuilder_views and appbuilder_menu_items were added to the AirflowTestPlugin class.
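These fields hold plain dictionaries; a minimal sketch of the expected shape (the view class and the names used here are illustrative):

from flask_appbuilder import BaseView as AppBuilderBaseView

class TestAppBuilderBaseView(AppBuilderBaseView):
    pass

# One dictionary per view / menu item; "view" takes a BaseView instance,
# "href" takes an external link for a plain menu item.
v_appbuilder_package = {"name": "Test View",
                        "category": "Test Plugin",
                        "view": TestAppBuilderBaseView()}
appbuilder_mitem = {"name": "Google",
                    "category": "Search",
                    "href": "https://www.google.com"}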
# my_package/my_plugin.py
from airflow.plugins_manager import AirflowPlugin
from airflow.models import BaseOperator
from airflow.hooks.base_hook import BaseHook

class MyOperator(BaseOperator):
    pass

class MyHook(BaseHook):
    pass

class MyAirflowPlugin(AirflowPlugin):
    name = 'my_namespace'
    operators = [MyOperator]
    hooks = [MyHook]
setup(
    name="my-package",
    ...
    entry_points={
        'airflow.plugins': [
            'my_plugin = my_package.my_plugin:MyAirflowPlugin'
        ]
    }
)
3.13 Security
By default, all gates are opened. An easy way to restrict access to the web application is to do it at the network level,
or by using SSH tunnels.
It is however possible to switch on authentication by either using one of the supplied backends or creating your own.
Be sure to checkout Experimental Rest API for securing the API.
Note: Airflow uses the config parser of Python. This config parser interpolates '%'-signs. Make sure to escape any %
signs in your config file (but not environment variables) as %%, otherwise Airflow might leak these passwords to a
log on a config parser exception.
3.13.1.1 Password
Note: This is for the flask-admin based web UI only. If you are using the FAB-based web UI with the RBAC feature, please use
the command line interface create_user to create accounts, or do that in the FAB-based UI itself.
One of the simplest mechanisms for authentication is requiring users to specify a password before logging in. Password
authentication requires the use of the password subpackage in your requirements file. Passwords are hashed with
bcrypt before being stored.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
When password auth is enabled, an initial user credential will need to be created before anyone can log in. An initial
user was not created in the migrations for this authentication backend, to prevent default Airflow installations from
attack. Creating a new user has to be done via a Python REPL on the same machine on which Airflow is installed.
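A sketch of that REPL session, assuming the password_auth backend above (username, email and password are placeholders):

$ python
>>> import airflow
>>> from airflow import models, settings
>>> from airflow.contrib.auth.backends.password_auth import PasswordUser
>>> user = PasswordUser(models.User())
>>> user.username = 'new_user_name'
>>> user.email = '[email protected]'
>>> user.password = 'set_the_password'
>>> session = settings.Session()
>>> session.add(user)
>>> session.commit()
>>> session.close()
>>> exit()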
3.13.1.2 LDAP
To turn on LDAP authentication configure your airflow.cfg as follows. Please note that the example uses an
encrypted connection to the ldap server, as we do not want passwords to be readable at the network level.
Additionally, if you are using Active Directory, and are not explicitly specifying an OU that your users are in, you will
need to change search_scope to “SUBTREE”.
Valid search_scope options can be found in the ldap3 Documentation
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.ldap_auth
[ldap]
# set a connection without encryption: uri = ldap://<your.ldap.server>:<port>
uri = ldaps://<your.ldap.server>:<port>
user_filter = objectClass=*
# in case of Active Directory you would use: user_name_attr = sAMAccountName
user_name_attr = uid
# group_member_attr should be set accordingly with *_filter
# eg :
# group_member_attr = groupMembership
# superuser_filter = groupMembership=CN=airflow-super-users...
group_member_attr = memberOf
superuser_filter = memberOf=CN=airflow-super-users,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com
data_profiler_filter = memberOf=CN=airflow-data-profilers,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com
bind_user = cn=Manager,dc=example,dc=com
bind_password = insecure
basedn = dc=example,dc=com
cacert = /etc/ca/ldap_ca.crt
# Set search_scope to one of them: BASE, LEVEL, SUBTREE
# Set search_scope to SUBTREE if using Active Directory, and not specifying an Organizational Unit
search_scope = LEVEL
The superuser_filter and data_profiler_filter are optional. If defined, these configurations allow you to specify LDAP
groups that users must belong to in order to have superuser (admin) and data-profiler permissions. If undefined, all
users will be superusers and data profilers.
Airflow uses flask_login and exposes a set of hooks in the airflow.default_login module. You can alter
the content and make it part of the PYTHONPATH and configure it as a backend in airflow.cfg.
[webserver]
authenticate = True
auth_backend = mypackage.auth
3.13.2 Multi-tenancy
You can filter the list of DAGs in the webserver by owner name when authentication is turned on, by setting
webserver:filter_by_owner in your config. With this, a user will see only the DAGs that they own,
unless they are a superuser.
[webserver]
filter_by_owner = True
3.13.3 Kerberos
Airflow has initial support for Kerberos. This means that Airflow can renew Kerberos tickets for itself and store them in
the ticket cache. The hooks and DAGs can make use of the ticket to authenticate against kerberized services.
3.13.3.1 Limitations
Please note that at this time, not all hooks have been adjusted to make use of this functionality. Also, it does not
integrate Kerberos into the web interface; you will have to rely on network-level security for now to make sure your
service remains secure.
Celery integration has not been tried and tested yet. However, if you generate a keytab for every host and launch a
ticket renewer next to every worker it will most likely work.
Airflow
# Create the airflow keytab file that will contain the airflow principal
kadmin: xst -norandkey -k airflow.keytab airflow/fully.qualified.domain.name
Now store this file in a location where the airflow user can read it (chmod 600), and then add the following to your
airflow.cfg:
[core]
security = kerberos
[kerberos]
keytab = /etc/airflow/airflow.keytab
reinit_frequency = 3600
principal = airflow
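The ticket renewer that keeps this ticket cache fresh is started as its own long-running command, for example:

# run alongside your scheduler/workers so tickets are renewed periodically
airflow kerberos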
Hadoop
If you want to use impersonation, this needs to be enabled in the core-site.xml of your Hadoop config.
<property>
  <name>hadoop.proxyuser.airflow.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.airflow.users</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.airflow.hosts</name>
  <value>*</value>
</property>
Of course if you need to tighten your security replace the asterisk with something more appropriate.
The hive hook has been updated to take advantage of kerberos authentication. To allow your DAGs to use it, simply
update the connection details with, for example:
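For example, the connection's extra field could contain something like the following (the principal shown is illustrative):

{ "use_beeline": true, "principal": "hive/[email protected]" }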
Adjust the principal to your settings. The _HOST part will be replaced by the fully qualified domain name of the
server.
You can specify whether you would like to use the DAG owner as the user for the connection, or the user specified in the login
section of the connection. To run as the DAG owner, specify the following:
run_as_owner=True
To use kerberos authentication, you must install Airflow with the kerberos extras group:
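For example (the package and extras names below assume a 1.10-era installation):

pip install apache-airflow[kerberos]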
The GitHub Enterprise authentication backend can be used to authenticate users against an installation of GitHub
Enterprise using OAuth2. You can optionally specify a team whitelist (composed of slug cased team names) to restrict
login to only members of those teams.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.github_enterprise_auth
[github_enterprise]
host = github.example.com
client_id = oauth_key_from_github_enterprise
client_secret = oauth_secret_from_github_enterprise
oauth_callback_route = /example/ghe_oauth/callback
allowed_teams = 1, 345, 23
Note: If you do not specify a team whitelist, anyone with a valid account on your GHE installation will be able to
login to Airflow.
To use GHE authentication, you must install Airflow with the github_enterprise extras group:
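For example (package name assumed as for a 1.10-era installation):

pip install apache-airflow[github_enterprise]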
An application must be set up in GHE before you can use the GHE authentication backend. In order to set up an
application:
1. Navigate to your GHE profile
2. Select ‘Applications’ from the left hand nav
3. Select the ‘Developer Applications’ tab
4. Click ‘Register new application’
5. Fill in the required information (the ‘Authorization callback URL’ must be fully qualified, e.g. http://airflow.example.com/example/ghe_oauth/callback)
6. Click ‘Register application’
7. Copy ‘Client ID’, ‘Client Secret’, and your callback route to your airflow.cfg according to the above example
The Google authentication backend can be used to authenticate users against Google using OAuth2. You must specify
the domains to restrict login, separated with a comma, to only members of those domains.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.google_auth
[google]
client_id = google_client_id
client_secret = google_client_secret
oauth_callback_route = /oauth2callback
domain = "example1.com,example2.com"
To use Google authentication, you must install Airflow with the google_auth extras group:
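For example (package name assumed as for a 1.10-era installation):

pip install apache-airflow[google_auth]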
An application must be set up in the Google API Console before you can use the Google authentication backend. In
order to set up an application:
1. Navigate to https://console.developers.google.com/apis/
2. Select ‘Credentials’ from the left hand nav
3. Click ‘Create credentials’ and choose ‘OAuth client ID’
4. Choose ‘Web application’
5. Fill in the required information (the ‘Authorized redirect URIs’ must be fully qualified, e.g. http://airflow.example.com/oauth2callback)
6. Click ‘Create’
7. Copy ‘Client ID’, ‘Client Secret’, and your redirect URI to your airflow.cfg according to the above example
3.13.5 SSL
SSL can be enabled by providing a certificate and key. Once enabled, be sure to use “https://” in your browser.
[webserver]
web_server_ssl_cert = <path to cert>
web_server_ssl_key = <path to key>
Enabling SSL will not automatically change the web server port. If you want to use the standard port 443, you’ll need
to configure that too. Be aware that super user privileges (or cap_net_bind_service on Linux) are required to listen on
port 443.
Enable CeleryExecutor with SSL. Ensure you properly generate client and server certs and keys.
[celery]
ssl_active = True
ssl_key = <path to key>
ssl_cert = <path to cert>
ssl_cacert = <path to cacert>
3.13.6 Impersonation
Airflow has the ability to impersonate a unix user while running task instances based on the task’s run_as_user
parameter, which takes a user’s name.
NOTE: For impersonation to work, Airflow must be run with sudo, as subtasks are run with sudo -u and the permissions
of files are changed. Furthermore, the unix user needs to exist on the worker. Here is what a simple sudoers file entry
could look like to achieve this, assuming Airflow is running as the airflow user. Note that this means that the airflow
user must be trusted and treated the same way as the root user.
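A sketch of such an entry (this assumes the airflow user may sudo to any user without a password; tighten it to specific users where possible):

airflow ALL=(ALL) NOPASSWD: ALL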
Subtasks with impersonation will still log to the same folder, except that the files they log to will have their permissions
changed such that only the unix user can write to them.
To prevent tasks that don't use impersonation from being run with sudo privileges, you can set the
core:default_impersonation config, which sets a default user to impersonate if run_as_user is not set.
[core]
default_impersonation = airflow
Basic authentication for Celery Flower is supported; you can specify one or more comma-separated user:password pairs:
[celery]
flower_basic_auth = user1:password1,user2:password2
Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the
database. It allows you to run your DAGs with time zone dependent schedules. At the moment Airflow does not
convert them to the end user’s time zone in the user interface. There it will always be displayed in UTC. Also
templates used in Operators are not converted. Time zone information is exposed, and it is up to the writer of the DAG
to decide what to do with it.
This is handy if your users live in more than one time zone and you want to display datetime information according to
each user’s wall clock.
Even if you are running Airflow in only one time zone, it is still good practice to store data in UTC in your database
(before Airflow became time zone aware this was also the recommended or even required setup). The main reason
is Daylight Saving Time (DST). Many countries have a system of DST, where clocks are moved forward in spring
and backward in autumn. If you’re working in local time, you’re likely to encounter errors twice a year, when the
transitions happen. (The pendulum and pytz documentation discusses these issues in greater detail.) This probably
doesn’t matter for a simple DAG, but it’s a problem if you are in, for example, financial services where you have end
of day deadlines to meet.
The time zone is set in airflow.cfg. By default it is set to utc, but you can change it to use the system's settings or an
arbitrary IANA time zone, e.g. Europe/Amsterdam. It is dependent on pendulum, which is more accurate than pytz.
Pendulum is installed when you install Airflow.
Please note that the Web UI currently only runs in UTC.
3.14.1 Concepts
Python’s datetime.datetime objects have a tzinfo attribute that can be used to store time zone information, represented
as an instance of a subclass of datetime.tzinfo. When this attribute is set and describes an offset, a datetime object is
aware. Otherwise, it’s naive.
You can use timezone.is_aware() and timezone.is_naive() to determine whether datetimes are aware or naive.
Because Airflow uses time-zone-aware datetime objects, any datetime objects your code creates need to be aware
too.
now = timezone.utcnow()
a_date = timezone.datetime(2017, 1, 1)
Although Airflow operates fully time zone aware, it still accepts naive date time objects for start_dates and end_dates
in your DAG definitions. This is mostly in order to preserve backwards compatibility. In case a naive start_date or
end_date is encountered the default time zone is applied. It is applied in such a way that it is assumed that the naive date
time is already in the default time zone. In other words if you have a default time zone setting of Europe/Amsterdam
and create a naive datetime start_date of datetime(2017,1,1) it is assumed to be a start_date of Jan 1, 2017 Amsterdam
time.
default_args = dict(
    start_date=datetime(2016, 1, 1),
    owner='Airflow'
)
Unfortunately, during DST transitions, some datetimes don’t exist or are ambiguous. In such situations, pendulum
raises an exception. That’s why you should always create aware datetime objects when time zone support is enabled.
In practice, this is rarely an issue. Airflow gives you aware datetime objects in the models and DAGs, and most often,
new datetime objects are created from existing ones through timedelta arithmetic. The only datetime that’s often
created in application code is the current time, and timezone.utcnow() automatically does the right thing.
The default time zone is the time zone defined by the default_timezone setting under [core]. If you just
installed Airflow it will be set to utc, which is recommended. You can also set it to system or an IANA time zone
(e.g. Europe/Amsterdam). DAGs are also evaluated on Airflow workers, so it is important to make sure this
setting is equal on all Airflow nodes.
[core]
default_timezone = utc
Creating a time zone aware DAG is quite simple. Just make sure to supply a time zone aware start_date. It is
recommended to use pendulum for this, but pytz (to be installed manually) can also be used.
import pendulum

local_tz = pendulum.timezone("Europe/Amsterdam")

default_args = dict(
    start_date=datetime(2016, 1, 1, tzinfo=local_tz),
    owner='Airflow'
)
Please note that while it is possible to set a start_date and end_date for Tasks, the DAG timezone or global
timezone (in that order) will always be used to calculate the next execution date. Upon first encounter, the start date or end
date will be converted to UTC using the timezone associated with start_date or end_date; after that, the timezone
information is disregarded for calculations.
3.14.2.1 Templates
Airflow returns time zone aware datetimes in templates, but does not convert them to local time so they remain in
UTC. It is left up to the DAG to handle this.
import pendulum
local_tz = pendulum.timezone("Europe/Amsterdam")
local_tz.convert(execution_date)
In case you set a cron schedule, Airflow assumes you will always want to run at the exact same time. It will then
ignore daylight saving time. Thus, if you have a schedule that says run at the end of the interval every day at 08:00 GMT+1,
it will always run at the end of the interval at 08:00 GMT+1, regardless of whether daylight saving time is in effect.
For schedules with time deltas, Airflow assumes you always want to run with the specified interval. So if you
specify timedelta(hours=2), you will always run two hours later. In this case daylight saving time will be
taken into account.
Airflow exposes an experimental REST API. It is available through the webserver. Endpoints are available at
/api/experimental/. Please note that we expect the endpoint definitions to change.
3.15.1 Endpoints
POST /api/experimental/dags/<DAG_ID>/dag_runs
Creates a dag_run for a given dag id.
Trigger DAG with config, example:
curl -X POST \
http://localhost:8080/api/experimental/dags/<DAG_ID>/dag_runs \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/json' \
-d '{"conf":"{\"key\":\"value\"}"}'
GET /api/experimental/dags/<DAG_ID>/dag_runs
Returns a list of Dag Runs for a specific DAG ID.
GET /api/experimental/dags/<string:dag_id>/dag_runs/<string:execution_date>
Returns a JSON with a dag_run’s public instance variables. The format for the <string:execution_date> is
expected to be “YYYY-mm-DDTHH:MM:SS”, for example: “2016-11-16T11:34:15”.
GET /api/experimental/test
Checks that the REST API server is working correctly. Returns status ‘OK’.
GET /api/experimental/dags/<DAG_ID>/tasks/<TASK_ID>
Returns info for a task.
GET /api/experimental/dags/<DAG_ID>/dag_runs/<string:execution_date>/tasks/<TASK_ID>
Returns a JSON with a task instance’s public instance variables. The format for the <string:execution_date> is
expected to be “YYYY-mm-DDTHH:MM:SS”, for example: “2016-11-16T11:34:15”.
GET /api/experimental/dags/<DAG_ID>/paused/<string:paused>
‘<string:paused>’ must be ‘true’ to pause a DAG and ‘false’ to unpause it.
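For example, to pause a DAG (the host and DAG id are illustrative):

curl -X GET http://localhost:8080/api/experimental/dags/example_dag/paused/true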
GET /api/experimental/latest_runs
Returns the latest DagRun for each DAG formatted for the UI.
GET /api/experimental/pools
Get all pools.
GET /api/experimental/pools/<string:name>
Get pool by a given name.
POST /api/experimental/pools
Create a pool.
DELETE /api/experimental/pools/<string:name>
Delete pool.
3.15.2 CLI
For some functions the CLI can use the API. To configure the CLI to use the API when available, configure it as follows:
[cli]
api_client = airflow.api.client.json_client
endpoint_url = http://<WEBSERVER>:<PORT>
3.15.3 Authentication
Authentication for the API is handled separately from the Web Authentication. The default is to not require any
authentication on the API – i.e. wide open by default. This is not recommended if your Airflow webserver is publicly
accessible, and you should probably use the deny all backend:
[api]
auth_backend = airflow.api.auth.backend.deny_all
Two “real” methods for authentication are currently supported for the API.
To enable Password authentication, set the following in the configuration:
[api]
auth_backend = airflow.contrib.auth.backends.password_auth
Its usage is similar to the Password Authentication used for the Web interface.
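Clients then pass HTTP basic auth credentials with each request; a sketch (host and credentials are placeholders):

curl --user username:password http://localhost:8080/api/experimental/test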
To enable Kerberos authentication, set the following in the configuration:
[api]
auth_backend = airflow.api.auth.backend.kerberos_auth
[kerberos]
keytab = <KEYTAB>
3.16 Integration
• Reverse Proxy
• Azure: Microsoft Azure
• AWS: Amazon Web Services
• Databricks
• GCP: Google Cloud Platform
• Qubole
Airflow can be set up behind a reverse proxy, with the ability to set its endpoint with great flexibility.
For example, you can configure your reverse proxy to get:
https://lab.mycompany.com/myorg/airflow/
To do so, you need to set the following setting in your airflow.cfg:
base_url = http://my_host/myorg/airflow
Additionally, if you use the Celery Executor, you can get Flower at /myorg/flower with:
flower_url_prefix = /myorg/flower
server {
    listen 80;
    server_name lab.mycompany.com;

    location /myorg/airflow/ {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

server {
    listen 80;
    server_name lab.mycompany.com;

    location /myorg/flower/ {
        rewrite ^/myorg/flower/(.*)$ /$1 break;  # remove prefix from http header
        proxy_pass http://localhost:5555;
        proxy_set_header Host $host;
    }
}
To ensure that Airflow generates URLs with the correct scheme when running behind a TLS-terminating proxy, you
should configure the proxy to set the X-Forwarded-Proto header, and enable the ProxyFix middleware in your
airflow.cfg:
enable_proxy_fix = True
Note: you should only enable the ProxyFix middleware when running Airflow behind a trusted proxy (AWS ELB,
nginx, etc.).
Airflow has limited support for Microsoft Azure: interfaces exist only for Azure Blob Storage and Azure Data Lake.
The Hook, Sensor and Operator for Blob Storage and the Azure Data Lake Hook are in the contrib section.
All classes communicate via the Windows Azure Storage Blob protocol. Make sure that an Airflow connection of type
wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=KEY), or login
and SAS token in the extra field (see connection wasb_default for an example).
• WasbBlobSensor: Checks if a blob is present on Azure Blob storage.
• WasbPrefixSensor: Checks if blobs matching a prefix are present on Azure Blob storage.
• FileToWasbOperator: Uploads a local file to a container as a blob.
• WasbHook: Interface with Azure Blob Storage.
WasbBlobSensor
WasbPrefixSensor
FileToWasbOperator
WasbHook
Cloud variant of an SMB file share. Make sure that an Airflow connection of type wasb exists. Authorization can be
done by supplying a login (=Storage account name) and password (=Storage account key), or login and SAS token in
the extra field (see connection wasb_default for an example).
AzureFileShareHook
3.16.2.3 Logging
Airflow can be configured to read and write task logs in Azure Blob Storage. See Writing Logs to Azure Blob Storage.
AzureCosmosDBHook communicates via the Azure Cosmos library. Make sure that an Airflow connection of type
azure_cosmos exists. Authorization can be done by supplying a login (=Endpoint uri), password (=secret key) and
extra fields database_name and collection_name to specify the default database and collection to use (see connection
azure_cosmos_default for an example).
• AzureCosmosDBHook: Interface with Azure CosmosDB.
• AzureCosmosInsertDocumentOperator: Simple operator to insert document into CosmosDB.
• AzureCosmosDocumentSensor: Simple sensor to detect document existence in CosmosDB.
AzureCosmosDBHook
AzureCosmosInsertDocumentOperator
AzureCosmosDocumentSensor
AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection
of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), password (=Client
Secret) and extra fields tenant (Tenant) and account_name (Account Name)
(see connection azure_data_lake_default for an example).
• AzureDataLakeHook: Interface with Azure Data Lake.
• AzureDataLakeStorageListOperator: Lists the files located in a specified Azure Data Lake path.
• AdlsToGoogleCloudStorageOperator: Copies files from an Azure Data Lake path to a Google Cloud Storage
bucket.
AzureDataLakeHook
AzureDataLakeStorageListOperator
AdlsToGoogleCloudStorageOperator
Airflow has extensive support for Amazon Web Services. But note that the Hooks, Sensors and Operators are in the
contrib section.
EmrAddStepsOperator
class airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator(**kwargs)
Bases: airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow.
Parameters
• job_flow_id (str) – id of the JobFlow to add steps to. (templated)
• aws_conn_id (str) – aws connection to use
• steps (list) – boto3 style steps to be added to the jobflow. (templated)
EmrCreateJobFlowOperator
class airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can
be passed that override the config from the connection.
Parameters
• aws_conn_id (str) – aws connection to use
• emr_conn_id (str) – emr connection to use
• job_flow_overrides (dict) – boto3 style arguments to override emr_connection extra. (templated)
EmrTerminateJobFlowOperator
class airflow.contrib.operators.emr_terminate_job_flow_operator.EmrTerminateJobFlowOperator(**kwargs)
Bases: airflow.models.BaseOperator
Operator to terminate EMR JobFlows.
Parameters
• job_flow_id (str) – id of the JobFlow to terminate. (templated)
• aws_conn_id (str) – aws connection to use
EmrHook
3.16.3.2 AWS S3
S3Hook
S3FileTransformOperator
class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies data from a source S3 location to a temporary location on the local filesystem. Runs a transformation on
this file as specified by the transformation script and uploads the output to a destination S3 location.
The locations of the source and the destination files in the local filesystem are provided as the first and second
arguments to the transformation script. The transformation script is expected to read the data from source,
transform it and write the output to the local destination file. The operator then takes over control and uploads
the local destination file to S3.
S3 Select is also available to filter the source contents. Users can omit the transformation script if an S3 Select
expression is specified.
Parameters
• source_s3_key (str) – The key to be retrieved from S3. (templated)
• source_aws_conn_id (str) – source s3 connection
• source_verify (bool or str) – Whether or not to verify SSL certificates for the S3
connection. By default SSL certificates are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is
False), but SSL certificates will not be verified.
– path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You
can specify this argument if you want to use a different CA cert bundle than the one
used by botocore.
This is also applicable to dest_verify.
• dest_s3_key (str) – The key to be written to S3. (templated)
• dest_aws_conn_id (str) – destination s3 connection
• replace (bool) – Replace dest S3 key if it already exists
• transform_script (str) – location of the executable transformation script
• select_expression (str) – S3 Select expression
S3ListOperator
class airflow.contrib.operators.s3_list_operator.S3ListOperator(**kwargs)
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix in name.
This operator returns a python list with the names of objects which can be used by XCom in the downstream task.
Parameters
• bucket (string) – The S3 bucket where to find the objects. (templated)
• prefix (string) – Prefix string to filter the objects whose name begins with this prefix.
(templated)
• delimiter (string) – the delimiter marks key hierarchy. (templated)
• aws_conn_id (string) – The connection ID to use when connecting to S3 storage.
• verify – Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates
are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL
certificates will not be verified.
Example: The following operator would list all the files (excluding subfolders) from the S3
customers/2018/04/ key in the data bucket.
s3_file = S3ListOperator(
    task_id='list_3s_files',
    bucket='data',
    prefix='customers/2018/04/',
    delimiter='/',
    aws_conn_id='aws_customers_conn'
)
S3ToGoogleCloudStorageOperator
class airflow.contrib.operators.s3_to_gcs_operator.S3ToGoogleCloudStorageOperator(**kwargs)
Bases: airflow.contrib.operators.s3_list_operator.S3ListOperator
Synchronizes an S3 key, possibly a prefix, with a Google Cloud Storage destination path.
Parameters
• bucket (string) – The S3 bucket where to find the objects. (templated)
• prefix (string) – Prefix string which filters objects whose name begins with this prefix.
(templated)
• delimiter (string) – the delimiter marks key hierarchy. (templated)
• aws_conn_id (string) – The source S3 connection
• dest_gcs_conn_id (string) – The destination connection ID to use when connecting
to Google Cloud Storage.
• dest_gcs (string) – The destination Google Cloud Storage bucket and prefix where
you want to store the files. (templated)
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• replace (bool) – Whether you want to replace existing destination files or not.
• verify – Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates
are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL
certificates will not be verified.
Example:
s3_to_gcs_op = S3ToGoogleCloudStorageOperator(
    task_id='s3_to_gcs_example',
    bucket='my-s3-bucket',
    prefix='data/customers-201804',
    dest_gcs_conn_id='google_cloud_default',
    dest_gcs='gs://my.gcs.bucket/some/customers/',
    replace=False,
    dag=my_dag)
Note that bucket, prefix, delimiter and dest_gcs are templated, so you can use variables in them if
you wish.
S3ToGoogleCloudStorageTransferOperator
S3ToHiveTransfer
class airflow.operators.s3_to_hive_operator.S3ToHiveTransfer(**kwargs)
Bases: airflow.models.BaseOperator
Moves data from S3 to Hive. The operator downloads a file from S3, stores the file locally before loading it into
a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE
statements are generated. Hive data types are inferred from the cursor's metadata.
Note that the table generated in Hive uses STORED AS textfile which isn’t the most efficient serialization
format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use
this operator only to stage the data into a temporary table before loading it into its final destination using a
HiveOperator.
Parameters
• s3_key (str) – The key to be retrieved from S3. (templated)
• field_dict (dict) – A dictionary of the fields name in the file as keys and their Hive
types as values
• hive_table (str) – target Hive table, use dot notation to target a specific database.
(templated)
• create (bool) – whether to create the table if it doesn’t exist
• recreate (bool) – whether to drop and recreate the table at every execution
• partition (dict) – target partition as a dict of partition columns and values. (templated)
• headers (bool) – whether the file contains column names on the first line
• check_headers (bool) – whether the column names on the first line should be checked
against the keys of field_dict
• wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard
pattern
• delimiter (str) – field delimiter in the file
• aws_conn_id (str) – source s3 connection
• hive_cli_conn_id (str) – destination hive connection
• input_compressed (bool) – Boolean to determine if file decompression is required to
process headers
• tblproperties (dict) – TBLPROPERTIES of the hive table being created
• select_expression (str) – S3 Select expression
• verify – Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates
are verified. You can provide the following values:
– False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL
certificates will not be verified.
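A minimal sketch of wiring this operator into a DAG, using only the parameters documented above (connection ids, key, table and field names are illustrative):

s3_to_hive = S3ToHiveTransfer(
    task_id='stage_customers',
    s3_key='data/customers/{{ ds }}/customers.csv',   # templated S3 key
    field_dict={'id': 'BIGINT', 'name': 'STRING'},    # file columns -> Hive types
    hive_table='staging.customers',
    create=True,
    recreate=True,
    partition={'ds': '{{ ds }}'},
    headers=True,
    check_headers=True,
    delimiter=',',
    aws_conn_id='aws_default',
    hive_cli_conn_id='hive_cli_default',
    dag=dag)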
ECSOperator
class airflow.contrib.operators.ecs_operator.ECSOperator(**kwargs)
Bases: airflow.models.BaseOperator
Execute a task on AWS EC2 Container Service
Parameters
• task_definition (str) – the task definition name on EC2 Container Service
• cluster (str) – the cluster name on EC2 Container Service
• overrides (dict) – the same parameter that boto3 will receive (templated): http://boto3.readthedocs.org/en/latest/reference/services/ecs.html#ECS.Client.run_task
• aws_conn_id (str) – connection id of AWS credentials / region name. If None, credential
boto3 strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html).
• region_name (str) – region name to use in AWS Hook. Override the region_name in
connection (if provided)
• launch_type (str) – the launch type on which to run your task (‘EC2’ or ‘FARGATE’)
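A minimal sketch of an ECSOperator task, using the parameters documented above (the cluster, task definition and container names are illustrative):

hello_ecs = ECSOperator(
    task_id='run_hello_on_ecs',
    task_definition='my-task-definition',   # existing ECS task definition
    cluster='my-ecs-cluster',
    overrides={'containerOverrides': [
        {'name': 'my-container', 'command': ['echo', 'hello world']}
    ]},
    aws_conn_id='aws_default',
    region_name='eu-west-1',
    launch_type='EC2',
    dag=dag)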
AWSBatchOperator
class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(**kwargs)
Bases: airflow.models.BaseOperator
Execute a job on AWS Batch Service
Parameters
• job_name (str) – the name for the job that will run on AWS Batch
• job_definition (str) – the job definition name on AWS Batch
• job_queue (str) – the queue name on AWS Batch
• overrides (dict) – the same parameter that boto3 will receive on containerOverrides (templated): http://boto3.readthedocs.io/en/latest/reference/services/batch.html#submit_job
• max_retries (int) – exponential backoff retries while waiter is not merged, 4200 = 48
hours
• aws_conn_id (str) – connection id of AWS credentials / region name. If None, credential
boto3 strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html).
• region_name (str) – region name to use in AWS Hook. Override the region_name in
connection (if provided)
AwsRedshiftClusterSensor
class airflow.contrib.sensors.aws_redshift_cluster_sensor.AwsRedshiftClusterSensor(**kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a Redshift cluster to reach a specific status.
Parameters
• cluster_identifier (str) – The identifier for the cluster being pinged.
• target_status (str) – The cluster status desired.
poke(context)
Function that the sensors defined while deriving this class should override.
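A minimal sketch of using this sensor (the cluster identifier is illustrative; ‘available’ is a typical target status):

wait_for_cluster = AwsRedshiftClusterSensor(
    task_id='wait_for_redshift_cluster',
    cluster_identifier='my-redshift-cluster',
    target_status='available',
    poke_interval=60,   # seconds between pokes (BaseSensorOperator parameter)
    dag=dag)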
RedshiftHook
class airflow.contrib.hooks.redshift_hook.RedshiftHook(aws_conn_id='aws_default', verify=None)
Bases: airflow.contrib.hooks.aws_hook.AwsHook
RedshiftToS3Transfer
S3ToRedshiftTransfer
For more instructions on using Amazon SageMaker in Airflow, please see the SageMaker Python SDK README.
• SageMakerHook : Interact with Amazon SageMaker.
• SageMakerTrainingOperator : Create a SageMaker training job.
• SageMakerTuningOperator : Create a SageMaker tuning job.
• SageMakerModelOperator : Create a SageMaker model.
• SageMakerTransformOperator : Create a SageMaker transform job.
• SageMakerEndpointConfigOperator : Create a SageMaker endpoint config.
• SageMakerEndpointOperator : Create a SageMaker endpoint.
SageMakerHook
describe_training_job(name)
Return the training job info associated with the name
Parameters name (str) – the name of the training job
Returns A dict containing all the training job info
describe_training_job_with_log(job_name, positions, stream_names, instance_count, state, last_description, last_describe_job_call)
Return the training job info associated with job_name and print CloudWatch logs
describe_transform_job(name)
Return the transform job info associated with the name
Parameters name (string) – the name of the transform job
Returns A dict containing all the transform job info
describe_tuning_job(name)
Return the tuning job info associated with the name
Parameters name (string) – the name of the tuning job
Returns A dict containing all the tuning job info
get_conn()
Establish an AWS connection for SageMaker
Return type SageMaker.Client
get_log_conn()
Establish an AWS connection for retrieving logs during training
Return type CloudWatchLog.Client
log_stream(log_group, stream_name, start_time=0, skip=0)
A generator for log items in a single stream. This will yield all the items that are available at the current
moment.
Parameters
• log_group (str) – The name of the log group.
• stream_name (str) – The name of the specific stream.
• start_time (int) – The time stamp value to start reading the logs from (default: 0).
• skip (int) – The number of log entries to skip at the start (default: 0). This is for when
there are multiple entries at the same timestamp.
Return type dict
Returns
Parameters
• log_group (str) – The name of the log group.
• streams (list) – A list of the log stream names. The position of the stream in this list
is the stream number.
• positions (list) – A list of pairs of (timestamp, skip) which represents the last record
read from each stream.
Returns A tuple of (stream number, cloudwatch log event).
tar_and_s3_upload(path, key, bucket)
Tar the local file or directory and upload to s3
Parameters
• path (str) – local file or directory
• key (str) – s3 key
• bucket (str) – s3 bucket
Returns None
update_endpoint(config, wait_for_completion=True, check_interval=30, max_ingestion_time=None)
Update an endpoint
Parameters
• config (dict) – the config for endpoint
• wait_for_completion (bool) – if the program should keep running until job finishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any SageMaker
jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to endpoint update
SageMakerTrainingOperator
class airflow.contrib.operators.sagemaker_training_operator.SageMakerTrainingOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.SageMakerBaseOperator
Initiate a SageMaker training job.
This operator returns The ARN of the training job created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a training job (templated).
For details of the configuration parameter see SageMaker.Client.create_training_job()
• aws_conn_id (str) – The AWS connection ID to use.
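A minimal sketch of using the training operator in a DAG (the training_config dict is illustrative and would be whatever SageMaker.Client.create_training_job accepts; it is assumed to be defined elsewhere):

train_model = SageMakerTrainingOperator(
    task_id='sagemaker_train',
    config=training_config,   # dict accepted by create_training_job (assumed defined elsewhere)
    aws_conn_id='aws_default',
    dag=dag)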
SageMakerTuningOperator
class airflow.contrib.operators.sagemaker_tuning_operator.SageMakerTuningOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.SageMakerBaseOperator
Initiate a SageMaker hyperparameter tuning job.
This operator returns The ARN of the tuning job created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a tuning job (templated).
For details of the configuration parameter see SageMaker.Client.create_hyper_parameter_tuning_job()
• aws_conn_id (str) – The AWS connection ID to use.
• wait_for_completion (bool) – Set to True to wait until the tuning job finishes.
• check_interval (int) – If wait is set to True, the time interval, in seconds, that this
operation waits to check the status of the tuning job.
• max_ingestion_time (int) – If wait is set to True, the operation fails if the tuning job
doesn't finish within max_ingestion_time seconds. If you set this parameter to None, the
operation does not timeout.
SageMakerModelOperator
class airflow.contrib.operators.sagemaker_model_operator.SageMakerModelOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.SageMakerBaseOperator
Create a SageMaker model.
This operator returns The ARN of the model created in Amazon SageMaker
Parameters
• config (dict) – The configuration necessary to create a model.
For details of the configuration parameter see SageMaker.Client.create_model()
• aws_conn_id (str) – The AWS connection ID to use.
SageMakerTransformOperator
class airflow.contrib.operators.sagemaker_transform_operator.SageMakerTransformOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.SageMakerBaseOperator
Initiate a SageMaker transform job.
This operator returns The ARN of the model created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a transform job (templated).
If you need to create a SageMaker transform job based on an existing SageMaker model:
config = transform_config
If you need to create both a SageMaker model and a SageMaker transform job:
config = {
    'Model': model_config,
    'Transform': transform_config
}
SageMakerEndpointConfigOperator
class airflow.contrib.operators.sagemaker_endpoint_config_operator.SageMakerEndpointConfigOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.SageMakerBaseOperator
Create a SageMaker endpoint config.
This operator returns The ARN of the endpoint config created in Amazon SageMaker
Parameters
• config (dict) – The configuration necessary to create an endpoint config.
For details of the configuration parameter see SageMaker.Client.create_endpoint_config()
• aws_conn_id (str) – The AWS connection ID to use.
SageMakerEndpointOperator
class airflow.contrib.operators.sagemaker_endpoint_operator.SageMakerEndpointOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.SageMakerBaseOperator
Create a SageMaker endpoint.
This operator returns The ARN of the endpoint created in Amazon SageMaker
Parameters
• config (dict) – The configuration necessary to create an endpoint.
If you need to create a SageMaker endpoint based on an existing SageMaker model and an
existing SageMaker endpoint config:
config = endpoint_configuration
If you need to create all of: a SageMaker model, a SageMaker endpoint config and a SageMaker
endpoint:
config = {
    'Model': model_configuration,
    'EndpointConfig': endpoint_config_configuration,
    'Endpoint': endpoint_configuration
}
SageMakerHook
• wait_for_completion (bool) – Whether to keep looking for new log entries until the job completes
• check_interval (int) – The interval in seconds between polling for new log entries and job completion
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any SageMaker jobs that run
longer than this will fail. Setting this to None implies no timeout for any SageMaker job.
Returns None
check_tuning_config(tuning_config)
Check if a tuning configuration is valid
Parameters tuning_config (dict) – tuning_config
Returns None
configure_s3_resources(config)
Extract the S3 operations from the configuration and execute them.
Parameters config (dict) – config of SageMaker operation
Return type dict
create_endpoint(config, wait_for_completion=True, check_interval=30, max_ingestion_time=None)
Create an endpoint
Parameters
• config (dict) – the config for endpoint
• wait_for_completion (bool) – if the program should keep running until job finishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any SageMaker
jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to endpoint creation
create_endpoint_config(config)
Create an endpoint config
Parameters config (dict) – the config for endpoint-config
Returns A response to endpoint config creation
create_model(config)
Create a model job
Parameters config (dict) – the config for model
Returns A response to model creation
create_training_job(config, wait_for_completion=True, print_log=True, check_interval=30, max_ingestion_time=None)
Create a training job
Parameters
• config (dict) – the config for training
• wait_for_completion (bool) – if the program should keep running until job finishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any SageMaker
jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to training job creation
create_transform_job(config, wait_for_completion=True, check_interval=30, max_ingestion_time=None)
Create a transform job
Parameters
• config (dict) – the config for transform job
• wait_for_completion (bool) – if the program should keep running until job finishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any SageMaker
jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to transform job creation
create_tuning_job(config, wait_for_completion=True, check_interval=30, max_ingestion_time=None)
Create a tuning job
Parameters
• config (dict) – the config for tuning
• wait_for_completion (bool) – if the program should keep running until job finishes
• check_interval (int) – the time interval in seconds which the operator will check
the status of any SageMaker job
• max_ingestion_time (int) – the maximum ingestion time in seconds. Any SageMaker
jobs that run longer than this will fail. Setting this to None implies no timeout for
any SageMaker job.
Returns A response to tuning job creation
describe_endpoint(name)
Parameters name (string) – the name of the endpoint
Returns A dict containing all the endpoint info
describe_endpoint_config(name)
Return the endpoint config info associated with the name
Parameters name (string) – the name of the endpoint config
Returns A dict containing all the endpoint config info
describe_model(name)
Return the SageMaker model info associated with the name
Parameters name (string) – the name of the SageMaker model
Returns A dict containing all the model info
SageMakerTrainingOperator
class airflow.contrib.operators.sagemaker_training_operator.SageMakerTrainingOperator(**kwa
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Initiate a SageMaker training job.
This operator returns The ARN of the training job created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a training job (templated).
For details of the configuration parameter see SageMaker.Client.
create_training_job()
SageMakerTuningOperator
class airflow.contrib.operators.sagemaker_tuning_operator.SageMakerTuningOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Initiate a SageMaker hyperparameter tuning job.
This operator returns The ARN of the tuning job created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a tuning job (templated).
For details of the configuration parameter see SageMaker.Client.create_hyper_parameter_tuning_job()
• aws_conn_id (str) – The AWS connection ID to use.
• wait_for_completion (bool) – Set to True to wait until the tuning job finishes.
• check_interval (int) – If wait is set to True, the time interval, in seconds, that this
operation waits to check the status of the tuning job.
• max_ingestion_time (int) – If wait is set to True, the operation fails if the tuning job
doesn’t finish within max_ingestion_time seconds. If you set this parameter to None, the
operation does not timeout.
SageMakerModelOperator
class airflow.contrib.operators.sagemaker_model_operator.SageMakerModelOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Create a SageMaker model.
This operator returns the ARN of the model created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to create a model.
For details of the configuration parameter see SageMaker.Client.create_model()
• aws_conn_id (str) – The AWS connection ID to use.
SageMakerTransformOperator
class airflow.contrib.operators.sagemaker_transform_operator.SageMakerTransformOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Initiate a SageMaker transform job.
This operator returns the ARN of the model created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to start a transform job (templated).
If you need to create a SageMaker transform job based on an existing SageMaker model:
config = transform_config
If you need to create both SageMaker model and SageMaker Transform job:
config = {
    'Model': model_config,
    'Transform': transform_config
}
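A hedged sketch of wiring this operator into a DAG; the transform_config dict, connection ID, and dag object are assumed to be defined elsewhere, and wait_for_completion/check_interval are assumed to behave as for the tuning operator above:
transform = SageMakerTransformOperator(
    task_id='sagemaker_transform',
    config=transform_config,        # hypothetical config dict built elsewhere
    aws_conn_id='aws_default',      # assumed AWS connection ID
    wait_for_completion=True,
    check_interval=30,
    dag=dag)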
SageMakerEndpointConfigOperator
class airflow.contrib.operators.sagemaker_endpoint_config_operator.SageMakerEndpointConfigOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Create a SageMaker endpoint config.
This operator returns the ARN of the endpoint config created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to create an endpoint config.
For details of the configuration parameter see SageMaker.Client.create_endpoint_config()
• aws_conn_id (str) – The AWS connection ID to use.
SageMakerEndpointOperator
class airflow.contrib.operators.sagemaker_endpoint_operator.SageMakerEndpointOperator(**kwargs)
Bases: airflow.contrib.operators.sagemaker_base_operator.
SageMakerBaseOperator
Create a SageMaker endpoint.
This operator returns the ARN of the endpoint created in Amazon SageMaker.
Parameters
• config (dict) – The configuration necessary to create an endpoint.
If you need to create a SageMaker endpoint based on an existing SageMaker model and an
existing SageMaker endpoint config:
config = endpoint_configuration
If you need to create the SageMaker model, SageMaker endpoint config, and SageMaker endpoint
all together:
config = {
    'Model': model_configuration,
    'EndpointConfig': endpoint_config_configuration,
    'Endpoint': endpoint_configuration
}
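A hedged sketch of creating the model, endpoint config, and endpoint in a single task; the *_configuration dicts, connection ID, and dag object are assumed to be defined elsewhere:
deploy = SageMakerEndpointOperator(
    task_id='sagemaker_deploy_endpoint',
    config={
        'Model': model_configuration,
        'EndpointConfig': endpoint_config_configuration,
        'Endpoint': endpoint_configuration
    },
    aws_conn_id='aws_default',      # assumed AWS connection ID
    dag=dag)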
3.16.4 Databricks
Databricks has contributed an Airflow operator which enables submitting runs to the Databricks platform. Internally
the operator talks to the api/2.0/jobs/runs/submit endpoint.
3.16.4.1 DatabricksSubmitRunOperator
class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(**kwargs)
Bases: airflow.models.BaseOperator
Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint.
There are two ways to instantiate this operator.
In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/
submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json param-
eter. For example
json = {
    'new_cluster': {
        'spark_version': '2.1.0-db3-scala2.11',
        'num_workers': 2
    },
    'notebook_task': {
        'notebook_path': '/Users/[email protected]/PrepareData',
    },
}
notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)
Another way to accomplish the same thing is to use the named parameters of the
DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for
each top level parameter in the runs/submit endpoint. In this method, your code would look like this:
new_cluster = {
    'spark_version': '2.1.0-db3-scala2.11',
    'num_workers': 2
}
notebook_task = {
    'notebook_path': '/Users/[email protected]/PrepareData',
}
notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    new_cluster=new_cluster,
    notebook_task=notebook_task)
In the case where both the json parameter AND the named parameters are provided, they will be merged together.
If there are conflicts during the merge, the named parameters will take precedence and override the top level
json keys.
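As a hedged illustration of that merge behaviour, the named notebook_task below overrides the notebook_task key supplied inside json (the OldNotebook path is hypothetical):
json = {
    'new_cluster': new_cluster,
    'notebook_task': {'notebook_path': '/Users/[email protected]/OldNotebook'},
}
notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    json=json,
    # Named parameters take precedence, so this path wins over json['notebook_task'].
    notebook_task={'notebook_path': '/Users/[email protected]/PrepareData'})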
Currently the named parameters that DatabricksSubmitRunOperator supports are
• spark_jar_task
• notebook_task
• new_cluster
• existing_cluster_id
• libraries
• run_name
• timeout_seconds
Parameters
• json (dict) – A JSON object containing API parameters which will be passed directly
to the api/2.0/jobs/runs/submit endpoint. The other named parameters (i.e.
spark_jar_task, notebook_task..) to this operator will be merged with this json
dictionary if they are provided. If there are conflicts during the merge, the named parameters
will take precedence and override the top level json keys. (templated)
See also:
For more information about templating see Jinja Templating. https://docs.databricks.com/
api/latest/jobs.html#runs-submit
• spark_jar_task (dict) – The main class and parameters for the JAR task. Note
that the actual JAR is specified in the libraries. EITHER spark_jar_task OR
notebook_task should be specified. This field will be templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobssparkjartask
• notebook_task (dict) – The notebook path and parameters for the notebook task.
EITHER spark_jar_task OR notebook_task should be specified. This field will
be templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobsnotebooktask
• new_cluster (dict) – Specs for a new cluster on which this task will be run. EITHER
new_cluster OR existing_cluster_id should be specified. This field will be
templated.
See also:
https://docs.databricks.com/api/latest/jobs.html#jobsclusterspecnewcluster
• existing_cluster_id (string) – ID for existing cluster on which to run this task.
EITHER new_cluster OR existing_cluster_id should be specified. This field
will be templated.
• libraries (list of dicts) – Libraries which this run will use. This field will be
templated.
See also:
https://docs.databricks.com/api/latest/libraries.html#managedlibrarieslibrary
• run_name (string) – The run name used for this task. By default this will be set
to the Airflow task_id. This task_id is a required parameter of the superclass
BaseOperator. This field will be templated.
• timeout_seconds (int32) – The timeout for this run. By default a value of 0 is used
which means to have no timeout. This field will be templated.
• databricks_conn_id (string) – The name of the Airflow connection to use. By
default and in the common case this will be databricks_default. To use token based
authentication, provide the key token in the extra field for the connection.
• polling_period_seconds (int) – Controls the rate which we poll for the result of
this run. By default the operator will poll every 30 seconds.
• databricks_retry_limit (int) – Number of times to retry if the Databricks backend is unreachable. Its value must be greater than or equal to 1.
3.16.5 GCP: Google Cloud Platform
Airflow has extensive support for the Google Cloud Platform. Note, however, that most Hooks and Operators are in the
contrib section, which means they have beta status and can receive breaking changes between minor releases.
See the GCP connection type documentation to configure connections to GCP.
3.16.5.1 Logging
Airflow can be configured to read and write task logs in Google Cloud Storage. See Writing Logs to Google Cloud
Storage.
3.16.5.2 GoogleCloudBaseHook
class airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook(gcp_conn_id=’google_cloud_default’, delegate_to=None)
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.
LoggingMixin
A base hook for Google cloud-related hooks. Google cloud has a shared REST API client that is built in the
same way no matter which service you use. This class helps construct and authorize the credentials needed to
then call googleapiclient.discovery.build() to actually discover and build a client for a Google cloud service.
The class also contains some miscellaneous helper functions.
All hooks derived from this base hook use the ‘Google Cloud Platform’ connection type. Three ways of authentication are supported:
Default credentials: Only the ‘Project Id’ is required. You’ll need to have set up default credentials, such
as by the GOOGLE_APPLICATION_CREDENTIALS environment variable or from the metadata server on Google
Compute Engine.
JSON key file: Specify ‘Project Id’, ‘Keyfile Path’ and ‘Scope’.
Legacy P12 key files are not supported.
JSON data provided in the UI: Specify ‘Keyfile JSON’.
static fallback_to_default_project_id(func)
Decorator that provides fallback for Google Cloud Platform project id. If the project is None it will be
replaced with the project_id from the service account the Hook is authenticated with. Project id can be
specified either via project_id kwarg or via first parameter in positional args.
Parameters func – function to wrap
Returns result of the function call
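A minimal sketch of how a derived hook might apply this decorator; the subclass and method below are hypothetical:
from airflow.contrib.hooks.gcp_api_base_hook import GoogleCloudBaseHook

class MyGcpHook(GoogleCloudBaseHook):

    @GoogleCloudBaseHook.fallback_to_default_project_id
    def describe_project(self, project_id=None):
        # When the caller passes project_id=None, the decorator substitutes the
        # project_id of the service account this hook is authenticated with.
        return project_id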
3.16.5.3 BigQuery
BigQuery Operators
• BigQueryCheckOperator : Performs checks against a SQL query that will return a single row with different
values.
• BigQueryValueCheckOperator : Performs a simple value check using SQL code.
• BigQueryIntervalCheckOperator : Checks that the values of metrics given as SQL expressions are within a
certain tolerance of the ones from days_back before.
• BigQueryGetDataOperator : Fetches the data from a BigQuery table and returns data in a python list
• BigQueryCreateEmptyDatasetOperator : Creates an empty BigQuery dataset.
• BigQueryCreateEmptyTableOperator : Creates a new, empty table in the specified BigQuery dataset optionally
with schema.
• BigQueryCreateExternalTableOperator : Creates a new, external table in the dataset with the data in Google
Cloud Storage.
• BigQueryDeleteDatasetOperator : Deletes an existing BigQuery dataset.
• BigQueryTableDeleteOperator : Deletes an existing BigQuery table.
• BigQueryOperator : Executes BigQuery SQL queries in a specific BigQuery database.
• BigQueryToBigQueryOperator : Copy a BigQuery table to another BigQuery table.
• BigQueryToCloudStorageOperator : Transfers a BigQuery table to a Google Cloud Storage bucket
BigQueryCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against BigQuery. The BigQueryCheckOperator expects a sql query that will return a
single row. Each value on that first row is evaluated using python bool casting. If any of the values return
False the check is failed and errors out.
Note that Python bool casting evals the following as False:
• False
• 0
• Empty string ("")
• Empty list ([])
• Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much
more complex queries that could, for instance, check that the table has the same number of rows as the upstream
source table, or that the count of today’s partition is greater than yesterday’s partition, or that a set of metrics
are less than 3 standard deviations from the 7-day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your
DAG, you can either stop the critical path and prevent the publishing of dubious data, or run the check on the side
and receive email alerts without stopping the progress of the DAG.
Parameters
• sql (string) – the sql to be executed
• bigquery_conn_id (string) – reference to the BigQuery database
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL (false).
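A hedged example of using the operator as a data quality gate; the table, SQL, connection ID, and dag object are hypothetical:
check_not_empty = BigQueryCheckOperator(
    task_id='check_not_empty',
    sql='SELECT COUNT(*) FROM my_dataset.my_table',   # hypothetical table
    use_legacy_sql=False,
    bigquery_conn_id='bigquery_default',
    dag=dag)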
BigQueryValueCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using sql code.
Parameters
• sql (string) – the sql to be executed
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
BigQueryIntervalCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(**kwargs)
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from
days_back before.
This operator constructs and runs a query that selects the given metrics for the current day and for days_back days earlier, and checks that the ratios between the two stay within the given thresholds.
Parameters
• table (str) – the table name
• days_back (int) – number of days between ds and the ds we want to check against.
Defaults to 7 days
• metrics_threshold (dict) – a dictionary of ratios indexed by metrics, for example
‘COUNT(*)’: 1.5 would require a 50 percent or less difference between the current day, and
the prior days_back.
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
BigQueryGetDataOperator
class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(**kwargs)
Bases: airflow.models.BaseOperator
Fetches the data from a BigQuery table (or, alternatively, data for selected columns) and returns it in a
python list. The number of elements in the returned list will be equal to the number of rows fetched. Each
element in the list will again be a list, where each inner element represents the column values for that row.
Example Result: [['Tony', '10'], ['Mike', '20'], ['Steve', '15']]
Note: If you pass fields to selected_fields in a different order than the order of the columns already
in the BQ table, the data will still be returned in the order of the BQ table. For example, if the BQ table has three
columns [A,B,C] and you pass ‘B,A’ in selected_fields, the data will still be of the form 'A,B'.
Example:
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
Parameters
• dataset_id (string) – The dataset ID of the requested table. (templated)
• table_id (string) – The table ID of the requested table. (templated)
• max_results (string) – The maximum number of records (rows) to be fetched from
the table. (templated)
• selected_fields (string) – List of fields to return (comma-separated). If unspeci-
fied, all fields are returned.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
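Because the rows are the operator’s return value, they are pushed to XCom under Airflow’s default return_value key; a hedged sketch of consuming them downstream of the get_data example above (task IDs and dag are hypothetical):
from airflow.operators.python_operator import PythonOperator

def print_rows(**context):
    # Pull the list of rows returned by the 'get_data_from_bq' task shown above.
    rows = context['ti'].xcom_pull(task_ids='get_data_from_bq')
    for row in rows:
        print(row)

show_rows = PythonOperator(
    task_id='show_rows',
    python_callable=print_rows,
    provide_context=True,
    dag=dag)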
BigQueryCreateEmptyTableOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new, empty table in the specified BigQuery dataset, optionally with schema.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass
the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google
cloud storage must be a JSON file with the schema fields in it. You can also create a table without schema.
Parameters
• project_id (string) – The project to create the table into. (templated)
• dataset_id (string) – The dataset to create the table into. (templated)
• table_id (string) – The Name of the table to be created. (templated)
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example (illustrative; the task, dataset and table IDs below are hypothetical):
create_table = BigQueryCreateEmptyTableOperator(
    task_id='create_empty_table',
    dataset_id='my_dataset',
    table_id='my_table',
    bigquery_conn_id='airflow-service-account',
    google_cloud_storage_conn_id='airflow-service-account')
BigQueryCreateExternalTableOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new external table in the dataset with the data in Google Cloud Storage.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly
pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in
Google cloud storage must be a JSON file with the schema fields in it.
Parameters
• bucket (string) – The bucket to point the external table to. (templated)
• source_objects (list) – List of Google cloud storage URIs to point table to. (tem-
plated) If source_format is ‘DATASTORE_BACKUP’, the list must only contain a single
URI.
• destination_project_dataset_table (string) – The dotted
(<project>.)<dataset>.<table> BigQuery table to load data into (templated). If <project> is
not included, project will be the project defined in the connection json.
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example (illustrative schema definition):
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
• quote_character (string) – The value that is used to quote data sections in a CSV
file.
• allow_quoted_newlines (boolean) – Whether to allow quoted newlines (true) or
not (false).
• allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns.
The missing values are treated as nulls. If false, records with missing trailing columns are
treated as bad records, and if there are too many bad records, an invalid error is returned in
the job result. Only applicable to CSV, ignored for other formats.
• bigquery_conn_id (string) – Reference to a specific BigQuery hook.
• google_cloud_storage_conn_id (string) – Reference to a specific Google
cloud storage hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• src_fmt_configs (dict) – configure optional fields specific to the source format
• labels (dict) – a dictionary containing labels for the table, passed to BigQuery
BigQueryCreateEmptyDatasetOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator(**kwargs)
Bases: airflow.models.BaseOperator
This operator is used to create a new dataset for your project in BigQuery. https://cloud.google.com/bigquery/
docs/reference/rest/v2/datasets#resource
Parameters
• project_id (str) – The name of the project where we want to create the dataset. Don’t
need to provide, if projectId in dataset_reference.
• dataset_id (str) – The id of dataset. Don’t need to provide, if datasetId in
dataset_reference.
• dataset_reference – Dataset reference that could be provided with request body.
More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
BigQueryDeleteDatasetOperator
class airflow.contrib.operators.bigquery_operator.BigQueryDeleteDatasetOperator(**kwargs)
Bases: airflow.models.BaseOperator
This operator deletes an existing dataset from your project in BigQuery. https://cloud.google.com/bigquery/
docs/reference/rest/v2/datasets/delete
Parameters
• project_id (string) – The project id of the dataset.
• dataset_id (string) – The dataset to be deleted.
Example:
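A minimal sketch; the project, dataset, connection ID, and dag object below are hypothetical:
delete_temp_dataset = BigQueryDeleteDatasetOperator(
    task_id='delete_temp_dataset',
    project_id='temp-project',
    dataset_id='temp-dataset',
    bigquery_conn_id='bigquery_default',
    dag=dag)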
BigQueryTableDeleteOperator
class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Deletes BigQuery tables
Parameters
• deletion_dataset_table (string) – A dotted
(<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted.
(templated)
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• ignore_if_missing (boolean) – if True, then return success even if the requested
table does not exist.
BigQueryOperator
class airflow.contrib.operators.bigquery_operator.BigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Executes BigQuery SQL queries in a specific BigQuery database
Parameters
• bql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql'.) – (Deprecated. Use sql parameter instead) the sql code to be executed (templated)
• sql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql'.) – the sql code to be executed (templated)
• destination_dataset_table (string) – A dotted
(<project>.|<project>:)<dataset>.<table> that, if set, will store the results of the query.
(templated)
• write_disposition (string) – Specifies the action that occurs if the destination
table already exists. (default: ‘WRITE_EMPTY’)
• create_disposition (string) – Specifies whether the job is allowed to create new
tables. (default: ‘CREATE_IF_NEEDED’)
• allow_large_results (boolean) – Whether to allow large results.
• flatten_results (boolean) – If true and query uses legacy SQL dialect, flattens all
nested and repeated fields in the query results. allow_large_results must be true
if this is set to false. For standard SQL queries, this flag is ignored and results are never
flattened.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• udf_config (list) – The User Defined Function configuration for the query. See https:
//cloud.google.com/bigquery/user-defined-functions for details.
• use_legacy_sql (boolean) – Whether to use legacy SQL (true) or standard SQL
(false).
• maximum_billing_tier (integer) – Positive integer that serves as a multiplier of
the basic price. Defaults to None, in which case it uses the value set in the project.
• maximum_bytes_billed (float) – Limits the bytes billed for this job. Queries that
will have bytes billed beyond this limit will fail (without incurring a charge). If unspecified,
this will be set to your project default.
• api_resource_configs (dict) – a dictionary that contain params ‘configuration’
applied for Google BigQuery Jobs API: https://cloud.google.com/bigquery/docs/reference/
rest/v2/jobs for example, {‘query’: {‘useQueryCache’: False}}. You could use it if you
need to provide some params that are not supported by BigQueryOperator like args.
• schema_update_options (tuple) – Allows the schema of the destination table to be
updated as a side effect of the load job.
• query_params (dict) – a dictionary containing query parameter types and values,
passed to BigQuery.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
• priority (string) – Specifies a priority for the query. Possible values include INTER-
ACTIVE and BATCH. The default value is INTERACTIVE.
• time_partitioning (dict) – configure optional time partitioning fields i.e. partition
by field, type and expiration as per API specifications.
• cluster_fields (list of str) – Request that the result of this query be stored
sorted by one or more columns. This is only available in conjunction with time_partitioning.
The order of columns given determines the sort order.
• location (str) – The geographic location of the job. Required except for US and EU.
See details at https://cloud.google.com/bigquery/docs/locations#specifying_your_location
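A hedged sketch of a typical query-to-table task; project, dataset, SQL, connection ID, and dag are hypothetical:
write_daily_summary = BigQueryOperator(
    task_id='write_daily_summary',
    sql='SELECT owner, COUNT(*) AS n FROM my_dataset.events GROUP BY owner',
    destination_dataset_table='my-project.my_dataset.daily_summary',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,
    bigquery_conn_id='bigquery_default',
    dag=dag)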
BigQueryToBigQueryOperator
class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies data from one BigQuery table to another.
See also:
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs#
configuration.copy
Parameters
• source_project_dataset_tables (list|string) – One or more dotted
(project:|project.)<dataset>.<table> BigQuery tables to use as the source data. If <project>
is not included, project will be the project defined in the connection json. Use a list if there
are multiple source tables. (templated)
• destination_project_dataset_table (string) – The destination BigQuery
table. Format is: (project:|project.)<dataset>.<table> (templated)
BigQueryToCloudStorageOperator
class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(**kwargs)
Bases: airflow.models.BaseOperator
Transfers a BigQuery table to a Google Cloud Storage bucket.
See also:
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs
Parameters
• source_project_dataset_table (string) – The dotted (<project>.
|<project>:)<dataset>.<table> BigQuery table to use as the source data. If
<project> is not included, project will be the project defined in the connection json. (tem-
plated)
• destination_cloud_storage_uris (list) – The destination Google Cloud Stor-
age URI (e.g. gs://some-bucket/some-file.txt). (templated) Follows convention defined here:
https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
• compression (string) – Type of compression to use.
• export_format (string) – File format to export.
• field_delimiter (string) – The delimiter to use when extracting to a CSV.
• print_header (boolean) – Whether to print a header for a CSV file extract.
• bigquery_conn_id (string) – reference to a specific BigQuery hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
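A hedged export sketch; the bucket, table, connection ID, and dag object are hypothetical:
export_to_gcs = BigQueryToCloudStorageOperator(
    task_id='export_to_gcs',
    source_project_dataset_table='my-project.my_dataset.my_table',
    destination_cloud_storage_uris=['gs://my-bucket/exports/my_table-*.csv'],
    export_format='CSV',
    field_delimiter=',',
    print_header=True,
    bigquery_conn_id='bigquery_default',
    dag=dag)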
BigQueryHook
class airflow.contrib.hooks.bigquery_hook.BigQueryHook(bigquery_conn_id=’bigquery_default’,
delegate_to=None,
use_legacy_sql=True,
location=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook, airflow.
hooks.dbapi_hook.DbApiHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with BigQuery. This hook uses the Google Cloud Platform connection.
get_conn()
Returns a BigQuery PEP 249 connection object.
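Because get_conn() returns a PEP 249 connection, it can be driven through the usual cursor interface; a hedged sketch (the query and connection ID are illustrative):
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False)
conn = hook.get_conn()
cursor = conn.cursor()
cursor.execute('SELECT 1 AS one')   # illustrative query
print(cursor.fetchall())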
CloudSpannerInstanceDatabaseDeleteOperator
CloudSpannerInstanceDatabaseDeployOperator
CloudSpannerInstanceDatabaseUpdateOperator
CloudSpannerInstanceDatabaseQueryOperator
CloudSpannerInstanceDeployOperator
CloudSpannerInstanceDeleteOperator
CloudSpannerHook
CloudSqlInstanceDatabaseDeleteOperator
CloudSqlInstanceDatabaseCreateOperator
CloudSqlInstanceDatabasePatchOperator
CloudSqlInstanceDeleteOperator
CloudSqlInstanceExportOperator
CloudSqlInstanceImportOperator
CloudSqlInstanceCreateOperator
CloudSqlInstancePatchOperator
CloudSqlQueryOperator
BigtableInstanceCreateOperator
BigtableInstanceDeleteOperator
BigtableClusterUpdateOperator
BigtableTableCreateOperator
BigtableTableDeleteOperator
BigtableTableWaitForReplicationSensor
GceInstanceStartOperator
class airflow.contrib.operators.gcp_compute_operator.GceInstanceStartOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Starts an instance in Google Compute Engine.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists.
• resource_id (str) – Name of the Compute Engine instance resource.
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body – Optional, If set to False, body validation is not performed. Defaults to
False.
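A hedged example; the zone, instance name, project ID, and dag object are hypothetical:
start_instance = GceInstanceStartOperator(
    task_id='gce_instance_start',
    zone='europe-west1-b',
    resource_id='my-instance',
    project_id='my-project',
    gcp_conn_id='google_cloud_default',
    dag=dag)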
GceInstanceStopOperator
class airflow.contrib.operators.gcp_compute_operator.GceInstanceStopOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Stops an instance in Google Compute Engine.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists.
• resource_id (str) – Name of the Compute Engine instance resource.
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body – Optional, If set to False, body validation is not performed. Defaults to
False.
GceSetMachineTypeOperator
class airflow.contrib.operators.gcp_compute_operator.GceSetMachineTypeOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Changes the machine type for a stopped instance to the machine type specified in the request.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists.
• resource_id (str) – Name of the Compute Engine instance resource.
• body (dict) – Body required by the Compute Engine setMachineType API, as described
in https://cloud.google.com/compute/docs/reference/rest/v1/instances/setMachineType#
request-body
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body (bool) – Optional, If set to False, body validation is not performed.
Defaults to False.
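A hedged sketch; the body follows the setMachineType request format linked above, and all IDs and the dag object are hypothetical:
set_machine_type = GceSetMachineTypeOperator(
    task_id='gce_set_machine_type',
    zone='europe-west1-b',
    resource_id='my-instance',
    body={'machineType': 'zones/europe-west1-b/machineTypes/n1-standard-2'},
    project_id='my-project',
    dag=dag)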
GceInstanceTemplateCopyOperator
class airflow.contrib.operators.gcp_compute_operator.GceInstanceTemplateCopyOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Copies the instance template, applying specified changes.
Parameters
• resource_id (str) – Name of the Instance Template
• body_patch (dict) – Patch to the body of instanceTemplates object following rfc7386
PATCH semantics. The body_patch content follows https://cloud.google.com/compute/
docs/reference/rest/v1/instanceTemplates Name field is required as we need to rename the
template, all the other fields are optional. It is important to follow PATCH semantics - ar-
rays are replaced fully, so if you need to update an array you should provide the whole target
array as patch element.
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• request_id (str) – Optional, unique request_id that you might add to achieve full idem-
potence (for example when client call times out repeating the request with the same request
id will not create a new instance template again). It should be in UUID format as defined in
RFC 4122.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body (bool) – Optional, If set to False, body validation is not performed.
Defaults to False.
GceInstanceGroupManagerUpdateTemplateOperator
class airflow.contrib.operators.gcp_compute_operator.GceInstanceGroupManagerUpdateTemplateOperator(**kwargs)
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Patches the Instance Group Manager, replacing source template URL with the destination one. API V1 does not
have update/patch operations for Instance Group Manager, so you must use beta or newer API version. Beta is
the default.
Parameters
• resource_id (str) – Name of the Instance Group Manager
• zone (str) – Google Cloud Platform zone where the Instance Group Manager exists.
• source_template (str) – URL of the template to replace.
• destination_template (str) – URL of the target template.
• project_id (str) – Optional, Google Cloud Platform Project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
• request_id (str) – Optional, unique request_id that you might add to achieve full idem-
potence (for example when client call times out repeating the request with the same request
id will not create a new instance template again). It should be in UUID format as defined in
RFC 4122.
• gcp_conn_id (str) – Optional, The connection ID used to connect to Google Cloud
Platform. Defaults to ‘google_cloud_default’.
• api_version (str) – Optional, API version used (for example v1 - or beta). Defaults to
v1.
• validate_body (bool) – Optional, If set to False, body validation is not performed.
Defaults to False.
class airflow.contrib.hooks.gcp_compute_hook.GceHook(api_version=’v1’,
gcp_conn_id=’google_cloud_default’,
delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for Google Compute Engine APIs.
All the methods in the hook where project_id is used must be called with keyword arguments rather than posi-
tional.
get_conn()
Retrieves connection to Google Compute Engine.
Returns Google Compute Engine services object
Return type dict
get_instance_group_manager(*args, **kwargs)
Retrieves Instance Group Manager by project_id, zone and resource_id. Must be called with keyword
arguments rather than positional.
Parameters
• zone (str) – Google Cloud Platform zone where the Instance Group Manager exists
• resource_id (str) – Name of the Instance Group Manager
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns Instance group manager representation as object according to https://cloud.google.com/
compute/docs/reference/rest/beta/instanceGroupManagers
Return type dict
get_instance_template(*args, **kwargs)
Retrieves instance template by project_id and resource_id. Must be called with keyword arguments rather
than positional.
Parameters
• resource_id (str) – Name of the instance template
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns Instance template representation as object according to https://cloud.google.com/
compute/docs/reference/rest/v1/instanceTemplates
Return type dict
start_instance(*args, **kwargs)
Starts an existing instance defined by project_id, zone and resource_id. Must be called with keyword
arguments rather than positional.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists
• resource_id (str) – Name of the Compute Engine instance resource
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns None
stop_instance(*args, **kwargs)
Stops an instance defined by project_id, zone and resource_id Must be called with keyword arguments
rather than positional.
Parameters
• zone (str) – Google Cloud Platform zone where the instance exists
• resource_id (str) – Name of the Compute Engine instance resource
• project_id (str) – Optional, Google Cloud Platform project ID where the Compute
Engine Instance exists. If set to None or missing, the default project_id from the GCP
connection is used.
Returns None
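As noted above, the hook’s methods must be called with keyword arguments; a hedged sketch (the zone, instance, and project IDs are hypothetical):
from airflow.contrib.hooks.gcp_compute_hook import GceHook

hook = GceHook(api_version='v1', gcp_conn_id='google_cloud_default')
hook.stop_instance(zone='europe-west1-b', resource_id='my-instance',
                   project_id='my-project')
hook.start_instance(zone='europe-west1-b', resource_id='my-instance',
                    project_id='my-project')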
GcfFunctionDeployOperator
class airflow.contrib.operators.gcp_function_operator.GcfFunctionDeployOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a function in Google Cloud Functions. If a function with this name already exists, it will be updated.
Parameters
• location (str) – Google Cloud Platform region where the function should be created.
GcfFunctionDeleteOperator
class airflow.contrib.operators.gcp_function_operator.GcfFunctionDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Deletes the specified function from Google Cloud Functions.
Parameters
• name (str) – A fully-qualified function name, matching the pattern:
^projects/[^/]+/locations/[^/]+/functions/[^/]+$
• gcp_conn_id (str) – The connection ID to use to connect to Google Cloud Platform.
• api_version (str) – API version used (for example v1 or v1beta1).
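A hedged example; the fully-qualified function name follows the pattern documented above, and all IDs and the dag object are hypothetical:
delete_function = GcfFunctionDeleteOperator(
    task_id='gcf_delete_function',
    name='projects/my-project/locations/europe-west1/functions/my-function',
    gcp_conn_id='google_cloud_default',
    api_version='v1',
    dag=dag)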
class airflow.contrib.hooks.gcp_function_hook.GcfHook(api_version,
gcp_conn_id=’google_cloud_default’,
delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for the Google Cloud Functions APIs.
create_new_function(*args, **kwargs)
Creates a new function in Cloud Functions in the location specified in the body.
Parameters
• location (str) – The location of the function.
• body (dict) – The body required by the Cloud Functions insert API.
• project_id (str) – Optional, Google Cloud Project project_id where the function
belongs. If set to None or missing, the default project_id from the GCP connection is
used.
Returns None
delete_function(name)
Deletes the specified Cloud Function.
Parameters name (str) – The name of the function.
Returns None
get_conn()
Retrieves the connection to Cloud Functions.
Returns Google Cloud Function services object.
Return type dict
get_function(name)
Returns the Cloud Function with the given name.
Parameters name (str) – Name of the function.
Returns A Cloud Functions object representing the function.
Return type dict
update_function(name, body, update_mask)
Updates Cloud Functions according to the specified update mask.
Parameters
• name (str) – The name of the function.
• body (dict) – The body required by the cloud function patch API.
• update_mask ([str]) – The update mask - array of fields that should be patched.
Returns None
upload_function_zip(*args, **kwargs)
Uploads zip file with sources.
Parameters
• location (str) – The location where the function is created.
• zip_path (str) – The path of the valid .zip file to upload.
• project_id (str) – Optional, Google Cloud Project project_id where the function
belongs. If set to None or missing, the default project_id from the GCP connection is
used.
Returns The upload URL that was returned by generateUploadUrl method.
DataFlow Operators
DataFlowJavaOperator
class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Java Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
See also:
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/
specifying-exec-params
Parameters
• jar (string) – The reference to a self executing DataFlow jar.
• dataflow_default_options (dict) – Map of default job options.
• options (dict) – Map of job specific options.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
• job_class (string) – The name of the dataflow job class to be executed; it is often not the main class configured in the dataflow jar file.
Both jar and options are templated so you can use variables in them.
Note that both dataflow_default_options and options will be merged to specify pipeline execution
parameters, and dataflow_default_options is expected to save high-level options, for instance project
and zone information, which apply to all dataflow operators in the DAG.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and
staging location.
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'stagingLocation': 'gs://my-staging-bucket/staging/'
    }
}
You need to pass the path to your dataflow jar as a file reference with the jar parameter; the jar needs to
be a self-executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/
#self-executing-jar). Use options to pass on options to your job.
t1 = DataFlowJavaOperator(
    task_id='dataflow_example',
    jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50'
    },
    dag=dag)
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 8, 1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=30),
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'us-central1-f',
        'stagingLocation': 'gs://bucket/tmp/dataflow/staging/',
    }
}

task = DataFlowJavaOperator(
    gcp_conn_id='gcp_default',
    task_id='normalize-cal',
    jar='{{var.value.gcp_dataflow_base}}pipeline-ingress-cal-normalize-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY'
    },
    dag=dag)
DataflowTemplateOperator
class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Templated Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
Parameters
• template (string) – The reference to the DataFlow template.
• dataflow_default_options (dict) – Map of default job environment options.
• parameters (dict) – Map of job specific parameters for the template.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and
staging location.
See also:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.
com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'tempLocation': 'gs://my-staging-bucket/staging/'
    }
}
You need to pass the path to your dataflow template as a file reference with the template parameter. Use
parameters to pass on parameters to your job. Use environment to pass on runtime environment variables
to your job.
t1 = DataflowTemplateOperator(
    task_id='dataflow_example',
    template='{{var.value.gcp_dataflow_base}}',
    parameters={
        'inputFile': "gs://bucket/input/my_input.txt",
        'outputFile': "gs://bucket/output/my_output.txt"
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=dag)
template, dataflow_default_options and parameters are templated so you can use variables in
them.
Note that dataflow_default_options is expected to save high-level options for project information,
which apply to all dataflow operators in the DAG.
See also:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/
templates/executing-templates
DataFlowPythonOperator
class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(**kwargs)
Bases: airflow.models.BaseOperator
Launches Cloud Dataflow jobs written in Python. Note that both dataflow_default_options and options will
be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level
options, for instance project and zone information, which apply to all dataflow operators in the DAG.
See also:
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/
specifying-exec-params
Parameters
• py_file (string) – Reference to the Python Dataflow pipeline file (.py), e.g.
/some/local/file/path/to/your/python/pipeline/file.py
• py_options – Additional python options.
• dataflow_default_options (dict) – Map of default job options.
• options (dict) – Map of job specific options.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Plat-
form for the dataflow job status while the job is in the JOB_STATE_RUNNING state.
execute(context)
Execute the python dataflow job.
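A hedged sketch of launching a Python pipeline; the local pipeline path, bucket, project, and dag object are hypothetical:
run_wordcount = DataFlowPythonOperator(
    task_id='dataflow_python_example',
    py_file='/home/airflow/dags/pipelines/wordcount.py',
    options={'output': 'gs://my-bucket/wordcount/output'},
    dataflow_default_options={
        'project': 'my-gcp-project',
        'staging_location': 'gs://my-bucket/staging',
        'temp_location': 'gs://my-bucket/temp'
    },
    gcp_conn_id='google_cloud_default',
    dag=dag)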
DataFlowHook
class airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook(gcp_conn_id=’google_cloud_default’,
delegate_to=None,
poll_sleep=10)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
get_conn()
Returns a Google Cloud Dataflow service object.
DataProc Operators
DataprocClusterCreateOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(**kwargs)
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an
error occurs in the creation process.
The parameters allow you to configure the cluster. Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
for a detailed explanation on the different parameters. Most of the configuration parameters detailed in the link
are available as a parameter to this operator.
Parameters
• cluster_name (string) – The name of the DataProc cluster to create. (templated)
• project_id (str) – The ID of the google cloud project in which to create the cluster.
(templated)
• num_workers (int) – The number of workers to spin up. If set to zero, the cluster will be spun up in single-node mode.
• storage_bucket (string) – The storage bucket to use, setting to None lets dataproc
generate a custom one for you
• init_actions_uris (list[string]) – List of GCS uri’s containing dataproc ini-
tialization scripts
• init_action_timeout (string) – Amount of time executable scripts in
init_actions_uris has to complete
• metadata (dict) – dict of key-value google compute engine metadata entries to add to
all instances
• image_version (string) – the version of software inside the Dataproc cluster
• custom_image (string) – custom Dataproc image; for more info see https://cloud.google.com/
dataproc/docs/guides/dataproc-images
• properties (dict) – dict of properties to set on config files (e.g. spark-defaults.conf),
see https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#
SoftwareConfig
• master_machine_type (string) – Compute engine machine type to use for the mas-
ter node
• master_disk_type (string) – Type of the boot disk for the master node (de-
fault is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or
pd-standard (Persistent Disk Hard Disk Drive).
• master_disk_size (int) – Disk size for the master node
• worker_machine_type (string) – Compute engine machine type to use for the
worker nodes
• worker_disk_type (string) – Type of the boot disk for the worker node (de-
fault is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or
pd-standard (Persistent Disk Hard Disk Drive).
• worker_disk_size (int) – Disk size for the worker nodes
• num_preemptible_workers (int) – The # of preemptible worker nodes to spin up
• labels (dict) – dict of labels to add to the cluster
• zone (string) – The zone where the cluster will be located. (templated)
• network_uri (string) – The network uri to be used for machine communication, can-
not be specified with subnetwork_uri
• subnetwork_uri (string) – The subnetwork uri to be used for machine communica-
tion, cannot be specified with network_uri
• internal_ip_only (bool) – If true, all instances in the cluster will only have internal
IP addresses. This can only be enabled for subnetwork enabled networks
• tags (list[string]) – The GCE tags to add to all instances
• region – leave as ‘global’, might become relevant in the future. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• service_account (string) – The service account of the dataproc instances.
• service_account_scopes (list[string]) – The URIs of service account scopes
to be included.
• idle_delete_ttl (int) – The longest duration that the cluster will stay alive while idle. Exceeding this threshold causes the cluster to be auto-deleted. A duration in seconds.
• auto_delete_time (datetime.datetime) – The time when cluster will be auto-
deleted.
• auto_delete_ttl (int) – The life duration of cluster, the cluster will be auto-deleted
at the end of this duration. A duration in seconds. (If auto_delete_time is set this parameter
will be ignored)
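A hedged minimal cluster definition; the cluster name, project, zone, bucket, and dag object are hypothetical and most other parameters keep their defaults:
create_cluster = DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    cluster_name='analytics-cluster-{{ ds_nodash }}',
    project_id='my-gcp-project',
    num_workers=2,
    zone='europe-west1-d',
    region='global',
    storage_bucket='my-dataproc-staging-bucket',
    dag=dag)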
DataprocClusterScaleOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(**kwargs)
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
Example:
t1 = DataprocClusterScaleOperator(
    task_id='dataproc_scale',
    project_id='my-project',
    cluster_name='cluster-1',
    num_workers=10,                       # illustrative scaling values
    num_preemptible_workers=10,
    graceful_decommission_timeout='1h',
    dag=dag)
See also:
For more detail about scaling clusters, have a look at the reference: https://cloud.google.com/dataproc/docs/
concepts/configuring-clusters/scaling-clusters
Parameters
• cluster_name (string) – The name of the cluster to scale. (templated)
• project_id (string) – The ID of the google cloud project in which the cluster runs.
(templated)
• region (string) – The region for the dataproc cluster. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• num_workers (int) – The new number of workers
• num_preemptible_workers (int) – The new number of preemptible workers
• graceful_decommission_timeout (string) – Timeout for graceful YARN decommissioning. Maximum value is 1d.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
DataprocClusterDeleteOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(**kwargs)
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
Parameters
• cluster_name (string) – The name of the cluster to delete. (templated)
• project_id (string) – The ID of the google cloud project in which the cluster runs.
(templated)
• region (string) – leave as ‘global’, might become relevant in the future. (templated)
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
DataProcPigOperator
class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.
It’s a good practice to define dataproc_* parameters in the default_args of the dag like the cluster name and
UDFs.
default_args = {
    'cluster_name': 'cluster-1',
    'dataproc_pig_jars': [
        'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
        'gs://example/udf/jar/gpig/1.2/gpig.jar'
    ]
}
You can pass a Pig script as a string or as a file reference. Use variables to pass values to the Pig script that are
resolved on the cluster, or use templated parameters that are resolved in the script itself.
Example:
t1 = DataProcPigOperator(
    task_id='dataproc_pig',
    query='a_pig_script.pig',
    variables={'out': 'gs://example/output/{{ds}}'},
    dag=dag)
See also:
For more detail about job submission, have a look at the reference: https://cloud.google.com/dataproc/
reference/rest/v1/projects.regions.jobs
Parameters
• query (string) – The query or reference to the query file (pg or pig extension). (tem-
plated)
• query_uri (string) – The uri of a pig script on Cloud Storage.
• variables (dict) – Map of named parameters for the query. (templated)
• job_name (string) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_pig_properties (dict) – Map for the Pig properties. Ideal to put in
default arguments
• dataproc_pig_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
DataProcHiveOperator
class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
Parameters
• query (string) – The query or reference to the query file (q extension).
• query_uri (string) – The uri of a hive script on Cloud Storage.
• variables (dict) – Map of named parameters for the query.
• job_name (string) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes.
• cluster_name (string) – The name of the DataProc cluster.
• dataproc_hive_properties (dict) – Map for the Hive properties. Ideal to put in default arguments
• dataproc_hive_jars (list) – URIs to jars provisioned in Cloud Storage (example:
for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
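Example: a minimal sketch of a Hive job submission; the query, cluster name and the dag object are placeholders assumed to be defined elsewhere in your DAG file:
from airflow.contrib.operators.dataproc_operator import DataProcHiveOperator

hive_task = DataProcHiveOperator(
    task_id='dataproc_hive',
    query='SHOW DATABASES;',
    cluster_name='cluster-1',
    dag=dag)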
DataProcSparkSqlOperator
class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
Parameters
• query (string) – The query or reference to the query file (q extension). (templated)
• query_uri (string) – The uri of a spark sql script on Cloud Storage.
DataProcSparkOperator
class airflow.contrib.operators.dataproc_operator.DataProcSparkOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Spark Job on a Cloud DataProc cluster.
Parameters
• main_jar (string) – URI of the job jar provisioned on Cloud Storage. (use this or the
main_class, not both together).
• main_class (string) – Name of the job class. (use this or the main_jar, not both
together).
• arguments (list) – Arguments for the job. (templated)
• archives (list) – List of archived files that will be unpacked in the work directory.
Should be stored in Cloud Storage.
• files (list) – List of files to be copied to the working directory
• job_name (string) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_spark_properties (dict) – Map for the Spark properties. Ideal to put in default arguments
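Example: a minimal sketch that submits a jar already staged in Cloud Storage; the jar path, arguments, cluster name and dag object are placeholders:
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

spark_task = DataProcSparkOperator(
    task_id='dataproc_spark',
    main_jar='gs://example/jars/spark-job.jar',
    arguments=['gs://example/input/', 'gs://example/output/{{ds}}'],
    cluster_name='cluster-1',
    dag=dag)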
DataProcHadoopOperator
class airflow.contrib.operators.dataproc_operator.DataProcHadoopOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Hadoop Job on a Cloud DataProc cluster.
Parameters
• main_jar (string) – URI of the job jar provisioned on Cloud Storage. (use this or the
main_class, not both together).
• main_class (string) – Name of the job class. (use this or the main_jar, not both
together).
• arguments (list) – Arguments for the job. (templated)
• archives (list) – List of archived files that will be unpacked in the work directory.
Should be stored in Cloud Storage.
• files (list) – List of files to be copied to the working directory
• job_name (string) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster. (templated)
• dataproc_hadoop_properties (dict) – Map for the Hadoop properties. Ideal to put in default arguments
• dataproc_hadoop_jars (list) – URIs to jars provisioned in Cloud Storage (exam-
ple: for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
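Example: a minimal sketch of a Hadoop job driven by a jar in Cloud Storage; paths, cluster name and dag object are placeholders:
from airflow.contrib.operators.dataproc_operator import DataProcHadoopOperator

hadoop_task = DataProcHadoopOperator(
    task_id='dataproc_hadoop',
    main_jar='gs://example/jars/wordcount.jar',
    arguments=['gs://example/input/', 'gs://example/output/{{ds}}'],
    cluster_name='cluster-1',
    dag=dag)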
DataProcPySparkOperator
class airflow.contrib.operators.dataproc_operator.DataProcPySparkOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a PySpark Job on a Cloud DataProc cluster.
Parameters
• main (string) – [Required] The Hadoop Compatible Filesystem (HCFS) URI of the
main Python file to use as the driver. Must be a .py file.
• arguments (list) – Arguments for the job. (templated)
• archives (list) – List of archived files that will be unpacked in the work directory.
Should be stored in Cloud Storage.
• files (list) – List of files to be copied to the working directory
• pyfiles (list) – List of Python files to pass to the PySpark framework. Supported file
types: .py, .egg, and .zip
• job_name (string) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
• cluster_name (string) – The name of the DataProc cluster.
• dataproc_pyspark_properties (dict) – Map for the PySpark properties. Ideal to put in default arguments
• dataproc_pyspark_jars (list) – URIs to jars provisioned in Cloud Storage (exam-
ple: for UDFs and libs) and are ideal to put in default arguments.
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• region (str) – The specified region where the dataproc cluster is created.
• job_error_states (list) – Job states that should be considered error states. Any
states in this list will result in an error being raised and failure of the task. Eg, if
the CANCELLED state should also be considered a task failure, pass in ['ERROR',
'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but
could change in the future. Defaults to ['ERROR'].
Variables dataproc_job_id (string) – The actual “jobId” as submitted to the Dataproc API.
This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as
the actual “jobId” submitted to the Dataproc API is appended with an 8 character random string.
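Example: a minimal sketch of a PySpark job whose driver file lives in Cloud Storage; paths, cluster name and dag object are placeholders:
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

pyspark_task = DataProcPySparkOperator(
    task_id='dataproc_pyspark',
    main='gs://example/pyspark/job.py',
    arguments=['--output', 'gs://example/output/{{ds}}'],
    cluster_name='cluster-1',
    dag=dag)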
DataprocWorkflowTemplateInstantiateOperator
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateOperator(**kwargs)
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate
is finished executing.
See also:
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.
workflowTemplates/instantiate
Parameters
• template_id (string) – The id of the template. (templated)
• project_id (string) – The ID of the google cloud project in which the template runs
• region (string) – leave as ‘global’, might become relevant in the future
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
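Example: a minimal sketch that instantiates an existing workflow template; the template id, project id and dag object are placeholders:
from airflow.contrib.operators.dataproc_operator import DataprocWorkflowTemplateInstantiateOperator

instantiate_template = DataprocWorkflowTemplateInstantiateOperator(
    task_id='instantiate_workflow',
    template_id='my-workflow-template',
    project_id='my-gcp-project',
    region='global',
    dag=dag)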
DataprocWorkflowTemplateInstantiateInlineOperator
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateInlineOperator(**kwargs)
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. The operator will wait until the Work-
flowTemplate is finished executing.
See also:
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.
workflowTemplates/instantiateInline
Parameters
• template (map) – The template contents. (templated)
• project_id (string) – The ID of the google cloud project in which the template runs
• region (string) – leave as ‘global’, might become relevant in the future
• gcp_conn_id (string) – The connection ID to use connecting to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
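Example: a minimal sketch of an inline instantiation. The template dict below only illustrates the shape of the payload (a cluster selector plus a single Pig job step); see the workflowTemplates reference above for the full schema. All names are placeholders:
from airflow.contrib.operators.dataproc_operator import DataprocWorkflowTemplateInstantiateInlineOperator

# illustrative template body; consult the WorkflowTemplate schema for all fields
inline_template = {
    'placement': {
        'clusterSelector': {'clusterLabels': {'env': 'dev'}},
    },
    'jobs': [
        {'stepId': 'pig-step',
         'pigJob': {'queryFileUri': 'gs://example/scripts/a_pig_script.pig'}},
    ],
}

instantiate_inline = DataprocWorkflowTemplateInstantiateInlineOperator(
    task_id='instantiate_inline_workflow',
    template=inline_template,
    project_id='my-gcp-project',
    region='global',
    dag=dag)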
Datastore Operators
DatastoreExportOperator
class airflow.contrib.operators.datastore_export_operator.DatastoreExportOperator(**kwargs)
Bases: airflow.models.BaseOperator
Export entities from Google Cloud Datastore to Cloud Storage
Parameters
• bucket (string) – name of the cloud storage bucket to backup data
• namespace (str) – optional namespace path in the specified Cloud Storage bucket to
backup data. If this namespace does not exist in GCS, it will be created.
• datastore_conn_id (string) – the name of the Datastore connection id to use
• cloud_storage_conn_id (string) – the name of the cloud storage connection id to
force-write backup
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• entity_filter (dict) – description of what data from the project is included in
the export, refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/
EntityFilter
• labels (dict) – client-assigned labels for cloud storage
• polling_interval_in_seconds (int) – number of seconds to wait before polling
for execution status again
• overwrite_existing (bool) – if the storage bucket + namespace is not empty, it will
be emptied prior to exports. This enables overwriting existing backups.
• xcom_push (bool) – push operation name to xcom for reference
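Example: a minimal sketch of a recurring export; the bucket, namespace and dag object are placeholders:
from airflow.contrib.operators.datastore_export_operator import DatastoreExportOperator

export_entities = DatastoreExportOperator(
    task_id='datastore_export',
    bucket='my-datastore-backups',
    namespace='exports/{{ds}}',
    overwrite_existing=True,
    dag=dag)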
DatastoreImportOperator
class airflow.contrib.operators.datastore_import_operator.DatastoreImportOperator(**kwargs)
Bases: airflow.models.BaseOperator
Import entities from Cloud Storage to Google Cloud Datastore
Parameters
• bucket (string) – container in Cloud Storage to store data
• file (string) – path of the backup metadata file in the specified Cloud Storage bucket.
It should have the extension .overall_export_metadata
• namespace (str) – optional namespace of the backup metadata file in the specified Cloud
Storage bucket.
• entity_filter (dict) – description of what data from the project is included in
the export, refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/
EntityFilter
• labels (dict) – client-assigned labels for cloud storage
• datastore_conn_id (string) – the name of the connection id to use
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• polling_interval_in_seconds (int) – number of seconds to wait before polling
for execution status again
• xcom_push (bool) – push operation name to xcom for reference
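Example: a minimal sketch that restores a previous export; the bucket, metadata file path and dag object are placeholders:
from airflow.contrib.operators.datastore_import_operator import DatastoreImportOperator

import_entities = DatastoreImportOperator(
    task_id='datastore_import',
    bucket='my-datastore-backups',
    file='exports/2018-01-01/2018-01-01.overall_export_metadata',
    dag=dag)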
DatastoreHook
class airflow.contrib.hooks.datastore_hook.DatastoreHook(datastore_conn_id='google_cloud_datastore_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Datastore. This hook uses the Google Cloud Platform connection.
This object is not thread safe. If you want to make multiple requests simultaneously, you will need to create a hook per thread.
allocate_ids(partialKeys)
Allocate IDs for incomplete keys. see https://cloud.google.com/datastore/docs/reference/rest/v1/projects/
allocateIds
Parameters partialKeys – a list of partial keys
Returns a list of full keys.
begin_transaction()
Get a new transaction handle
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/beginTransaction
commit(body)
Commit a transaction, optionally creating, deleting or modifying some entities.
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/commit
delete_operation(name)
Deletes the long-running operation
Parameters name – the name of the operation resource
export_to_storage_bucket(bucket, namespace=None, entity_filter=None, labels=None)
Export entities from Cloud Datastore to Cloud Storage for backup
get_conn(version=’v1’)
Returns a Google Cloud Datastore service object.
get_operation(name)
Gets the latest state of a long-running operation
Parameters name – the name of the operation resource
lookup(keys, read_consistency=None, transaction=None)
Lookup some entities by key.
Parameters
• keys – the keys to lookup
• read_consistency – the read consistency to use. default, strong or eventual. Cannot
be used with a transaction.
• transaction – the transaction to use, if any.
Returns the response body of the lookup request.
poll_operation_until_done(name, polling_interval_in_seconds)
Poll backup operation state until it’s completed
rollback(transaction)
Roll back a transaction
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/rollback
run_query(body)
Run a query for entities.
See also:
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/runQuery
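The hook is normally used from Python code, for example inside a PythonOperator callable, rather than on its own. A minimal sketch, assuming the export call returns the long-running operation resource whose name field can then be polled; the bucket name and polling interval are placeholders:
from airflow.contrib.hooks.datastore_hook import DatastoreHook

def backup_datastore():
    hook = DatastoreHook()
    # start an export to Cloud Storage and block until the operation completes
    operation = hook.export_to_storage_bucket(bucket='my-datastore-backups')
    hook.poll_operation_until_done(operation['name'], polling_interval_in_seconds=10)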
MLEngineBatchPredictionOperator
class airflow.contrib.operators.mlengine_operator.MLEngineBatchPredictionOperator(**kwargs)
Bases: airflow.models.BaseOperator
Start a Google Cloud ML Engine prediction job.
NOTE: For model origin, users should consider exactly one of the three options below:
1. Populate the ‘uri’ field only, which should be a GCS location that points to a TensorFlow SavedModel directory.
2. Populate the ‘model_name’ field only, which refers to an existing model; the default version of the model will be used.
3. Populate both the ‘model_name’ and ‘version_name’ fields, which refers to a specific version of a specific model.
In options 2 and 3, both model and version name should contain the minimal identifier. For instance, call
MLEngineBatchPredictionOperator(
    ...,
    model_name='my_model',
    version_name='my_version',
    ...)
• gcp_conn_id (string) – The connection ID used for connection to Google Cloud Plat-
form.
• delegate_to (string) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
MLEngineModelOperator
class airflow.contrib.operators.mlengine_operator.MLEngineModelOperator(**kwargs)
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine model.
Parameters
• project_id (string) – The Google Cloud project name to which MLEngine model
belongs. (templated)
• model (dict) – A dictionary containing the information about the model. If the operation
is create, then the model parameter should contain all the information about this model such
as name.
If the operation is get, the model parameter should contain the name of the model.
• operation (string) – The operation to perform. Available operations are:
– create: Creates a new model as provided by the model parameter.
– get: Gets a particular model where the name is specified in model.
• gcp_conn_id (string) – The connection ID to use when fetching connection info.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
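Example: a minimal sketch that creates a model; the project id, model name and dag object are placeholders:
from airflow.contrib.operators.mlengine_operator import MLEngineModelOperator

create_model = MLEngineModelOperator(
    task_id='create_model',
    project_id='my-gcp-project',
    model={'name': 'my_model'},
    operation='create',
    dag=dag)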
MLEngineTrainingOperator
class airflow.contrib.operators.mlengine_operator.MLEngineTrainingOperator(**kwargs)
Bases: airflow.models.BaseOperator
Operator for launching a MLEngine training job.
Parameters
• project_id (string) – The Google Cloud project name within which MLEngine train-
ing job should run (templated).
• job_id (string) – A unique templated id for the submitted Google MLEngine training
job. (templated)
• package_uris (string) – A list of package locations for MLEngine training job,
which should include the main training program + any additional dependencies. (templated)
• training_python_module (string) – The Python module name to run within
MLEngine training job after installing ‘package_uris’ packages. (templated)
• training_args (string) – A list of templated command line arguments to pass to the
MLEngine training program. (templated)
• region (string) – The Google Compute Engine region to run the MLEngine training
job in (templated).
• scale_tier (string) – Resource tier for MLEngine training job. (templated)
• runtime_version (string) – The Google Cloud ML runtime version to use for train-
ing. (templated)
• python_version (string) – The version of Python used in training. (templated)
• job_dir (string) – A Google Cloud Storage path in which to store training outputs and
other data needed for training. (templated)
• gcp_conn_id (string) – The connection ID to use when fetching connection info.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• mode (string) – Can be one of ‘DRY_RUN’/’CLOUD’. In ‘DRY_RUN’ mode, no real
training job will be launched, but the MLEngine training job request will be printed out. In
‘CLOUD’ mode, a real MLEngine training job creation request will be issued.
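Example: a minimal sketch of a training job submission; the project id, package URI, module name, job id and dag object are placeholders:
from airflow.contrib.operators.mlengine_operator import MLEngineTrainingOperator

training = MLEngineTrainingOperator(
    task_id='ml_training',
    project_id='my-gcp-project',
    job_id='train_{{ds_nodash}}',
    package_uris=['gs://example/packages/trainer-0.1.tar.gz'],
    training_python_module='trainer.task',
    training_args=['--epochs', '10'],
    region='us-central1',
    scale_tier='STANDARD_1',
    dag=dag)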
MLEngineVersionOperator
class airflow.contrib.operators.mlengine_operator.MLEngineVersionOperator(**kwargs)
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine version.
Parameters
• project_id (string) – The Google Cloud project name to which MLEngine model
belongs.
• model_name (string) – The name of the Google Cloud ML Engine model that the
version belongs to. (templated)
• version_name (string) – A name to use for the version being operated upon. If not
None and the version argument is None or does not have a value for the name key, then this
will be populated in the payload for the name key. (templated)
• version (dict) – A dictionary containing the information about the version. If the oper-
ation is create, version should contain all the information about this version such as name,
and deploymentUrl. If the operation is get or delete, the version parameter should contain
the name of the version. If it is None, the only operation possible would be list. (templated)
• operation (string) – The operation to perform. Available operations are:
– create: Creates a new version in the model specified by model_name, in which case
the version parameter should contain all the information to create that version (e.g. name,
deploymentUrl).
– get: Gets full information of a particular version in the model specified by model_name.
The name of the version should be specified in the version parameter.
– list: Lists all available versions of the model specified by model_name.
– delete: Deletes the version specified in the version parameter from the model specified by model_name. The name of the version should be specified in the version parameter.
• gcp_conn_id (string) – The connection ID to use when fetching connection info.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
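Example: a minimal sketch that creates a new version of an existing model. The version dict is only illustrative; its field names follow the Cloud ML Engine Versions resource, and the project, model and GCS path are placeholders:
from airflow.contrib.operators.mlengine_operator import MLEngineVersionOperator

create_version = MLEngineVersionOperator(
    task_id='create_version',
    project_id='my-gcp-project',
    model_name='my_model',
    version={
        'name': 'v1',
        # GCS path of the exported SavedModel (deploymentUri in the Versions API)
        'deploymentUri': 'gs://example/exported_model/',
    },
    operation='create',
    dag=dag)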
MLEngineHook
class airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook(gcp_conn_id=’google_cloud_default’,
delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
create_job(project_id, job, use_existing_job_fn=None)
Launches a MLEngine job and waits for it to reach a terminal state.
Parameters
• project_id (string) – The Google Cloud project id within which MLEngine job
will be launched.
• job (dict) – MLEngine Job object that should be provided to the MLEngine API, such
as:
{
    'jobId': 'my_job_id',
    'trainingInput': {
        'scaleTier': 'STANDARD_1',
        ...
    }
}
list_versions(project_id, model_name)
Lists all available versions of a model. Blocks until finished.
set_default_version(project_id, model_name, version_name)
Sets a version to be the default. Blocks until finished.
Storage Operators
FileToGoogleCloudStorageOperator
class airflow.contrib.operators.file_to_gcs.FileToGoogleCloudStorageOperator(**kwargs)
Bases: airflow.models.BaseOperator
Uploads a file to Google Cloud Storage. Optionally can compress the file for upload.
Parameters
• src (string) – Path to the local file. (templated)
• dst (string) – Destination path within the specified bucket. (templated)
• bucket (string) – The bucket to upload to. (templated)
• google_cloud_storage_conn_id (string) – The Airflow connection ID to up-
load with
• mime_type (string) – The mime-type string
• delegate_to (str) – The account to impersonate, if any
• gzip (bool) – Allows for file to be compressed and uploaded as gzip
execute(context)
Uploads the file to Google cloud storage
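Example: a minimal sketch that uploads a local file; the paths, bucket and dag object are placeholders:
from airflow.contrib.operators.file_to_gcs import FileToGoogleCloudStorageOperator

upload_file = FileToGoogleCloudStorageOperator(
    task_id='upload_report',
    src='/tmp/report-{{ds}}.csv',
    dst='reports/report-{{ds}}.csv',
    bucket='my-data-bucket',
    mime_type='text/csv',
    dag=dag)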
GoogleCloudStorageBucketCreateAclEntryOperator
class airflow.contrib.operators.gcs_acl_operator.GoogleCloudStorageBucketCreateAclEntryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new ACL entry on the specified bucket.
Parameters
• bucket (str) – Name of a bucket.
• entity (str) – The entity holding the permission, in one of the following forms: user-
userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId,
allUsers, allAuthenticatedUsers
• role (str) – The access permission for the entity. Acceptable values are: “OWNER”,
“READER”, “WRITER”.
• user_project (str) – (Optional) The project to be billed for this request. Required for
Requester Pays buckets.
• google_cloud_storage_conn_id (str) – The connection ID to use when connect-
ing to Google Cloud Storage.
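Example: a minimal sketch that grants read access on a bucket to a single user; the bucket, e-mail address and dag object are placeholders:
from airflow.contrib.operators.gcs_acl_operator import GoogleCloudStorageBucketCreateAclEntryOperator

grant_bucket_reader = GoogleCloudStorageBucketCreateAclEntryOperator(
    task_id='grant_bucket_reader',
    bucket='my-data-bucket',
    entity='user-analyst@example.com',
    role='READER',
    dag=dag)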
GoogleCloudStorageCreateBucketOperator
class airflow.contrib.operators.gcs_operator.GoogleCloudStorageCreateBucketOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name
that is already in use.
See also:
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/
bucketnaming.html#requirements
Parameters
• bucket_name (string) – The name of the bucket. (templated)
• storage_class (string) – This defines how objects in the bucket are stored and de-
termines the SLA and the cost of storage (templated). Values include
– MULTI_REGIONAL
– REGIONAL
– STANDARD
– NEARLINE
– COLDLINE.
If this value is not specified when the bucket is created, it will default to STANDARD.
• location (string) – The location of the bucket. (templated) Object data for objects in
the bucket resides in physical storage within this region. Defaults to US.
See also:
https://developers.google.com/storage/docs/bucket-locations
• project_id (string) – The ID of the GCP Project. (templated)
Example: The following Operator would create a new bucket test-bucket with MULTI_REGIONAL stor-
age class in EU region
CreateBucket = GoogleCloudStorageCreateBucketOperator(
    task_id='CreateNewBucket',
    bucket_name='test-bucket',
    storage_class='MULTI_REGIONAL',
    location='EU',
    labels={'env': 'dev', 'team': 'airflow'},
    google_cloud_storage_conn_id='airflow-service-account'
)
GoogleCloudStorageDownloadOperator
class airflow.contrib.operators.gcs_download_operator.GoogleCloudStorageDownloadOperator(**kwargs)
Bases: airflow.models.BaseOperator
Downloads a file from Google Cloud Storage.
Parameters
• bucket (string) – The Google cloud storage bucket where the object is. (templated)
• object (string) – The name of the object to download in the Google cloud storage
bucket. (templated)
• filename (string) – The file path on the local file system (where the operator is being
executed) that the file should be downloaded to. (templated) If no filename passed, the
downloaded data will not be stored on the local file system.
• store_to_xcom_key (string) – If this param is set, the operator will push the con-
tents of the downloaded file to XCom with the key set in this parameter. If not set, the
downloaded data will not be pushed to XCom. (templated)
• google_cloud_storage_conn_id (string) – The connection ID to use when con-
necting to Google cloud storage.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
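Example: a minimal sketch that downloads an object to the local filesystem of the worker; the bucket, object and target path are placeholders:
from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator

download_file = GoogleCloudStorageDownloadOperator(
    task_id='download_file',
    bucket='data',
    object='sales/sales-2017/january.avro',
    filename='/tmp/january.avro',
    dag=dag)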
GoogleCloudStorageListOperator
class airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator(**kwargs)
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix and delimiter in name.
This operator returns a python list with the name of objects which can be used by xcom in the down-
stream task.
Parameters
• bucket (string) – The Google cloud storage bucket to find the objects. (templated)
• prefix (string) – Prefix string which filters objects whose name begin with this prefix.
(templated)
• delimiter (string) – The delimiter by which you want to filter the objects. (templated) For example, to list the CSV files in a directory in GCS you would use delimiter='.csv'.
• google_cloud_storage_conn_id (string) – The connection ID to use when con-
necting to Google cloud storage.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
Example: The following Operator would list all the Avro files from sales/sales-2017 folder in data
bucket.
GCS_Files = GoogleCloudStorageListOperator(
    task_id='GCS_Files',
    bucket='data',
    prefix='sales/sales-2017/',
    delimiter='.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
GoogleCloudStorageObjectCreateAclEntryOperator
class airflow.contrib.operators.gcs_acl_operator.GoogleCloudStorageObjectCreateAclEntryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Creates a new ACL entry on the specified object.
Parameters
• bucket (str) – Name of a bucket.
• object_name (str) – Name of the object. For information about how to URL encode ob-
ject names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding
• entity (str) – The entity holding the permission, in one of the following forms: user-
userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId,
allUsers, allAuthenticatedUsers
• role (str) – The access permission for the entity. Acceptable values are: “OWNER”,
“READER”.
• generation (str) – (Optional) If present, selects a specific revision of this object (as
opposed to the latest version, the default).
• user_project (str) – (Optional) The project to be billed for this request. Required for
Requester Pays buckets.
• google_cloud_storage_conn_id (str) – The connection ID to use when connect-
ing to Google Cloud Storage.
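Example: a minimal sketch that grants read access on a single object; the bucket, object name, e-mail address and dag object are placeholders:
from airflow.contrib.operators.gcs_acl_operator import GoogleCloudStorageObjectCreateAclEntryOperator

grant_object_reader = GoogleCloudStorageObjectCreateAclEntryOperator(
    task_id='grant_object_reader',
    bucket='my-data-bucket',
    object_name='reports/report-2018-01-01.csv',
    entity='user-analyst@example.com',
    role='READER',
    dag=dag)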
GoogleCloudStorageToBigQueryOperator
class airflow.contrib.operators.gcs_to_bq.GoogleCloudStorageToBigQueryOperator(**kwargs)
Bases: airflow.models.BaseOperator
Loads files from Google cloud storage into BigQuery.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly
pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in
Google cloud storage must be a JSON file with the schema fields in it.
Parameters
• bucket (string) – The bucket to load from. (templated)
• source_objects (list of str) – List of Google cloud storage URIs to load from.
(templated) If source_format is ‘DATASTORE_BACKUP’, the list must only contain a sin-
gle URI.
• destination_project_dataset_table (string) – The dotted
(<project>.)<dataset>.<table> BigQuery table to load data into. If <project> is not
included, project will be the project defined in the connection json. (templated)
• schema_fields (list) – If set, the schema field list as defined here: https://cloud.
google.com/bigquery/docs/reference/v2/jobs#configuration.load Should not be set when
source_format is ‘DATASTORE_BACKUP’.
• schema_object (string) – If set, a GCS object path pointing to a .json file that con-
tains the schema for the table. (templated)
• source_format (string) – File format to export.
• compression (string) – [Optional] The compression type of the data source. Possible
values include GZIP and NONE. The default value is NONE. This setting is ignored for
Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
• create_disposition (string) – The create disposition if the table doesn’t exist.
• skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
• write_disposition (string) – The write disposition if the table already exists.
• field_delimiter (string) – The delimiter to use when loading from a CSV.
• max_bad_records (int) – The maximum number of bad records that BigQuery can
ignore when running the job.
• quote_character (string) – The value that is used to quote data sections in a CSV
file.
• ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow
extra values that are not represented in the table schema. If true, the extra values are ignored.
If false, records with extra columns are treated as bad records, and if there are too many bad
records, an invalid error is returned in the job result.
• allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not
(false).
• allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns.
The missing values are treated as nulls. If false, records with missing trailing columns are
treated as bad records, and if there are too many bad records, an invalid error is returned in
the job result. Only applicable to CSV, ignored for other formats.
• max_id_key (string) – If set, the name of a column in the BigQuery table that’s to be
loaded. This will be used to select the MAX value from BigQuery after the load occurs. The
results will be returned by the execute() command, which in turn gets stored in XCom for
future operators to use. This can be helpful with incremental loads–during future executions,
you can pick up from the max ID.
• bigquery_conn_id (string) – Reference to a specific BigQuery hook.
• google_cloud_storage_conn_id (string) – Reference to a specific Google
cloud storage hook.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
• schema_update_options (list) – Allows the schema of the destination table to be
updated as a side effect of the load job.
• src_fmt_configs (dict) – configure optional fields specific to the source format
• external_table (bool) – Flag to specify if the destination table should be a BigQuery
external table. Default Value is False.
• time_partitioning (dict) – configure optional time partitioning fields i.e. partition
by field, type and expiration as per API specifications. Note that ‘field’ is not available in
concurrency with dataset.table$partition.
• cluster_fields (list of str) – Request that the result of this load be stored sorted
by one or more columns. This is only available in conjunction with time_partitioning. The
order of columns given determines the sort order. Not applicable for external tables.
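Example: a minimal sketch that loads CSV files into a BigQuery table with an inline schema; the bucket, object pattern, table name and dag object are placeholders:
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

load_csv = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq',
    bucket='data',
    source_objects=['sales/sales-2017/*.csv'],
    destination_project_dataset_table='my_dataset.sales_2017',
    schema_fields=[
        {'name': 'sale_date', 'type': 'DATE', 'mode': 'NULLABLE'},
        {'name': 'amount', 'type': 'FLOAT', 'mode': 'NULLABLE'},
    ],
    source_format='CSV',
    skip_leading_rows=1,
    write_disposition='WRITE_TRUNCATE',
    dag=dag)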
GoogleCloudStorageToGoogleCloudStorageOperator
class airflow.contrib.operators.gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another, with renaming if requested.
Parameters
• source_bucket (string) – The source Google cloud storage bucket where the object
is. (templated)
• source_object (string) – The source name of the object to copy in the Google cloud
storage bucket. (templated) If wildcards are used in this argument:
You can use only one wildcard for objects (filenames) within your bucket. The wildcard
can appear inside the object name or at the end of the object name. Appending a
wildcard to the bucket name is unsupported.
• destination_bucket (string) – The destination Google cloud storage bucket where
the object should be. (templated)
• destination_object (string) – The destination name of the object in the destina-
tion Google cloud storage bucket. (templated) If a wildcard is supplied in the source_object
argument, this is the prefix that will be prepended to the final destination objects’ paths.
Note that the source path’s part before the wildcard will be removed; if it needs to be re-
tained it should be appended to destination_object. For example, with prefix foo/* and
destination_object blah/, the file foo/baz will be copied to blah/baz; to retain the
prefix write the destination_object as e.g. blah/foo, in which case the copied file will be
named blah/foo/baz.
• move_object (bool) – When move object is True, the object is moved instead of copied
to the new location. This is the equivalent of a mv command as opposed to a cp command.
• google_cloud_storage_conn_id (string) – The connection ID to use when con-
necting to Google cloud storage.
• delegate_to (string) – The account to impersonate, if any. For this to work, the
service account making the request must have domain-wide delegation enabled.
Examples: The following Operator would copy a single file named sales/sales-2017/january.avro in the data bucket to the file named copied_sales/2017/january-backup.avro in the data_backup bucket.
copy_single_file = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_object='sales/sales-2017/january.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/january-backup.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would copy all the Avro files from sales/sales-2017 folder (i.e. with names
starting with that prefix) in data bucket to the copied_sales/2017 folder in the data_backup
bucket.
copy_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would move all the Avro files from sales/sales-2017 folder (i.e. with names
starting with that prefix) in data bucket to the same folder in the data_backup bucket, deleting the
original files in the process.
move_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='move_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    move_object=True,
    google_cloud_storage_conn_id=google_cloud_conn_id
)
GoogleCloudStorageToGoogleCloudStorageTransferOperator
class airflow.contrib.operators.gcs_to_gcs_transfer_operator.GoogleCloudStorageToGoogleCloudStorageTransferOperator(**kwargs)
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another using the GCP Storage Transfer Service.
Parameters
• source_bucket (str) – The source Google cloud storage bucket where the object is.
(templated)
• destination_bucket (str) – The destination Google cloud storage bucket where the
object should be. (templated)
• project_id (str) – The ID of the Google Cloud Platform Console project that owns the
job
• gcp_conn_id (str) – Optional connection ID to use when connecting to Google Cloud
Storage.
• delegate_to (str) – The account to impersonate, if any. For this to work, the service
account making the request must have domain-wide delegation enabled.
• description (str) – Optional transfer service job description
• schedule (dict) – Optional transfer service schedule; see https://cloud.google.com/
storage-transfer/docs/reference/rest/v1/transferJobs. If not set, run transfer job once as soon
as the operator runs
• object_conditions (dict) – Optional transfer service object conditions; see https://
cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#ObjectConditions
• transfer_options (dict) – Optional transfer service transfer options; see https://
cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#TransferOptions
• wait (bool) – Wait for transfer to finish; defaults to True
Example:
gcs_to_gcs_transfer_op = GoogleCloudStorageToGoogleCloudStorageTransferOperator(
    task_id='gcs_to_gcs_transfer_example',
    source_bucket='my-source-bucket',
    destination_bucket='my-destination-bucket',
    project_id='my-gcp-project',
    dag=my_dag)
MySqlToGoogleCloudStorageOperator
GoogleCloudStorageHook
class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Storage. This hook uses the Google Cloud Platform connection.
copy(source_bucket, source_object, destination_bucket=None, destination_object=None)
Copies an object from a bucket to another, with renaming if requested.
destination_bucket or destination_object can be omitted, in which case source bucket/object is used, but
not both.
Parameters
• source_bucket (string) – The bucket of the object to copy from.
• source_object (string) – The object to copy.
• destination_bucket (string) – The destination bucket the object should be copied to. Can be omitted; then the same bucket is used.
create_bucket(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)
Creates a new bucket.
Parameters
• bucket_name (string) – The name of the bucket.
• storage_class (string) – This defines how objects in the bucket are stored and
determines the SLA and the cost of storage. Values include
– MULTI_REGIONAL
– REGIONAL
– STANDARD
– NEARLINE
– COLDLINE.
If this value is not specified when the bucket is created, it will default to STANDARD.
• location (string) – The location of the bucket. Object data for objects in the bucket
resides in physical storage within this region. Defaults to US.
See also:
https://developers.google.com/storage/docs/bucket-locations
• project_id (string) – The ID of the GCP Project.
• labels (dict) – User-provided labels, in key/value pairs.
Returns If successful, it returns the id of the bucket.
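A minimal sketch of using the hook directly, for example from a PythonOperator callable; bucket and object names are placeholders. Omitting destination_bucket copies the object within the same bucket:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

def archive_report():
    hook = GoogleCloudStorageHook()
    # copy within the same bucket by omitting destination_bucket
    hook.copy(
        source_bucket='my-data-bucket',
        source_object='reports/report.csv',
        destination_object='archive/report.csv')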