Luigi Readthedocs Io en Stable
Release 2.8.13
1 Background
2 Visualiser page
4 Philosophy
6 External links
7 Authors
8 Table of Contents
8.1 Example – Top Artists
8.2 Building workflows
8.3 Tasks
8.4 Parameters
8.5 Running Luigi
8.6 Using the Central Scheduler
8.7 Execution Model
8.8 Luigi Patterns
8.9 Configuration
8.10 Configure logging
8.11 Design and limitations
9 API Reference
9.1 luigi package
9.2 Indices and tables
Index
Luigi Documentation, Release 2.8.13
Luigi is a Python (2.7, 3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs. It handles
dependency resolution, workflow management, visualization, handling failures, command line integration, and much
more.
Run pip install luigi to install the latest stable version from PyPI. Documentation for the latest release is
hosted on readthedocs.
Run pip install luigi[toml] to install Luigi with TOML-based configs support.
For the bleeding edge code, pip install git+https://github.com/spotify/luigi.git. Bleeding
edge documentation is also available.
CHAPTER 1
Background
The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want
to chain many tasks and automate them, and failures will happen. These tasks can be anything, but are typically
long-running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything
else.
There are other software packages that focus on lower level aspects of data processing, like Hive, Pig, or Cascading.
Luigi is not a framework to replace these. Instead it helps you stitch many tasks together, where each task can be a
Hive query, a Hadoop job in Java, a Spark job in Scala or Python, a Python snippet, dumping a table from a database,
or anything else. It’s easy to build up long-running pipelines that comprise thousands of tasks and take days or weeks
to complete. Luigi takes care of a lot of the workflow management so that you can focus on the tasks themselves and
their dependencies.
You can build pretty much any task you want, but Luigi also comes with a toolbox of several common task templates
that you can use. It includes support for running Python mapreduce jobs in Hadoop, as well as Hive and Pig jobs. It also
comes with file system abstractions for HDFS and local files that ensure all file system operations are atomic. This
is important because it means your data pipeline will not crash in a state containing partial data.
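To make the atomicity claim concrete, here is a minimal sketch of the usual write-to-temporary-then-rename technique, using only the standard library. It is not Luigi's implementation; the helper name atomic_write is hypothetical, and the sketch assumes the temporary file and the destination are on the same file system.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the destination so that the final
    # rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
        # Move the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, drop the temporary file and leave path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies before os.replace is reached, the destination path simply does not exist, so a later run sees the work as not done rather than as half done.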
CHAPTER 2
Visualiser page
The Luigi server comes with a web interface too, so you can search and filter among all your tasks.
Just to give you an idea of what Luigi does, this is a screen shot from something we are running in production. Using
Luigi’s visualiser, we get a nice visual overview of the dependency graph of the workflow. Each node represents a task
which has to be run. Green tasks are already completed whereas yellow tasks are yet to be run. Most of these tasks
are Hadoop jobs, but there are also some things that run locally and build up data files.
Philosophy
Conceptually, Luigi is similar to GNU Make where you have certain tasks and these tasks in turn may have
dependencies on other tasks. There are also some similarities to Oozie and Azkaban. One major difference is that Luigi
is not just built specifically for Hadoop, and it’s easy to extend it with other kinds of tasks.
Everything in Luigi is in Python. Instead of XML configuration or similar external data files, the dependency graph is
specified within Python. This makes it easy to build up complex dependency graphs of tasks, where the dependencies
can involve date algebra or recursive references to other versions of the same task. However, the workflow can trigger
things not in Python, such as running Pig scripts or scp’ing files.
CHAPTER 5
We use Luigi internally at Spotify to run thousands of tasks every day, organized in complex dependency graphs. Most
of these tasks are Hadoop jobs. Luigi provides an infrastructure that powers all kinds of stuff including
recommendations, toplists, A/B test analysis, external reports, internal dashboards, etc.
Since Luigi is open source and without any registration walls, the exact number of Luigi users is unknown. But based
on the number of unique contributors, we expect hundreds of enterprises to use it. Some users have written blog posts
or held presentations about Luigi:
• Spotify (presentation, 2014)
• Foursquare (presentation, 2013)
• Mortar Data (Datadog) (documentation / tutorial)
• Stripe (presentation, 2014)
• Asana (blog, 2014)
• Buffer (blog, 2014)
• SeatGeek (blog, 2015)
• Treasure Data (blog, 2015)
• Growth Intelligence (presentation, 2015)
• AdRoll (blog, 2015)
• 17zuoye (presentation, 2015)
• Custobar (presentation, 2016)
• Blendle (presentation)
• TrustYou (presentation, 2015)
• Groupon / OrderUp (alternative implementation)
• Red Hat - Marketing Operations (blog, 2017)
• GetNinjas (blog, 2017)
• Skyscanner
• Jodel
• Mekar
• M3
We’re more than happy to have your company added here. Just send a PR on GitHub.
External links
Authors
Luigi was built at Spotify, mainly by Erik Bernhardsson and Elias Freider. Many other people have contributed since
open sourcing in late 2012. Arash Rouhani is currently the chief maintainer of Luigi.
CHAPTER 8
Table of Contents
This is a very simplified case of something we do at Spotify a lot. All user actions are logged to Google Cloud Storage
(previously HDFS) where we run a bunch of processing jobs to transform the data. The processing code itself is
implemented in a scalable data processing framework, such as Scio, Scalding, or Spark, but the jobs are orchestrated
with Luigi. At some point we might end up with a smaller data set that we can bulk ingest into Cassandra, Postgres,
or other storage suitable for serving or exploration.
For the purpose of this exercise, we want to aggregate all streams, find the top 10 artists and then put the results into
Postgres.
This example is also available in examples/top_artists.py.
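class AggregateArtists(luigi.Task):
    date_interval = luigi.DateIntervalParameter()

    def output(self):
        return luigi.LocalTarget("data/artist_streams_%s.tsv" % self.date_interval)

    def requires(self):
        return [Streams(date) for date in self.date_interval]

    def run(self):
        artist_count = defaultdict(int)

        # Count one stream per input line; each line holds a timestamp, an artist and a track.
        for target in self.input():
            with target.open('r') as in_file:
                for line in in_file:
                    _, artist, track = line.strip().split()
                    artist_count[artist] += 1

        with self.output().open('w') as out_file:
            for artist, count in artist_count.items():
                out_file.write('{}\t{}\n'.format(artist, count))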
Note that this is just a portion of the file examples/top_artists.py. In particular, Streams is defined as
a Task, acting as a dependency for AggregateArtists. In addition, luigi.run() is called if the script is
executed directly, allowing it to be run from the command line.
There are several pieces of this snippet that deserve more explanation.
• Any Task may be customized by instantiating one or more Parameter objects on the class level.
• The output() method tells Luigi where the result of running the task will end up. The path can be some
function of the parameters.
• The requires() method specifies other tasks that we need to perform this task. In this case it’s an external
dump named Streams which takes the date as the argument.
• For plain Tasks, the run() method implements the task. This could be anything, including calling subprocesses,
performing long running number crunching, etc. For some subclasses of Task you don’t have to implement the
run method. For instance, for the JobTask subclass you implement a mapper and reducer instead.
• LocalTarget is a built in class that makes it easy to read/write from/to the local filesystem. It also makes all
file operations atomic, which is nice in case your script crashes for any reason.
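As a rough illustration of what requires() expands to for a one-month interval, the sketch below enumerates the days of a month with the standard library. It assumes the date interval iterates day by day; the helper dates_in_month is hypothetical and stands in for iterating self.date_interval.

```python
import datetime


def dates_in_month(year, month):
    """Return every date in the given month, in order."""
    day = datetime.date(year, month, 1)
    dates = []
    while day.month == month:
        dates.append(day)
        day += datetime.timedelta(days=1)
    return dates


# One dependency per day of June 2012, mirroring
# [Streams(date) for date in self.date_interval].
june_2012 = dates_in_month(2012, 6)
```

Under that assumption, an interval such as 2012-06 makes AggregateArtists depend on thirty Streams tasks, one per day.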
$ cd examples
$ luigi --module top_artists AggregateArtists --local-scheduler --date-interval 2012-06
Note that top_artists needs to be in your PYTHONPATH, or else this can produce an error (ImportError: No module
named top_artists). Add the current working directory to PYTHONPATH with:
$ PYTHONPATH='.' luigi --module top_artists AggregateArtists --local-scheduler --date-interval 2012-06
You can also try to view the manual using --help which will give you an overview of the options.
Running the command again will do nothing because the output file is already created. In that sense, any task in Luigi
is idempotent because running it many times gives the same outcome as running it once. Note that unlike Makefile,
the output will not be recreated when any of the input files is modified. You need to delete the output file manually.
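The skip-if-output-exists behaviour can be sketched in a few lines. This is only an illustration of the rule, not Luigi's scheduler; run_if_missing and its build callback are hypothetical names.

```python
import os


def run_if_missing(output_path, build):
    """Call build(output_path) only when the output does not exist yet.

    Returns True if the work ran and False if it was skipped.
    """
    if os.path.exists(output_path):
        return False
    build(output_path)
    return True
```

Note that the check looks only at whether the output exists. Nothing compares modification times of inputs and outputs, which is why a changed input does not trigger a rebuild.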
The --local-scheduler flag tells Luigi not to connect to a scheduler server. This is not recommended for
anything other than testing.
While Luigi can process data inline, it is normally used to orchestrate external programs that perform the actual
processing. In this example, we will demonstrate how top artists instead can be read from HDFS and calculated with
Spark, orchestrated by Luigi.
class AggregateArtistsSpark(luigi.contrib.spark.SparkSubmitTask):
    date_interval = luigi.DateIntervalParameter()

    app = 'top_artists_spark.py'
    master = 'local[*]'

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("data/artist_streams_%s.tsv" % self.date_interval)

    def requires(self):
        return [StreamsHdfs(date) for date in self.date_interval]

    def app_options(self):
        # :func:`~luigi.task.Task.input` returns the targets produced by the tasks in
        # `~luigi.task.Task.requires`.
        return [','.join([p.path for p in self.input()]),
                self.output().path]
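
# top_artists_spark.py, the application submitted by AggregateArtistsSpark
import sys
from pyspark.sql import SparkSession


def main(argv):
    input_paths = argv[1].split(',')
    output_path = argv[2]

    spark = SparkSession.builder.getOrCreate()

    # Assumed reconstruction of the elided step: read all tab-separated inputs
    # and count rows per artist, taking the second column as the artist.
    streams = spark.read.option('sep', '\t').csv(input_paths)
    counts = streams.groupBy(streams.columns[1]).count()

    counts.write.option('sep', '\t').csv(output_path)


if __name__ == '__main__':
    sys.exit(main(sys.argv))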
In a typical deployment scenario, the Luigi orchestration definition above as well as the Pyspark processing code
would be packaged into a deployment package, such as a container image. The processing code does not have to be
implemented in Python; any program can be packaged in the image and run from Luigi.
At this point, we’ve counted the number of streams for each artist, for the full time period. We are left with a large
file that contains mappings of artist -> count data, and we want to find the top 10 artists. Since we only have a few
hundred thousand artists, and calculating artists is nontrivial to parallelize, we choose to do this not as a Hadoop job,
but just as a plain old for-loop in Python.
class Top10Artists(luigi.Task):
    date_interval = luigi.DateIntervalParameter()
    use_hadoop = luigi.BoolParameter()

    def requires(self):
        if self.use_hadoop:
            return AggregateArtistsSpark(self.date_interval)
        else:
            return AggregateArtists(self.date_interval)

    def output(self):
        return luigi.LocalTarget("data/top_artists_%s.tsv" % self.date_interval)

    def run(self):
        # nlargest comes from the standard library: from heapq import nlargest
        top_10 = nlargest(10, self._input_iterator())
        with self.output().open('w') as out_file:
            for streams, artist in top_10:
                # Written with write() so the example runs on both Python 2 and 3.
                out_file.write('{}\t{}\t{}\t{}\n'.format(
                    self.date_interval.date_a, self.date_interval.date_b, artist, streams))

    def _input_iterator(self):
        with self.input().open('r') as in_file:
            for line in in_file:
                artist, streams = line.strip().split()
                yield int(streams), int(artist)
The most interesting thing here is that this task (Top10Artists) defines a dependency on the previous task
(AggregateArtists). This means that if the output of AggregateArtists does not exist, that task will run before Top10Artists.
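The ranking step in run() relies on heapq.nlargest over (streams, artist) pairs. The standalone sketch below shows how that selection behaves; the sample pairs are made up for illustration.

```python
from heapq import nlargest

# Hypothetical (streams, artist_id) pairs, in the shape _input_iterator yields.
counts = [(120, 7), (300, 3), (45, 9), (300, 1), (80, 4)]

# Tuples compare element by element, so pairs are ranked by stream count
# first and by artist id second.
top_3 = nlargest(3, counts)
```

Because nlargest accepts any iterable, the generator returned by _input_iterator can be passed in directly without first building a list of all artists.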
This mainly serves as an example of a specific Task subclass that doesn’t require any code to be written. It’s also an
example of how you can define task templates that you can reuse for a lot of different tasks.
class ArtistToplistToDatabase(luigi.contrib.postgres.CopyToTable):
    date_interval = luigi.DateIntervalParameter()
    use_hadoop = luigi.BoolParameter()

    host = "localhost"
    database = "toplists"
    user = "luigi"
    password = "abc123"  # ;)
    table = "top10"

    def requires(self):
        return Top10Artists(self.date_interval, self.use_hadoop)
Just like previously, this defines a recursive dependency on the previous task. If you try to build the task, that will also
trigger building all its upstream dependencies.
The --local-scheduler flag tells Luigi not to connect to a central scheduler. This is recommended in order to
get started and/or for development purposes. At the point where you start putting things in production we strongly
recommend running the central scheduler server. In addition to providing locking so that the same task is not run by
multiple processes at the same time, this server also provides a pretty nice visualization of your current workflow.
If you drop the --local-scheduler flag, your script will try to connect to the central planner, by default at
localhost port 8082. If you run
$ luigid
in the background and then run your task without the --local-scheduler flag, then your script will now schedule
through a centralized server. You need Tornado for this to work.
Launching http://localhost:8082 should show something like this:
[Figure: Web server screenshot]
Looking at the dependency graph for any of the tasks yields something like this:
[Figure: Aggregate artists screenshot]
In production, you’ll want to run the centralized scheduler. See: Using the Central Scheduler for more information.
There are two fundamental building blocks of Luigi - the Task class and the Target class. Both are abstract classes
and expect a few methods to be implemented. In addition to those two concepts, the Parameter class is an important
concept that governs how a Task is run.
8.2.1 Target
The Target class corresponds to a file on a disk, a file on HDFS or some kind of a checkpoint, like an entry in a
database. Actually, the only method that Targets have to implement is the exists method which returns True if and only
if the Target exists.
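To show how small the Target contract is, here is a sketch written against the standard library only. SketchTarget and FileSketchTarget are hypothetical stand-ins for illustration, not Luigi classes.

```python
import os
from abc import ABC, abstractmethod


class SketchTarget(ABC):
    """Stand-in for the Target concept: only exists() is required."""

    @abstractmethod
    def exists(self):
        """Return True if and only if the target exists."""


class FileSketchTarget(SketchTarget):
    """A target backed by a path on the local file system."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)
```

A checkpoint in a database would follow the same shape, with exists() running a lookup instead of a file system check.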
In practice, implementing Target subclasses is rarely needed. Luigi comes with a toolbox of several useful Targets. In
particular, LocalTarget and HdfsTarget, but there is also support for other file systems:
luigi.contrib.s3.S3Target, luigi.contrib.ssh.RemoteTarget, luigi.contrib.ftp.RemoteTarget,
luigi.contrib.mysqldb.MySqlTarget, luigi.contrib.redshift.RedshiftTarget, and
several more.
Most of these targets are file system-like. For instance, LocalTarget and HdfsTarget map to a file on the
local drive or a file in HDFS. In addition, these also wrap the underlying operations to make them atomic. They both
implement the open() method, which returns a stream object that can be read from (mode='r') or written to
(mode='w').
Luigi comes with Gzip support by providing format=format.Gzip. Adding support for other formats is pretty
simple.
8.2.2 Task
The Task class is a bit more conceptually interesting because this is where computation is done. There are a few
methods that can be implemented to alter its behavior, most notably run(), output() and requires().
Tasks consume Targets that were created by some other task. They usually also output targets:
You can define dependencies between Tasks using the requires() method. See Tasks for more info.
Each task defines its outputs using the output() method. Additionally, there is a helper method input() that
returns the corresponding Target classes for each Task dependency.
8.2.3 Parameter
The Task class corresponds to some type of job that is run, but in general you want to allow some form of parameter-
ization of it. For instance, if your Task class runs a Hadoop job to create a report every night, you probably want to
make the date a parameter of the class. See Parameters for more info.
8.2.4 Dependencies
Using tasks, targets, and parameters, Luigi lets you express arbitrary dependencies in code, rather than using some
kind of awkward config DSL. This is really useful because in the real world, dependencies are often very messy. For
instance, some examples of the dependencies you might encounter:
(These diagrams are from a Luigi presentation in late 2014 at NYC Data Science meetup)
8.3 Tasks
Tasks are where the execution takes place. Tasks depend on each other and output targets.
An outline of what a task can look like:
8.3.1 Task.requires
The requires() method is used to specify dependencies on other Task objects, which might even be of the same
class. For instance, an example implementation could be
def requires(self):
    return OtherTask(self.date), DailyReport(self.date - datetime.timedelta(1))
In this case, the DailyReport task depends on two inputs created earlier, one of which is the same class. requires can
return other Tasks wrapped up in any way within dicts/lists/tuples/etc.
Note that requires() can not return a Target object. If you have a simple Target object that is created externally
you can wrap it in a Task class like this:
class LogFiles(luigi.ExternalTask):
    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/log')
The wrapper can also take parameters:
class LogFiles(luigi.ExternalTask):
    date = luigi.DateParameter()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(self.date.strftime('/log/%Y-%m-%d'))
8.3.3 Task.output
The output() method returns one or more Target objects. Similarly to requires, you can return them wrapped
up in any way that’s convenient for you. However, we recommend that any Task return only a single Target in
output. If multiple outputs are returned, atomicity will be lost unless the Task itself can ensure that each Target is
atomically created. (If atomicity is not of concern, then it is safe to return multiple Target objects.)
class DailyReport(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(self.date.strftime('/reports/%Y-%m-%d'))
    # ...
8.3.4 Task.run
The run() method contains the actual code that is run. When you are using Task.requires and Task.run, Luigi
breaks down everything into two stages. First it figures out all dependencies between tasks, then it runs everything. The
input() method is an internal helper method that just replaces all Task objects in requires with their corresponding
output. An example:
class GenerateWords(luigi.Task):

    def output(self):
        return luigi.LocalTarget('words.txt')

    def run(self):
        # write a dummy list of words to output file
        words = ['apple', 'banana', 'grapefruit']

        with self.output().open('w') as f:
            for word in words:
                f.write('{word}\n'.format(word=word))


class CountLetters(luigi.Task):

    def requires(self):
        return GenerateWords()

    def output(self):
        return luigi.LocalTarget('letter_counts.txt')

    def run(self):
        # read in file as list
        with self.input().open('r') as infile:
            words = infile.read().splitlines()

        # write each word to output file with its corresponding letter count
        with self.output().open('w') as outfile:
            for word in words:
                outfile.write(
                    '{word} | {letter_count}\n'.format(
                        word=word,
                        letter_count=len(word)
                    )
                )
It’s useful to note that if you’re writing to a binary file, Luigi automatically strips the 'b' flag due to how atomic
writes/reads work. In order to write a binary file, such as a pickle file, you should instead use format=Nop when
calling LocalTarget. Following the above example:
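from luigi.format import Nop

class GenerateWords(luigi.Task):

    def output(self):
        return luigi.LocalTarget('words.pckl', format=Nop)

    def run(self):
        import pickle

        # write a dummy list of words to output file
        words = ['apple', 'banana', 'grapefruit']

        with self.output().open('w') as f:
            pickle.dump(words, f)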
8.3.5 Task.input
As seen in the example above, input() is a wrapper around Task.requires that returns the corresponding Target
objects instead of Task objects. Anything returned by Task.requires will be transformed, including lists, nested dicts,
etc. This can be useful if you have many dependencies:
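class TaskWithManyInputs(luigi.Task):
    def requires(self):
        return {'a': TaskA(), 'b': [TaskB(i) for i in range(100)]}

    def run(self):
        # self.input() mirrors the structure returned by requires()
        f = self.input()['a'].open('r')
        g = [y.open('r') for y in self.input()['b']]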
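The structure-preserving transformation that input() performs can be pictured with a small recursive helper. This is a sketch of the idea under stated assumptions, not Luigi's code; map_structure and SketchTask are hypothetical names, and the output here is just a file name string rather than a Target.

```python
def map_structure(func, struct):
    """Apply func to every leaf of nested dicts, lists and tuples."""
    if isinstance(struct, dict):
        return {key: map_structure(func, value) for key, value in struct.items()}
    if isinstance(struct, (list, tuple)):
        return type(struct)(map_structure(func, item) for item in struct)
    return func(struct)


class SketchTask:
    """Hypothetical task whose output is just a file name."""

    def __init__(self, name):
        self.name = name

    def output(self):
        return self.name + ".tsv"


requires = {"a": SketchTask("a"), "b": [SketchTask("b%d" % i) for i in range(3)]}
# Replace every task by its output while keeping the dict/list layout.
inputs = map_structure(lambda task: task.output(), requires)
```

The result has the same keys and list positions as the requires structure, so code in run() can address inputs the same way it declared them.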
Sometimes you might not know exactly what other tasks to depend on until runtime. In that case, Luigi provides a
mechanism to specify dynamic dependencies. If you yield another Task in the Task.run method, the current task will
be suspended and the other task will be run. You can also yield a list of tasks.
class MyTask(luigi.Task):
    def run(self):
        other_target = yield OtherTask()
This mechanism is an alternative to Task.requires in case you are not able to build up the full dependency graph before
running the task. It does come with some constraints: the Task.run method will resume from scratch each time a new
task is yielded. In other words, you should make sure your Task.run method is idempotent. (This is good practice for
all Tasks in Luigi, but especially so for tasks with dynamic dependencies).
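The restart-from-scratch constraint is easier to see with a toy driver. The sketch below is an assumption-laden model of the behaviour described above, not Luigi's worker: run_with_dynamic_deps, is_complete and run_dependency are hypothetical, and dependencies are plain strings.

```python
def run_with_dynamic_deps(make_run, is_complete, run_dependency):
    """Drive a generator-based run() and count how often it restarts.

    make_run() returns a fresh generator; each yielded item names a dependency.
    If a yielded dependency is not complete, it is run and the generator is
    started again from the top.
    """
    restarts = 0
    while True:
        generator = make_run()
        try:
            dependency = next(generator)
            while is_complete(dependency):
                # Already satisfied: resume immediately up to the next yield.
                dependency = generator.send(dependency)
        except StopIteration:
            return restarts
        # Unmet dependency: build it, then start run() over.
        run_dependency(dependency)
        restarts += 1


done = set()
log = []


def my_run():
    log.append("start")
    yield "dep_a"
    yield "dep_b"
    log.append("finished")


restarts = run_with_dynamic_deps(my_run, done.__contains__, done.add)
```

In this model the body before the first yield executes three times for two dependencies, which is why any side effect placed there needs to be safe to repeat.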
For an example of a workflow using dynamic dependencies, see examples/dynamic_requirements.py.
For long-running or remote tasks it is convenient to see extended status information not only on the command line or
in your logs but also in the GUI of the central scheduler. Luigi implements dynamic status messages, progress bar and
tracking urls which may point to an external monitoring system. You can set this information using callbacks within
Task.run:
class MyTask(luigi.Task):
    def run(self):
        # set a tracking url
        self.set_tracking_url("http://...")
Luigi has a built-in event system that allows you to register callbacks to events and trigger them from your own tasks.
You can both hook into some pre-defined events and create your own. Each event handle is tied to a Task class and
will be triggered only from that class or a subclass of it. This allows you to effortlessly subscribe to events only from
a specific class (e.g. for hadoop jobs).
@luigi.Task.event_handler(luigi.Event.SUCCESS)
def celebrate_success(task):
    """Will be called directly after a successful execution
       of `run` on any Task subclass (i.e. all luigi Tasks)
    """
    ...

@luigi.contrib.hadoop.JobTask.event_handler(luigi.Event.FAILURE)
def mourn_failure(task, exception):
    """Will be called directly after a failed execution
       of `run` on any JobTask subclass
    """
    ...

luigi.run()
The Hadoop code is integrated in the rest of the Luigi code because we really believe almost all Hadoop jobs benefit
from being part of some sort of workflow. However, in theory, nothing stops you from using the JobTask class (and
also HdfsTarget) without using the rest of Luigi. You can simply run it manually using
MyJobTask('abc', 123).run()
You can use the hdfs.target.HdfsTarget class anywhere by just instantiating it:
t = luigi.contrib.hdfs.target.HdfsTarget('/tmp/test.gz', format=format.Gzip)
f = t.open('w')
# ...
f.close() # needed
The scheduler decides which task to run next from the set of all tasks that have all their dependencies met. By default,
this choice is pretty arbitrary, which is fine for most workflows and situations.
If you want to have some control on the order of execution of available tasks, you can set the priority property of
a task, for example as follows:
Tasks with a higher priority value will be picked before tasks with a lower priority value. There is no predefined range
of priorities; you can choose whatever (int or float) values you want to use. The default value is 0.
Warning: task execution order in Luigi is influenced by both dependencies and priorities, but in Luigi dependencies
come first. For example: if there is a task A with priority 1000 but still with unmet dependencies and a task B with
priority 1 without any pending dependencies, task B will be picked first.
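The interplay of dependencies and priorities in the warning above can be modelled with a short selection function. This is a simplified sketch, not the scheduler's algorithm; pick_next and the task table are hypothetical.

```python
def pick_next(tasks, completed):
    """Pick the highest-priority task whose dependencies are all completed.

    tasks maps a task name to (priority, list of dependency names).
    Returns None when no pending task is ready.
    """
    ready = [
        (priority, name)
        for name, (priority, deps) in tasks.items()
        if name not in completed and all(dep in completed for dep in deps)
    ]
    if not ready:
        return None
    return max(ready)[1]


tasks = {
    "A": (1000, ["C"]),  # high priority, but C has not run yet
    "B": (1, []),        # low priority, no pending dependencies
    "C": (0, []),
}
```

With nothing completed, pick_next(tasks, set()) selects B, because A is filtered out before priorities are even compared. A only becomes eligible once C is in the completed set.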
In order to avoid name clashes and to be able to have an identifier for tasks, Luigi introduces the concepts
task_namespace, task_family and task_id. The namespace and family operate on the class level, whereas the task id
only exists on the instance level. The concepts are best illustrated using code.
import luigi

class MyTask(luigi.Task):
    my_param = luigi.Parameter()
    task_namespace = 'my_namespace'

my_task = MyTask(my_param='hello')
print(my_task)  # --> my_namespace.MyTask(my_param=hello)
The full documentation for this machinery exists in the task module.
In addition to the stuff mentioned above, Luigi also does some metaclass logic so that if e.g.
DailyReport(datetime.date(2012, 5, 10)) is instantiated twice in the code, it will in fact result in
the same object. See Instance caching for more info.
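For readers unfamiliar with the technique, the following sketch shows one way a metaclass can return the same object for equal constructor arguments. It is a generic illustration, not Luigi's metaclass; CachedInstances and SketchReport are hypothetical, and the sketch assumes all arguments are hashable.

```python
import datetime


class CachedInstances(type):
    """Metaclass that returns the same object for equal constructor arguments."""

    _cache = {}

    def __call__(cls, *args, **kwargs):
        # The key combines the class with the exact arguments passed in.
        key = (cls, args, tuple(sorted(kwargs.items())))
        if key not in CachedInstances._cache:
            CachedInstances._cache[key] = super().__call__(*args, **kwargs)
        return CachedInstances._cache[key]


class SketchReport(metaclass=CachedInstances):
    def __init__(self, date):
        self.date = date


first = SketchReport(datetime.date(2012, 5, 10))
second = SketchReport(datetime.date(2012, 5, 10))
```

One simplification to be aware of: this sketch treats a positional argument and the same value passed by keyword as different keys, so it is stricter than a cache keyed on resolved parameter values would be.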
8.4 Parameters
Parameters are the Luigi equivalent of creating a constructor for each Task. Luigi requires you to declare these
parameters by instantiating Parameter objects on the class scope:
class DailyReport(luigi.contrib.hadoop.JobTask):
    date = luigi.DateParameter(default=datetime.date.today())
    # ...
By doing this, Luigi can take care of all the boilerplate code that would normally be needed in the constructor.
Internally, the DailyReport object can now be constructed by running DailyReport(datetime.date(2012, 5,
10)) or just DailyReport(). Luigi also creates a command line parser that automatically handles the
conversion from strings to Python types. This way you can invoke the job on the command line e.g. by passing --date
2012-05-10.
The parameters are all set to their values on the Task object instance, i.e.
d = DailyReport(datetime.date(2012, 5, 10))
print(d.date)
will return the same date that the object was constructed with. Same goes if you invoke Luigi on the command line.
Tasks are uniquely identified by their class name and values of their parameters. In fact, within the same worker, two
tasks of the same class with parameters of the same values are not just equal, but the same instance:
If a parameter is created with significant=False, it is ignored as far as the Task signature is concerned. Tasks
created with only insignificant parameters differing have the same signature but are not the same instance:
Using ParameterVisibility you can configure parameter visibility. By default, all parameters are public, but
you can also set them hidden or private.
>>> luigi.Parameter(visibility=ParameterVisibility.PRIVATE)
In the examples above, the type of the parameter is determined by using different subclasses of Parameter. There
are a few of them, like DateParameter, DateIntervalParameter, IntParameter, FloatParameter,
etc.
Python is not a statically typed language and you don’t have to specify the types of any of your parameters. You can
simply use the base class Parameter if you don’t care.
The reason you would use a subclass like DateParameter is that Luigi needs to know its type for the command
line interaction. That’s how it knows how to convert a string provided on the command line to the corresponding type
(i.e. datetime.date instead of a string).
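The string-to-type conversion can be pictured as a lookup from parameter kind to parsing function. The table below is a hypothetical sketch using the standard library, not the Parameter classes themselves; the kind names and parse_argument are made up for illustration.

```python
import datetime

# Hypothetical parser table: one string-to-value function per parameter kind.
PARSERS = {
    "date": lambda text: datetime.datetime.strptime(text, "%Y-%m-%d").date(),
    "int": int,
    "float": float,
    "str": str,
}


def parse_argument(kind, text):
    """Convert a command line string into the Python value for that kind."""
    return PARSERS[kind](text)
```

Without a declared kind, the command line value 2012-05-10 would stay a string. With it, the task receives a datetime.date and can do date arithmetic directly.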
All parameters are also exposed on a class level on the command line interface. For instance, say you have classes
TaskA and TaskB:
class TaskA(luigi.Task):
    x = luigi.Parameter()

class TaskB(luigi.Task):
    y = luigi.Parameter()
You can run TaskB on the command line: luigi TaskB --y 42. But you can also set the class value of TaskA
by running luigi TaskB --y 42 --TaskA-x 43. This sets the value of TaskA.x to 43 on a class level. It
is still possible to override it inside Python if you instantiate TaskA(x=44).
All parameters can also be set from the configuration file. For instance, you can put this in the config:
[TaskA]
x: 45
Just as in the previous case, this will set the value of TaskA.x to 45 on the class level. And likewise, it is still possible
to override it inside Python if you instantiate TaskA(x=44).
Parameters are resolved in the following order of decreasing priority:
1. Any value passed to the constructor, or task level value set on the command line (applies on an instance level)
2. Any value set on the command line (applies on a class level)
3. Any configuration option (applies on a class level)
4. Any default value provided to the parameter (applies on a class level)
See the Parameter class for more information.
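The four-level order listed above amounts to taking the first source that supplies a value. A minimal sketch, with resolve_parameter as a hypothetical helper and a private sentinel so that None remains a legal value:

```python
_MISSING = object()


def resolve_parameter(constructor=_MISSING, command_line=_MISSING,
                      config=_MISSING, default=_MISSING):
    """Return the first value that is set, in decreasing order of priority."""
    for value in (constructor, command_line, config, default):
        if value is not _MISSING:
            return value
    raise ValueError("no value available for parameter")
```

Using the TaskA.x values from the earlier examples, resolve_parameter(constructor=44, command_line=43, config=45) gives 44, and dropping the constructor argument gives 43.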
The preferred way to run Luigi tasks is through the luigi command line tool that will be installed with the pip
package.
class MyTask(luigi.Task):
    x = luigi.IntParameter()
    y = luigi.IntParameter(default=45)

    def run(self):
        print(self.x + self.y)
Note that if a parameter name contains ‘_’, it should be replaced by ‘-’. For example, if MyTask had a parameter called
‘my_parameter’:
Note: Please make sure to always place task parameters behind the task family!
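The naming rule for flags can be written down as a one-line transformation. The helper below is hypothetical and only mirrors the two flag shapes that appear in this document: the task-level form and the class-level form such as --TaskA-x.

```python
def cli_flag(parameter_name, task_family=None):
    """Build the command line flag for a parameter name.

    With task_family set, build the class-level form such as --TaskA-x.
    """
    dashed = parameter_name.replace("_", "-")
    if task_family is None:
        return "--" + dashed
    return "--%s-%s" % (task_family, dashed)
```

So a parameter declared as my_parameter is passed as --my-parameter, placed after the task family on the command line.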
class MyTask1(luigi.Task):
    x = luigi.IntParameter()
    y = luigi.IntParameter(default=0)

    def run(self):
        print(self.x + self.y)


class MyTask2(luigi.Task):
    x = luigi.IntParameter()
    y = luigi.IntParameter(default=1)
    z = luigi.IntParameter(default=2)

    def run(self):
        print(self.x * self.y * self.z)


if __name__ == '__main__':
    luigi.build([MyTask1(x=10), MyTask2(x=15, z=3)])
Also, it is possible to pass additional parameters to build such as host, port, workers and local_scheduler:
if __name__ == '__main__':
    luigi.build([MyTask1(x=1)], workers=5, local_scheduler=True)
To achieve some special requirements you can pass to build your worker_scheduler_factory which will
return your worker and/or scheduler implementations:
class MyWorker(Worker):
    # some custom logic
    pass


class MyFactory(object):
    def create_local_scheduler(self):
        return scheduler.Scheduler(prune_on_get_work=True, record_task_history=False)


if __name__ == '__main__':
    luigi.build([MyTask1(x=1)], worker_scheduler_factory=MyFactory())
• Default response: By default, luigi.build()/luigi.run() returns True if there were no scheduling errors. This is the
same as the attribute LuigiRunResult.scheduling_succeeded.
• Detailed response: This is a response of type LuigiRunResult. It is obtained by passing the keyword
argument detailed_summary=True to build/run. This response contains detailed information about the
jobs.
if __name__ == '__main__':
    luigi_run_result = luigi.build(..., detailed_summary=True)
    print(luigi_run_result.summary_text)
While the --local-scheduler flag is useful for development purposes, it’s not recommended for production
usage. The centralized scheduler serves two purposes:
• Make sure two instances of the same task are not running simultaneously
• Provide visualization of everything that’s going on.
Note that the central scheduler does not execute anything for you or help you with job parallelization. For running
tasks periodically, the easiest thing to do is to trigger a Python script from cron or from a continuously running process.
There is no central process that automatically triggers jobs. This model may seem limited, but we believe that it makes
things far more intuitive and easy to understand.
Note that this requires python-daemon. By default, the server starts on AF_INET and AF_INET6 port 8082
(which can be changed with the --port flag) and listens on all IPs. (To use an AF_UNIX socket use the
--unix-socket flag)
For a full list of configuration options and defaults, see the scheduler configuration section. Note that luigid uses
the same configuration files as the Luigi client (i.e. luigi.cfg or /etc/luigi/client.cfg by default).
Task History is an experimental feature in which additional information about tasks that have been executed is
recorded in a relational database for historical analysis. This information is exposed via the Central Scheduler at
/history.
To enable the task history, specify record_task_history = True in the [scheduler] section of luigi.cfg
and specify db_connection under [task_history]. The db_connection string is used to configure
the SQLAlchemy engine. When starting up, luigid will create all the necessary tables using create_all.
Example configuration
[scheduler]
record_task_history = True
state_path = /usr/local/var/luigi-state.pickle
[task_history]
db_connection = sqlite:////usr/local/var/luigi-task-hist.db
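Before starting luigid, it can be handy to check that the configuration says what you think it says. The sketch below parses the example above with Python’s standard configparser module. The helper name read_task_history_settings is made up for illustration, and this is not how luigid itself loads its configuration.

```python
import configparser

EXAMPLE_CFG = """
[scheduler]
record_task_history = True
state_path = /usr/local/var/luigi-state.pickle

[task_history]
db_connection = sqlite:////usr/local/var/luigi-task-hist.db
"""


def read_task_history_settings(cfg_text):
    """Return (history enabled?, db_connection or None) for a config given as text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Both lookups fall back to "disabled" when the section or option is missing.
    enabled = parser.getboolean("scheduler", "record_task_history", fallback=False)
    db_connection = parser.get("task_history", "db_connection", fallback=None)
    return enabled, db_connection


if __name__ == "__main__":
    print(read_task_history_settings(EXAMPLE_CFG))
```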
• /history/by_id/:id detailed information about a run, including: parameter values, the host on which it
ran, and timing information.
• /history/by_name/:name a listing of all runs of a task with the given task name.
The most important aspect is that no execution is transferred. When you run a Luigi workflow, the worker schedules
all tasks, and also executes the tasks within the process.
The benefit of this scheme is that it’s super easy to debug since all execution takes place in the process. It also makes
deployment a non-event. During development, you typically run the Luigi workflow from the command line, whereas
when you deploy it, you can trigger it using crontab or any other scheduler.
The downside is that Luigi doesn’t give you scalability for free. In practice this is not a problem until you start running
thousands of tasks.
Isn’t the point of Luigi to automate and schedule these workflows? To some extent. Luigi helps you encode the
dependencies of tasks and build up chains. Furthermore, Luigi’s scheduler makes sure that there’s a centralized view
of the dependency graph and that the same job will not be executed by multiple workers simultaneously.
8.7.2 Scheduler
A client only starts the run() method of a task when the single-threaded central scheduler has permitted it. Since the
number of tasks is usually very small (in comparison with the petabytes of data one task is processing), we can afford
the convenience of a simple centralised server.
The gif is from this presentation, which is about the client and server interaction.
Luigi does not include its own triggering, so you have to rely on an external scheduler such as crontab to actually
trigger the workflows.
In practice, this is not a big hurdle, because Luigi avoids the mess typically caused by relying on an external
trigger. Scheduling a complex workflow is fairly trivial using e.g. crontab.
In the future, Luigi might implement its own triggering. The dependency on crontab (or any external triggering
mechanism) is a bit awkward and it would be nice to avoid.
Trigger example
For instance, if your workflow depends on an external data dump that arrives every day, you write
a workflow that depends on this data dump. Crontab can then trigger this workflow every minute to check if the data
has arrived. If it has, it will run the full dependency graph.
# my_tasks.py
import datetime
import luigi
import luigi.contrib.hdfs
class DataDump(luigi.ExternalTask):
    date = luigi.DateParameter()
    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(self.date.strftime('/var/log/dump/%Y-%m-%d.txt'))
class AggregationTask(luigi.Task):
    date = luigi.DateParameter()
    window = luigi.IntParameter()
    def requires(self):
        return [DataDump(self.date - datetime.timedelta(i)) for i in range(self.window)]
class RunAll(luigi.Task):
    ''' Dummy task that triggers execution of other tasks'''
    def requires(self):
        for window in [3, 7, 14]:
            for d in range(10):  # guarantee that aggregations were run for the past 10 days
                yield AggregationTask(datetime.date.today() - datetime.timedelta(d), window)
You can trigger this as much as you want from crontab, and even across multiple machines, because the central
scheduler will make sure at most one of each AggregationTask task is run simultaneously. Note that
this might actually mean multiple tasks can be run because there are instances with different parameters, and this
can give you some form of parallelization (e.g. AggregationTask(2013-01-09) might run in parallel with
AggregationTask(2013-01-08)).
Of course, some Task types (eg. HadoopJobTask) can transfer execution to other places, but this is up to each Task
to define.
One nice thing about Luigi is that it’s super easy to depend on tasks defined in other repos. It’s also trivial to have
“forks” in the execution path, where the output of one task may become the input of many other tasks.
Currently, no semantics for “intermediate” output are supported, meaning that all output will be persisted indefinitely.
The upside of that is that if you try to run X -> Y, and Y crashes, you can resume with the previously built X. The
downside is that you will have a lot of intermediate results on your file system. A useful pattern is to put these files in
a special directory and have some kind of periodic garbage collection clean it up.
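The garbage collection mentioned above could look roughly like the following. This is a minimal sketch using only the standard library, not something Luigi ships. The function name collect_garbage and the age threshold are assumptions, and it only looks at regular files directly inside the directory.

```python
import os
import time


def collect_garbage(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` whose modification time is older
    than `max_age_seconds`, and return the names that were removed."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Subdirectories are left alone in this sketch.
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

You could call it from the same crontab that triggers your workflows, for example with max_age_seconds set to one week.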
A convenient pattern is to have a dummy Task at the end of several dependency chains, so you can trigger a multitude
of pipelines by specifying just one task on the command line, similarly to how e.g. make works.
class AllReports(luigi.WrapperTask):
    date = luigi.DateParameter(default=datetime.date.today())
    def requires(self):
        yield SomeReport(self.date)
        yield SomeOtherReport(self.date)
        yield CropReport(self.date)
        yield TPSReport(self.date)
        yield FooBarBazReport(self.date)
This simple task will not do anything itself, but will invoke a bunch of other tasks. On each invocation, Luigi will
perform as many of the pending jobs as possible (those which have all their dependencies present).
You’ll need to use WrapperTask for this instead of the usual Task class, because this job will not produce any output
of its own, and as such needs a way to indicate when it’s complete. This class is used for tasks that only wrap other
tasks and that by definition are done if all their requirements exist.
A common requirement is to have a daily report (or something else) produced every night. Sometimes for various
reasons tasks will keep crashing or lacking their required dependencies for more than a day though, which would lead
to a missing deliverable for some date. Oops.
To ensure that the above AllReports task is eventually completed for every day (value of the date parameter), one could
e.g. add a loop in the requires method to yield dependencies on the past few days preceding self.date. Then, so long as
Luigi keeps being invoked, the backlog of jobs would catch up nicely after intermittent problems are fixed.
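The date arithmetic behind such a loop is small. Here is a sketch in plain Python, using the hypothetical helper name dates_to_require. Inside a requires method you would yield one dependency per returned date.

```python
import datetime


def dates_to_require(date, lookback_days):
    """Return `date` followed by the `lookback_days` days before it, newest first."""
    return [date - datetime.timedelta(days=i) for i in range(lookback_days + 1)]


if __name__ == "__main__":
    for day in dates_to_require(datetime.date(2015, 1, 3), 2):
        print(day.isoformat())
```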
Luigi actually comes with a reusable tool for achieving this, called RangeDailyBase (resp. RangeHourlyBase).
Simply putting
in your crontab will easily keep gaps from occurring from 2015-01-01 onwards. NB - it will not always loop
over everything from 2015-01-01 till current time though, but rather a maximum of 3 months ago by default - see
RangeDailyBase documentation for this and more knobs for tweaking behavior. See also Monitoring below.
RangeDailyBase, described above, is named like that because a more efficient subclass exists, RangeDaily (resp.
RangeHourly), tailored for hundreds of task classes scheduled concurrently with contiguousness requirements
spanning years (which would incur redundant completeness checks and scheduler overload using the naive looping
approach.) Usage:
It has the same knobs as RangeDailyBase, with some added requirements. Namely, the task must either implement an
efficient bulk_complete method, or write its output to a file system Target with the date parameter value consistently
represented in the file path.
Another common use case: you have tweaked the code of an existing recurring task and, for that or another reason,
you want to schedule its recomputation over an interval of dates. This is most conveniently achieved with the range
tools described above, with both the start (inclusive) and stop (exclusive) parameters specified:
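The inclusive start and exclusive stop form a half-open interval. As a plain-Python illustration (the helper daily_range is hypothetical and is not part of the range tools), these are the dates such an interval covers:

```python
import datetime


def daily_range(start, stop):
    """Yield every date d with start <= d < stop."""
    current = start
    while current < stop:
        yield current
        current += datetime.timedelta(days=1)


if __name__ == "__main__":
    # The stop date itself is not included.
    for day in daily_range(datetime.date(2015, 1, 30), datetime.date(2015, 2, 2)):
        print(day.isoformat())
```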
Some tasks you want to recur may include additional parameters which need to be configured. The Range classes
provide a parameter which accepts a DictParameter and passes any parameters onwards for this purpose.
Alternatively, you can specify parameters at the task family level (as described here), however these will not appear
in the task name for the upstream Range task which can have implications in how the scheduler and visualizer handle
task instances.
Sometimes it’ll be faster to run multiple jobs together as a single batch rather than running them each individually.
When this is the case, you can mark some parameters with a batch_method in their constructor to tell the worker how
to combine multiple values. One common way to do this is by simply running the maximum value. This is good for
tasks that overwrite older data when a newer one runs. You accomplish this by setting the batch_method to max, like
so:
class A(luigi.Task):
    date = luigi.DateParameter(batch_method=max)
What’s exciting about this is that if you send multiple As to the scheduler, it can combine them and return
one. So if A(date=2016-07-28), A(date=2016-07-29) and A(date=2016-07-30) are all
ready to run, you will start running A(date=2016-07-30). While this is running, the scheduler will show
A(date=2016-07-28) and A(date=2016-07-29) as batch running.
When A(date=2016-07-30) is done running and becomes FAILED or DONE, the other two tasks will be updated
to the same status.
If you want to limit how big a batch can get, simply set max_batch_size. So if you have
class A(luigi.Task):
    date = luigi.DateParameter(batch_method=max)
    max_batch_size = 10
then the scheduler will batch at most 10 jobs together. You probably do not want to do this with the max batch method,
but it can be helpful if you use other methods. You can use any method that takes a list of parameter values and returns
a single parameter value.
If you have two max batch parameters, you’ll get the max values for both of them. If you have parameters that don’t
have a batch method, they’ll be aggregated separately. So if you have a class like
class A(luigi.Task):
    p1 = luigi.IntParameter(batch_method=max)
    p2 = luigi.IntParameter(batch_method=max)
    p3 = luigi.IntParameter()
and you create tasks A(p1=1, p2=2, p3=0), A(p1=2, p2=3, p3=0), A(p1=3, p2=4, p3=1), you’ll
get them batched as A(p1=2, p2=3, p3=0) and A(p1=3, p2=4, p3=1).
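To make the grouping rule concrete, here is a small simulation in plain Python. It is only a sketch of the behavior described above, not the scheduler’s actual code. batch_tasks and its arguments are names invented for this example.

```python
from collections import OrderedDict


def batch_tasks(tasks, batch_methods):
    """Combine tasks that agree on every parameter without a batch method.

    tasks: list of dicts mapping parameter name -> value.
    batch_methods: dict mapping batched parameter name -> aggregation function.
    """
    groups = OrderedDict()
    for params in tasks:
        # Tasks can only share a batch when all unbatched parameters are equal.
        key = tuple(sorted((k, v) for k, v in params.items() if k not in batch_methods))
        groups.setdefault(key, []).append(params)

    batched = []
    for key, members in groups.items():
        combined = dict(key)
        for name, method in batch_methods.items():
            # Each batched parameter is aggregated independently of the others.
            combined[name] = method([member[name] for member in members])
        batched.append(combined)
    return batched


if __name__ == "__main__":
    pending = [
        {"p1": 1, "p2": 2, "p3": 0},
        {"p1": 2, "p2": 3, "p3": 0},
        {"p1": 3, "p2": 4, "p3": 1},
    ]
    for task in batch_tasks(pending, {"p1": max, "p2": max}):
        print(task)
```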
Note that batched tasks do not take up [resources], only the task that ends up running will use resources. The scheduler
only checks that there are sufficient resources for each task individually before batching them all together.
If you are overwriting the same data source with every run, you’ll need to ensure that two batches can’t run at the
same time. You can do this pretty easily by setting batch_method to max and setting a unique resource:
class A(luigi.Task):
    date = luigi.DateParameter(batch_method=max)
    resources = {'overwrite_resource': 1}
Updating a single file from several tasks is almost always a bad idea, and you need to be very confident that no other
good solution exists before doing this. If, however, you have no other option, then you will probably at least need to
ensure that no two tasks try to write to the file simultaneously.
By turning ‘resources’ into a Python property, it can return a value dependent on the task parameters or other dynamic
attributes:
class A(luigi.Task):
    ...
    @property
    def resources(self):
        return { self.important_file_name: 1 }
Since, by default, resources have a usage limit of 1, no two instances of Task A will now run if they have the same
important_file_name property.
At scheduling time, the luigi scheduler needs to be aware of the maximum resource consumption a task might have
once it runs. For some tasks, however, it can be beneficial to decrease the amount of consumed resources between
two steps within their run method (e.g. after some heavy computation). In this case, a different task waiting for that
particular resource can already be scheduled.
class A(luigi.Task):
    def run(self):
        # do something
        ...
Luigi comes with some existing ways in luigi.notifications to receive notifications whenever tasks crash.
Email is the most common way.
The above mentioned range tools for recurring tasks not only implement reliable scheduling for you, but also emit
events which you can use to set up delay monitoring. That way you can implement alerts for when jobs are stuck for
prolonged periods lacking input data or otherwise requiring attention.
A very common mistake made by Luigi plumbers is to write data partially to the final destination, that is, not atomically.
The problem arises because completion checks in Luigi are exactly as naive as running luigi.target.Target.exists(),
which in many cases just means checking whether a folder exists on disk. While the data is only partially
written, a task depending on that output would think its input is complete. This can have devastating effects, as in
the thanksgiving bug.
The concept can be illustrated by imagining that we deal with data stored on local disk and by running commands:
As stated earlier, the problem is that only partial data exists for a duration, yet we consider the data to be complete()
because the output folder already exists. Here is a robust version of this:
Indeed, the good way is not as trivial. It involves coming up with a unique directory name and a fairly complex mv
line. mv needs all of that because we don’t want it to move a directory into a potentially existing directory.
A directory could already exist in exceptional cases, for example when central locking fails and the same task
somehow runs twice at the same time. Lastly, in the exceptional case where the file was never moved, one might want
to remove the temporary directory that never got used.
Note that in this example the storage was on local disk. For every kind of storage (hard disk file, HDFS file,
database table, etc.) this procedure will look different. But does every Luigi user need to implement that complexity?
No. Thankfully, the Luigi developers are aware of these problems, and Luigi comes with many built-in solutions. If
you’re dealing with a file system (FileSystemTarget), you should consider using temporary_path(). For
other targets, you should ensure that the way you write your final output is atomic.
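For a single file on local disk, the write-then-move idea can be sketched with the standard library alone. This is an illustration of the principle, not Luigi’s temporary_path() implementation. The function name write_atomically is an assumption, and the atomicity of os.replace relies on the temporary file living on the same file system as the destination.

```python
import os
import tempfile


def write_atomically(final_path, data):
    """Write `data` to a uniquely named temporary file, then move it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file next to the destination so the final move
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
        # Readers see either the old content or the complete new content.
        os.replace(tmp_path, final_path)
    except BaseException:
        # The move never happened: remove the unused temporary file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A completeness check based on the existence of final_path can then no longer observe a half-written file.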
The central scheduler is able to send messages to particular tasks. When a running task accepts messages, it can access
a multiprocessing.Queue object storing incoming messages. You can implement custom behavior to react and respond
to messages:
class Example(luigi.Task):
    # configure the task to accept incoming messages
    accepts_messages = True
    def run(self):
        # this example runs some loop and listens for the
        # "terminate" message, and responds to all other messages
        for _ in some_loop():
            # check incoming messages
            if not self.scheduler_messages.empty():
                msg = self.scheduler_messages.get()
                if msg.content == "terminate":
                    break
                else:
                    msg.respond("unknown message")
        # finalize
        ...
Messages can be sent right from the scheduler UI which also displays responses (if any). Note that this feature is only
available when the scheduler is configured to send messages (see the [scheduler] config), and the task is configured to
accept them.
8.9 Configuration
[hadoop]
version=cdh4
streaming_jar=/usr/lib/hadoop-xyz/hadoop-streaming-xyz-123.jar
[core]
scheduler_host=luigi-host.mycompany.foo
[hadoop]
version = "cdh4"
streaming_jar = "/usr/lib/hadoop-xyz/hadoop-streaming-xyz-123.jar"
[core]
scheduler_host = "luigi-host.mycompany.foo"
All parameters can be overridden from configuration files. For instance if you have a Task definition:
class DailyReport(luigi.contrib.hadoop.JobTask):
    date = luigi.DateParameter(default=datetime.date.today())
    # ...
Then you can override the default value for DailyReport().date by providing it in the configuration:
[DailyReport]
date=2012-01-01
Configuration classes
Using the Parameters from config Ingestion method, we derive the conventional way to do global configuration.
Imagine this configuration.
[mysection]
option=hello
intoption=123
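import luigi
class mysection(luigi.Config):  # values are read from the [mysection] section
    option = luigi.Parameter()
    intoption = luigi.IntParameter()
mysection().option
mysection().intoption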
Luigi comes with a lot of configurable options. Below, we describe each section and the parameters available within
it.
8.9.3 [core]
These parameters control core Luigi behavior, such as error e-mails and interactions between the worker and scheduler.
autoload_range New in version 2.8.11.
If false, prevents range tasks from autoloading. They can still be loaded using --module luigi.tools.
range. Defaults to true. Setting this to true explicitly disables the deprecation warning.
default_scheduler_host Hostname of the machine running the scheduler. Defaults to localhost.
default_scheduler_port Port of the remote scheduler api process. Defaults to 8082.
default_scheduler_url Full path to remote scheduler. Defaults to http://localhost:8082/. For TLS support
use the URL scheme: https, example: https://luigi.example.com:443/ (Note: you will have to
terminate TLS using an HTTP proxy). You can also use this to connect to a local Unix socket using the
non-standard URI scheme: http+unix, example: http+unix://%2Fvar%2Frun%2Fluigid%2Fluigid.sock/
hdfs_tmp_dir Base directory in which to store temporary files on HDFS. Defaults to tempfile.gettempdir().
history_filename If set, specifies a filename to which Luigi writes information (currently just the job id) in the
MapReduce job’s output directory. Useful in a configuration where no history is stored in the output directory by Hadoop.
log_level The default log level to use when no logging_conf_file is set. Must be a valid name of a Python log level.
Default is DEBUG.
logging_conf_file Location of the logging configuration file.
max_shown_tasks New in version 1.0.20.
The maximum number of tasks returned in a task_list api call. This will restrict the number of tasks shown in
task lists in the visualiser. Small values can alleviate frozen browsers when there are too many done tasks. This
defaults to 100000 (one hundred thousand).
max_graph_nodes New in version 2.0.0.
The maximum number of nodes returned by a dep_graph or inverse_dep_graph api call. Small values can greatly
speed up graph display in the visualiser by limiting the number of nodes shown. Some of the nodes that are
not sent to the visualiser will still show up as dependencies of nodes that were sent. These nodes are given
TRUNCATED status.
no_configure_logging If true, logging is not configured. Defaults to false.
parallel_scheduling If true, the scheduler will compute complete functions of tasks in parallel using multiprocessing.
This can significantly speed up scheduling, but requires that all tasks can be pickled. Defaults to false.
parallel_scheduling_processes The number of processes to use for parallel scheduling. If not specified the default
number of processes will be the total number of CPUs available.
rpc_connect_timeout Number of seconds to wait before timing out when making an API call. Defaults to 10.0.
rpc_retry_attempts The maximum number of retries to connect to the central scheduler before giving up. Defaults to 3.
rpc_retry_wait Number of seconds to wait between two attempts to connect to the central scheduler.
Defaults to 30.
8.9.4 [cors]
8.9.5 [worker]
send_failure_email Controls whether the worker will send e-mails on task and scheduling failures. If set to false,
workers will only send e-mails on framework errors during scheduling and all other e-mail must be handled by
the scheduler. Defaults to true.
check_unfulfilled_deps If true, the worker checks for completeness of dependencies before running a task. In case
unfulfilled dependencies are detected, an exception is raised and the task will not run. This mechanism is useful
to detect situations where tasks do not create their outputs properly, or when targets were removed after the
dependency tree was built. It is recommended to disable this feature only when the completeness checks are
known to be bottlenecks, e.g. when the exists() calls of the dependencies’ outputs are resource-intensive.
Defaults to true.
force_multiprocessing By default, Luigi uses multiprocessing when more than one worker process is requested. When
set to true, multiprocessing is used independent of the number of workers. Defaults to false.
8.9.6 [elasticsearch]
8.9.7 [email]
General parameters
force_send If true, e-mails are sent in all run configurations (even if stdout is connected to a tty device). Defaults to
False.
format Type of e-mail to send. Valid values are “plain”, “html” and “none”. When set to html, tracebacks are wrapped
in <pre> tags to get a fixed-width font. When set to none, no e-mails will be sent.
Default value is plain.
method Valid values are “smtp”, “sendgrid”, “ses” and “sns”. SES and SNS are services of Amazon Web Services.
SendGrid is an email delivery service. The default value is “smtp”.
In order to send messages through Amazon SNS or SES set up your AWS config files or run Luigi on an EC2
instance with proper instance profile.
In order to use sendgrid, fill in your sendgrid API key in the [sendgrid] section.
In order to use smtp, fill in the appropriate fields in the [smtp] section.
prefix Optional prefix to add to the subject line of all e-mails. For example, setting this to “[LUIGI]” would change
the subject line of an e-mail from “Luigi: Framework error” to “[LUIGI] Luigi: Framework error”
receiver Recipient of all error e-mails. If this is not set, no error e-mails are sent when Luigi crashes unless the
crashed job has owners set. If Luigi is run from the command line, no e-mails will be sent unless output is
redirected to a file.
Set it to SNS Topic ARN if you want to receive notifications through Amazon SNS. Make sure to set method to
sns in this case too.
sender User name in from field of error e-mails. Default value: luigi-client@<server_name>
8.9.8 [batch_notifier]
Parameters controlling the contents of batch notifications sent from the scheduler
email_interval Number of minutes between e-mail sends. Making this larger results in fewer, bigger e-mails. De-
faults to 60.
batch_mode Controls how tasks are grouped together in the e-mail. Suppose we have the following sequence of
failures:
1. TaskA(a=1, b=1)
2. TaskA(a=1, b=1)
3. TaskA(a=2, b=1)
4. TaskA(a=1, b=2)
5. TaskB(a=1, b=1)
For any setting of batch_mode, the batch e-mail will record 5 failures and mention them in the subject. The
difference is in how they will be displayed in the body. Here are example bodies with error_messages set to 0.
“all” only groups together failures for the exact same task:
• TaskA(a=1, b=1) (2 failures)
• TaskA(a=1, b=2) (1 failure)
• TaskA(a=2, b=1) (1 failure)
• TaskB(a=1, b=1) (1 failure)
“family” groups together failures for tasks of the same family:
• TaskA (4 failures)
• TaskB (1 failure)
“unbatched_params” groups together tasks that look the same after removing batched parameters. So if TaskA
has a batch_method set for parameter a, we get the following:
• TaskA(b=1) (3 failures)
• TaskA(b=2) (1 failure)
• TaskB(a=1, b=1) (1 failure)
Defaults to “unbatched_params”, which is identical to “all” if you are not using batched parameters.
error_lines Number of lines to include from each error message in the batch e-mail. This can be used to keep e-mails
shorter while preserving the more useful information usually found near the bottom of stack traces. This can
be set to 0 to include all lines. If you don’t wish to see error messages, instead set error_messages to 0.
Defaults to 20.
error_messages Number of messages to preserve for each task group. As most tasks that fail repeatedly do so for
similar reasons each time, it’s not usually necessary to keep every message. This controls how many messages
are kept for each task or task group. The most recent error messages are kept. Set to 0 to not include error
messages in the e-mails. Defaults to 1.
group_by_error_messages Quite often, a system or cluster failure will cause many disparate task types to fail for the
same reason. This can cause a lot of noise in the batch e-mails. This cuts down on the noise by listing items with
identical error messages together. Error messages are compared after limiting by error_lines. Defaults to
true.
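The three batch_mode settings can be mimicked with a few lines of plain Python. This is a sketch of the grouping behavior described above rather than the notifier’s real code. group_failures and the batched_params argument are hypothetical names.

```python
from collections import Counter


def group_failures(failures, batch_mode, batched_params=None):
    """Count failures per display label.

    failures: list of (task family, params dict) tuples.
    batched_params: dict mapping task family -> set of parameter names that
        have a batch_method (only used by "unbatched_params").
    """
    batched_params = batched_params or {}
    counts = Counter()
    for family, params in failures:
        if batch_mode == "family":
            shown = {}
        elif batch_mode == "unbatched_params":
            hidden = batched_params.get(family, set())
            shown = {k: v for k, v in params.items() if k not in hidden}
        elif batch_mode == "all":
            shown = params
        else:
            raise ValueError("unknown batch_mode: %r" % batch_mode)
        if shown:
            args = ", ".join("%s=%s" % (k, shown[k]) for k in sorted(shown))
            label = "%s(%s)" % (family, args)
        else:
            label = family
        counts[label] += 1
    return dict(counts)


if __name__ == "__main__":
    failures = [
        ("TaskA", {"a": 1, "b": 1}),
        ("TaskA", {"a": 1, "b": 1}),
        ("TaskA", {"a": 2, "b": 1}),
        ("TaskA", {"a": 1, "b": 2}),
        ("TaskB", {"a": 1, "b": 1}),
    ]
    for mode in ("all", "family", "unbatched_params"):
        print(mode, group_failures(failures, mode, {"TaskA": {"a"}}))
```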
8.9.9 [hadoop]
8.9.10 [hdfs]
8.9.11 [hive]
8.9.12 [kubernetes]
8.9.13 [mysql]
8.9.14 [postgres]
8.9.15 [redshift]
8.9.16 [resources]
This section can contain arbitrary keys. Each of these specifies the amount of a global resource that the scheduler can
allow workers to use. The scheduler will prevent running jobs with resources specified from exceeding the counts in
this section. Unspecified resources are assumed to have limit 1. Example resources section for a configuration with 2
hive resources and 1 mysql resource:
[resources]
hive=2
mysql=1
Note that it was not necessary to specify the 1 for mysql here, but it is good practice to do so when you have a fixed
set of resources.
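The admission rule described above can be written down in a few lines. The following is a sketch, not the scheduler’s implementation. can_run and its arguments are hypothetical.

```python
def can_run(task_resources, running_usage, limits):
    """Return True if starting a task keeps every resource within its limit.

    task_resources: dict resource name -> amount the task needs.
    running_usage: dict resource name -> amount used by running tasks.
    limits: dict resource name -> configured limit (the [resources] section).
    """
    for name, amount in task_resources.items():
        limit = limits.get(name, 1)  # unspecified resources have limit 1
        if running_usage.get(name, 0) + amount > limit:
            return False
    return True


if __name__ == "__main__":
    limits = {"hive": 2, "mysql": 1}
    print(can_run({"hive": 1}, {"hive": 1}, limits))
    print(can_run({"mysql": 1}, {"mysql": 1}, limits))
```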
8.9.17 [retcode]
Configure return codes for the Luigi binary. In the case of multiple return codes that could apply, for example a failing
task and missing data, the numerically greatest return code is returned.
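In other words, when several conditions apply, the configured codes are compared and the largest one wins. A sketch of that rule, assuming a plain dict of configured codes and treating unconfigured conditions as 0 (choose_return_code is a hypothetical helper, not part of Luigi):

```python
def choose_return_code(conditions, retcodes):
    """Return the numerically greatest configured code among `conditions`.

    conditions: iterable of condition names that occurred during the run.
    retcodes: dict condition name -> configured return code.
    """
    # Assumption: a condition without a configured code counts as 0.
    codes = [retcodes.get(name, 0) for name in conditions]
    return max(codes) if codes else 0


if __name__ == "__main__":
    configured = {"already_running": 10, "missing_data": 20}
    print(choose_return_code(["already_running", "missing_data"], configured))
```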
We recommend that you copy this set of exit codes to your luigi.cfg file:
[retcode]
# The following return codes are the recommended exit codes for Luigi
# They are in increasing level of severity (for most applications)
already_running=10
missing_data=20
already_running This can happen in two different cases. Either the local lock file was taken at the time the invocation
starts up, or the central scheduler has reported that some tasks could not be run because other workers
are already running them.
missing_data For when an ExternalTask is not complete, and this caused the worker to give up. As an alternative
to fiddling with this, see the [worker] keep_alive option.
not_run For when a task is not granted run permission by the scheduler. Typically because of lack of resources,
because the task has been already run by another worker or because the attempted task is in DISABLED state.
Connectivity issues with the central scheduler might also cause this. This does not include the cases for which a
run is not allowed due to missing dependencies (missing_data) or due to the fact that another worker is currently
running the task (already_running).
task_failed For signaling that there were tasks last known to have failed, typically because some exception has been
raised.
scheduling_error For when a task’s complete() or requires() method fails with an exception, or when the
limit on the number of tasks is reached.
unhandled_exception For internal Luigi errors. Defaults to 4, since this type of error probably will not recover over
time.
If you customize return codes, prefer to set them in range 128 to 255 to avoid conflicts. Return codes in range 0 to 127
are reserved for possible future use by Luigi contributors.
8.9.18 [scalding]
8.9.19 [scheduler]
retry_count Number of times a task can fail within disable_window_seconds before the scheduler will auto-
matically disable it. If not set, the scheduler will not automatically disable jobs.
disable_persist_seconds Number of seconds for which an automatic scheduler disable lasts. Defaults to 86400 (1
day).
disable_window_seconds Number of seconds during which retry_count failures must occur in order for an
automatic disable by the scheduler. The scheduler forgets about disables that have occurred longer ago than this
amount of time. Defaults to 3600 (1 hour).
record_task_history If true, stores task history in a database. Defaults to false.
remove_delay Number of seconds to wait before removing a task that has no stakeholders. Defaults to 600 (10
minutes).
retry_delay Number of seconds to wait after a task failure to mark it pending again. Defaults to 900 (15 minutes).
state_path Path in which to store the Luigi scheduler’s state. When the scheduler is shut down, its state is stored in
this path. The scheduler must be shut down cleanly for this to work, usually with a kill command. If the kill
command includes the -9 flag, the scheduler will not be able to save its state. When the scheduler is started, it
will load the state from this path if it exists. This will restore all scheduled jobs and other state from when the
scheduler last shut down.
Sometimes this path must be deleted when restarting the scheduler after upgrading Luigi, as old state files can
become incompatible with the new scheduler. When this happens, all workers should be restarted after the
scheduler both to become compatible with the updated code and to reschedule the jobs that the scheduler has
now forgotten about.
This defaults to /var/lib/luigi-server/state.pickle
worker_disconnect_delay Number of seconds to wait after a worker has stopped pinging the scheduler before re-
moving it and marking all of its running tasks as failed. Defaults to 60.
pause_enabled If false, disables pause/unpause operations and hides the pause toggle from the visualiser.
send_messages When true, the scheduler is allowed to send messages to running tasks and the central scheduler
provides a simple prompt per task to send messages. Defaults to true.
metrics_collector Optional setting allowing Luigi to use a contrib module to send metrics about the pipeline to a
third-party service. By default this uses the default metric collector that acts as a shell and does nothing. The currently
available options are “datadog” and “prometheus”.
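The interplay between retry_count and disable_window_seconds, described earlier in this section, amounts to counting failures inside a sliding time window. The sketch below illustrates the idea in plain Python. The class FailureWindow is hypothetical, and the exact comparison the scheduler uses (here: disable once the count reaches retry_count) is an assumption.

```python
from collections import deque


class FailureWindow(object):
    """Track failure timestamps and decide when a task should be disabled."""

    def __init__(self, retry_count, disable_window_seconds):
        self.retry_count = retry_count
        self.window = disable_window_seconds
        self.failures = deque()

    def record_failure(self, timestamp):
        """Record a failure at `timestamp` (in seconds) and return True if
        the task has now failed `retry_count` times within the window."""
        self.failures.append(timestamp)
        # Forget failures that happened longer ago than the window.
        while self.failures and self.failures[0] < timestamp - self.window:
            self.failures.popleft()
        return len(self.failures) >= self.retry_count


if __name__ == "__main__":
    window = FailureWindow(retry_count=3, disable_window_seconds=3600)
    for t in (0, 100, 4000, 4100, 4200):
        print(t, window.record_failure(t))
```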
8.9.20 [sendgrid]
8.9.21 [smtp]
8.9.22 [spark]
8.9.23 [task_history]
8.9.24 [execution_summary]
8.9.25 [webhdfs]
port The port to use for webhdfs. The normal namenode port is probably different from this one.
user Perform file system operations as the specified user instead of $USER. Since this parameter is not honored by
any of the other hdfs clients, you should think twice before setting this parameter.
client_type The type of client to use. Default is the “insecure” client that requires no authentication. The other option
is the “kerberos” client that uses kerberos authentication.
8.9.26 [datadog]
api_key The api key found in the account settings of Datadog under the API sections.
app_key The application key found in the account settings of Datadog under the API sections.
default_tags Optional setting that adds the given tags to all the metrics and events sent to Datadog. Default value is
“application:luigi”.
environment Sets the environment so that you can differentiate between production, staging or development
metrics within Datadog. Default value is “development”.
statsd_host The host running the statsd instance through which metrics are sent to Datadog. Default value is “localhost”.
statsd_port The port on the host that allows connection to the statsd host. Default value is 8125.
metric_namespace Optional prefix to add to the beginning of every metric sent to Datadog. Default value is “luigi”.
class GenerateWordsFromHdfs(luigi.Task):
    retry_count = 2
class GenerateWordsFromRDBM(luigi.Task):
    retry_count = 5
    ...
class CountLetters(luigi.Task):
    def requires(self):
        return [GenerateWordsFromHdfs()]
    def run(self):
        yield GenerateWordsFromRDBM()
    ...
If a retry-policy field is not defined on a task, its value falls back to the default defined in the Luigi config
file.
To make Luigi stick to the given retry policy, be sure to run the Luigi worker with the keep_alive option. See the
keep_alive option in the [worker] section.
The fields below are in retry-policy and they can be defined per task.
• retry_count
• disable_hard_timeout
• disable_window_seconds
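The fallback from per-task fields to configured defaults can be pictured with a small stdlib-only sketch. It does not use Luigi itself: the RetryPolicy tuple, the CONFIG_DEFAULTS values, and the effective_retry_policy helper are hypothetical names chosen for illustration, and the numbers are placeholders rather than Luigi's real defaults.

```python
from collections import namedtuple

# Hypothetical container for the three per-task retry-policy fields.
RetryPolicy = namedtuple(
    "RetryPolicy",
    ["retry_count", "disable_hard_timeout", "disable_window_seconds"],
)

# Placeholder values standing in for whatever the config file defines.
CONFIG_DEFAULTS = RetryPolicy(
    retry_count=3, disable_hard_timeout=3600, disable_window_seconds=600
)


class GenerateWordsFromHdfs:
    """Stand-in for a task class; only retry_count is overridden."""
    retry_count = 2


def effective_retry_policy(task_cls, defaults=CONFIG_DEFAULTS):
    """Use the task's own value for a field when present, else the default."""
    return RetryPolicy(
        *(getattr(task_cls, field, getattr(defaults, field))
          for field in RetryPolicy._fields)
    )


policy = effective_retry_policy(GenerateWordsFromHdfs)
print(policy)
```

Here only retry_count comes from the task; the other two fields are taken from the placeholder defaults.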
If you use TOML for the configuration file, you can configure logging via the logging section of that file. See the
example for more details.
--background Run daemon in background mode. Disable logging setup and set up log level to INFO for root
logger.
--logdir Set up logging at INFO level and write the output to the $logdir/luigi-server.log file.
1. no_configure_logging option
2. --background
3. --logdir
4. --logging-conf-file
5. logging_conf_file option
6. logging section
7. --log-level
8. log_level option
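Reading the numbered list as a priority order, with the first entry winning (an assumption, since the list carries no caption here), the selection could be sketched as follows. The pick_logging_source helper is hypothetical and only models the ordering; it does not configure any logger.

```python
# Entries copied from the numbered list; assumed to be highest priority first.
PRIORITY = [
    "no_configure_logging option",
    "--background",
    "--logdir",
    "--logging-conf-file",
    "logging_conf_file option",
    "logging section",
    "--log-level",
    "log_level option",
]


def pick_logging_source(settings_present):
    """Return the first entry of PRIORITY that the caller marked as set."""
    for name in PRIORITY:
        if name in settings_present:
            return name
    return None  # nothing configured; a caller would fall back to defaults


chosen = pick_logging_source({"--log-level", "logging section"})
print(chosen)
```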
Luigi is the successor to a couple of attempts that we weren’t fully happy with. We learned a lot from our mistakes
and some design decisions include:
• Straightforward command-line integration.
• As little boilerplate as possible.
• Focus on job scheduling and dependency resolution, not a particular platform. In particular, this means no
limitation to Hadoop. Though Hadoop/HDFS support is built-in and is easy to use, this is just one of many types
of things you can run.
• A file system abstraction where code doesn’t have to care about where files are located.
• Atomic file system operations through this abstraction. If a task crashes it won’t lead to a broken state.
• The dependencies are decentralized. No big config file in XML. Each task just specifies which inputs it needs
and cross-module dependencies are trivial.
• A web server that renders the dependency graph and does locking, etc for free.
• Trivial to extend with new file systems, file formats, and job types. You can easily write jobs that insert a Tokyo
Cabinet into Cassandra. Adding support for new systems is generally not very hard. (Feel free to send us a patch
when you’re done!)
• Date algebra included.
API Reference
9.1.1 Subpackages
luigi.configuration package
Submodules
luigi.configuration.base_parser module
class luigi.configuration.base_parser.BaseParser
luigi.configuration.cfg_parser module
luigi.configuration provides some convenience wrappers around Python’s ConfigParser to get configuration options
from config files.
The default location for configuration files is luigi.cfg (or client.cfg) in the current working directory, then
/etc/luigi/client.cfg.
Configuration has largely been superseded by parameters since they can do essentially everything configuration can
do, plus a tighter integration with the rest of Luigi.
See Configuration for more info.
exception luigi.configuration.cfg_parser.InterpolationMissingEnvvarError(option, section, value, envvar)
Bases: ConfigParser.InterpolationError
Raised when option value refers to a nonexisting environment variable.
class luigi.configuration.cfg_parser.EnvironmentInterpolation
Bases: object
Custom interpolation which allows values to refer to environment variables using the ${ENVVAR} syntax.
before_get(parser, section, option, value, defaults)
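A minimal sketch of the ${ENVVAR} idea, built on the standard library's configparser.Interpolation hook. This is not Luigi's implementation: EnvInterpolationSketch, the SKETCH_DATA_DIR and SKETCH_MISSING_VAR variables, and the error message are all made up for illustration.

```python
import configparser
import os
import re


class EnvInterpolationSketch(configparser.Interpolation):
    """Expand ${NAME} in option values from os.environ (illustrative only)."""

    _ENVVAR = re.compile(r"\$\{([^}]+)\}")

    def before_get(self, parser, section, option, value, defaults):
        def replace(match):
            name = match.group(1)
            if name not in os.environ:
                # Stand-in for a dedicated "missing envvar" error.
                raise configparser.InterpolationError(
                    option, section, "missing environment variable " + name
                )
            return os.environ[name]

        return self._ENVVAR.sub(replace, value)


os.environ["SKETCH_DATA_DIR"] = "/tmp/data"  # hypothetical variable
parser = configparser.ConfigParser(interpolation=EnvInterpolationSketch())
parser.read_string(
    "[core]\n"
    "output = ${SKETCH_DATA_DIR}/out\n"
    "broken = ${SKETCH_MISSING_VAR}\n"
)
resolved = parser.get("core", "output")
print(resolved)
```

Reading the broken option raises an interpolation error as long as SKETCH_MISSING_VAR is unset, which mirrors the role of the exception documented above.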
class luigi.configuration.cfg_parser.CombinedInterpolation(interpolations)
Bases: object
Custom interpolation which applies multiple interpolations in series.
Parameters interpolations – a sequence of configparser.Interpolation objects.
before_get(parser, section, option, value, defaults)
before_read(parser, section, option, value)
before_set(parser, section, option, value)
before_write(parser, section, option, value)
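Applying interpolations "in series" means that each one receives the value produced by the previous one. The sketch below shows that chaining with the standard library only; CombinedInterpolationSketch and UserPlaceholderInterpolation are hypothetical classes, not Luigi's code.

```python
import configparser


class CombinedInterpolationSketch(configparser.Interpolation):
    """Apply several interpolations one after another (illustrative only)."""

    def __init__(self, interpolations):
        self._interpolations = list(interpolations)

    def before_get(self, parser, section, option, value, defaults):
        for interpolation in self._interpolations:
            value = interpolation.before_get(
                parser, section, option, value, defaults)
        return value

    def before_read(self, parser, section, option, value):
        for interpolation in self._interpolations:
            value = interpolation.before_read(parser, section, option, value)
        return value

    def before_set(self, parser, section, option, value):
        for interpolation in self._interpolations:
            value = interpolation.before_set(parser, section, option, value)
        return value

    def before_write(self, parser, section, option, value):
        for interpolation in self._interpolations:
            value = interpolation.before_write(parser, section, option, value)
        return value


class UserPlaceholderInterpolation(configparser.Interpolation):
    """Toy interpolation: replaces the literal text <USER> with 'alice'."""

    def before_get(self, parser, section, option, value, defaults):
        return value.replace("<USER>", "alice")


parser = configparser.ConfigParser(
    interpolation=CombinedInterpolationSketch(
        [configparser.BasicInterpolation(), UserPlaceholderInterpolation()]
    )
)
parser.read_string("[paths]\nroot = /data\nout = %(root)s/<USER>\n")
combined_value = parser.get("paths", "out")
print(combined_value)
```

BasicInterpolation first expands %(root)s, and the toy interpolation then rewrites the placeholder in the already expanded value.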
class luigi.configuration.cfg_parser.LuigiConfigParser(defaults=None, dict_type=<class 'collections.OrderedDict'>, allow_no_value=False)
Bases: luigi.configuration.base_parser.BaseParser, ConfigParser.ConfigParser
NO_DEFAULT = <object object>
enabled = True
classmethod reload()
has_option(section, option)
modified has_option Check for the existence of a given option in a given section. If the specified ‘section’
is None or an empty string, DEFAULT is assumed. If the specified ‘section’ does not exist, returns False.
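The section handling described in this docstring matches what the standard library's ConfigParser.has_option does, so it can be tried without Luigi. The sketch below uses plain configparser and made-up section and option names; whatever the "modified" Luigi version changes beyond this is not shown.

```python
import configparser

parser = configparser.ConfigParser()
parser.read_string("[DEFAULT]\nowner = data-team\n\n[core]\nworkers = 4\n")

checks = {
    # Empty string and None both fall back to the DEFAULT section.
    "empty section": parser.has_option("", "owner"),
    "None section": parser.has_option(None, "owner"),
    # A regular lookup in an existing section.
    "existing section": parser.has_option("core", "workers"),
    # An unknown section yields False instead of raising.
    "missing section": parser.has_option("no_such_section", "owner"),
}
print(checks)
```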
get(section, option, default=<object object>, **kwargs)
getboolean(section, option, default=<object object>)
getint(section, option, default=<object object>)
luigi.configuration.core module
luigi.configuration.core.get_config(parser=’cfg’)
Get configs singleton for parser
luigi.configuration.core.add_config_path(path)
Select config parser by file extension and add path into parser.
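Selecting a parser by file extension could look roughly like the sketch below. The KNOWN_PARSERS set, the DEFAULT_PARSER fallback, and the select_parser_name helper are assumptions made for illustration; the real function may treat unknown extensions differently.

```python
import os

# Hypothetical registry of parser names, keyed by file extension.
KNOWN_PARSERS = {"cfg", "toml"}
DEFAULT_PARSER = "cfg"  # assumed fallback for unrecognised extensions


def select_parser_name(path):
    """Map a config file path to a parser name based on its extension."""
    _, ext = os.path.splitext(path)
    name = ext[1:].lower()
    return name if name in KNOWN_PARSERS else DEFAULT_PARSER


print(select_parser_name("/etc/luigi/luigi.toml"))
print(select_parser_name("luigi.cfg"))
```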
luigi.configuration.toml_parser module
class luigi.configuration.toml_parser.LuigiTomlParser
Bases: luigi.configuration.base_parser.BaseParser
NO_DEFAULT = <object object>
enabled = False
data = {}
read(config_paths)
get(section, option, default=<object object>, **kwargs)
getboolean(section, option, default=<object object>)
getint(section, option, default=<object object>)
getfloat(section, option, default=<object object>)
getintdict(section)
set(section, option, value=None)
has_option(section, option)
Module contents
luigi.configuration.add_config_path(path)
Select config parser by file extension and add path into parser.
luigi.configuration.get_config(parser=’cfg’)
Get configs singleton for parser
class luigi.configuration.LuigiConfigParser(defaults=None, dict_type=<class 'collections.OrderedDict'>, allow_no_value=False)
Bases: luigi.configuration.base_parser.BaseParser, ConfigParser.ConfigParser
NO_DEFAULT = <object object>
enabled = True
classmethod reload()
has_option(section, option)
modified has_option Check for the existence of a given option in a given section. If the specified ‘section’
is None or an empty string, DEFAULT is assumed. If the specified ‘section’ does not exist, returns False.
get(section, option, default=<object object>, **kwargs)
getboolean(section, option, default=<object object>)
getint(section, option, default=<object object>)
getfloat(section, option, default=<object object>)
getintdict(section)
set(section, option, value=None)
class luigi.configuration.LuigiTomlParser
Bases: luigi.configuration.base_parser.BaseParser
NO_DEFAULT = <object object>
enabled = False
data = {}
read(config_paths)
get(section, option, default=<object object>, **kwargs)
getboolean(section, option, default=<object object>)
getint(section, option, default=<object object>)
getfloat(section, option, default=<object object>)
getintdict(section)
set(section, option, value=None)
has_option(section, option)
luigi.contrib package
Subpackages
luigi.contrib.hdfs package
Submodules
luigi.contrib.hdfs.abstract_client module
rename_dont_move(path, dest)
Override this method with an implementation that uses rename2, which is a rename operation that never
moves.
rename2 - https://github.com/apache/hadoop/blob/ae91b13/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java
(lines 483-523)
remove(path, recursive=True, skip_trash=False)
Remove file or directory at location path
Parameters
• path (str) – a path within the FileSystem to remove.
• recursive (bool) – if the path is a directory, recursively remove the directory and all
of its descendants. Defaults to True.
chmod(path, permissions, recursive=False)
chown(path, owner, group, recursive=False)
count(path)
Count contents in a directory
copy(path, destination)
Copy a file or a directory with contents. Currently, LocalFileSystem and MockFileSystem support only
single file copying but S3Client copies either a file or a directory as required.
put(local_path, destination)
get(path, local_destination)
mkdir(path, parents=True, raise_if_exists=False)
Create directory at location path
Creates the directory at path and implicitly create parent directories if they do not already exist.
Parameters
• path (str) – a path within the FileSystem to create as a directory.
• parents (bool) – Create parent directories when necessary. When parents=False and
the parent directory doesn’t exist, raise luigi.target.MissingParentDirectory
• raise_if_exists (bool) – raise luigi.target.FileAlreadyExists if the folder already
exists.
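The interaction between parents and raise_if_exists is easiest to see on a local directory. The sketch below mimics the documented semantics with the os module; mkdir_sketch is a hypothetical helper, and the built-in FileExistsError and FileNotFoundError stand in for the luigi.target exceptions named above.

```python
import os
import tempfile


def mkdir_sketch(path, parents=True, raise_if_exists=False):
    """Local-filesystem analogue of the mkdir contract (illustrative only)."""
    if os.path.exists(path):
        if raise_if_exists:
            raise FileExistsError(path)  # stand-in for FileAlreadyExists
        return
    if parents:
        os.makedirs(path)  # creates missing parent directories as well
        return
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        raise FileNotFoundError(parent)  # stand-in for MissingParentDirectory
    os.mkdir(path)


base = tempfile.mkdtemp()
nested = os.path.join(base, "a", "b")
mkdir_sketch(nested)  # creates both "a" and "a/b"
mkdir_sketch(nested)  # already exists, silently accepted by default
print(os.path.isdir(nested))
```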
listdir(path, ignore_directories=False, ignore_files=False, include_size=False, include_type=False,
include_time=False, recursive=False)
Return a list of files rooted in path.
This returns an iterable of the files rooted at path. This is intended to be a recursive listing.
Parameters path (str) – a path within the FileSystem to list.
Note: This method is optional; not all FileSystem subclasses implement it.
touchz(path)
luigi.contrib.hdfs.clients module
The implementations of the hdfs clients. The hadoop cli client and the snakebite client.
luigi.contrib.hdfs.clients.get_autoconfig_client(client_cache=<thread._local object>)
Creates the client as specified in the luigi.cfg configuration.
luigi.contrib.hdfs.clients.exists(*args, **kwargs)
luigi.contrib.hdfs.clients.rename(*args, **kwargs)
luigi.contrib.hdfs.clients.remove(*args, **kwargs)
luigi.contrib.hdfs.clients.mkdir(*args, **kwargs)
luigi.contrib.hdfs.clients.listdir(*args, **kwargs)
luigi.contrib.hdfs.config module
You can choose which client to use by setting the “client” config under the “hdfs” section in the configuration, or by
using the --hdfs-client command line option. “hadoopcli” is the slowest, but should work out of the box.
“snakebite” is the fastest, but requires Snakebite to be installed.
class luigi.contrib.hdfs.config.hdfs(*args, **kwargs)
Bases: luigi.task.Config
client_version = IntParameter (defaults to None)
effective_user = OptionalParameter (defaults to None): Optionally specifies the effect
snakebite_autoconfig = BoolParameter (defaults to False)
namenode_host = OptionalParameter (defaults to None)
namenode_port = IntParameter (defaults to None)
client = Parameter (defaults to hadoopcli)
tmp_dir = OptionalParameter (defaults to None)
class luigi.contrib.hdfs.config.hadoopcli(*args, **kwargs)
Bases: luigi.task.Config
command = Parameter (defaults to hadoop): The hadoop command, will run split() on it,
version = Parameter (defaults to cdh4): Can also be cdh3 or apache1
luigi.contrib.hdfs.config.load_hadoop_cmd()
luigi.contrib.hdfs.config.get_configured_hadoop_version()
CDH4 (hadoop 2+) has a slightly different syntax for interacting with hdfs via the command line.
The default version is CDH4, but one can override this setting with “cdh3” or “apache1” in the hadoop section
of the config in order to use the old syntax.
luigi.contrib.hdfs.config.get_configured_hdfs_client()
This is a helper that fetches the configuration value for ‘client’ in the [hdfs] section. It will return the client that
retains backwards compatibility when ‘client’ isn’t configured.
luigi.contrib.hdfs.config.tmppath(path=None, include_unix_username=True)
Parameters
• path (str) – target path for which a temporary location needs to be generated.
• include_unix_username (bool) –
Return type str
Note that include_unix_username might work on windows too.
luigi.contrib.hdfs.error module
The implementations of the hdfs clients. The hadoop cli client and the snakebite client.
exception luigi.contrib.hdfs.error.HDFSCliError(command, returncode, stdout, stderr)
Bases: exceptions.Exception
luigi.contrib.hdfs.format module
exception luigi.contrib.hdfs.format.HdfsAtomicWriteError
Bases: exceptions.IOError
class luigi.contrib.hdfs.format.HdfsReadPipe(path)
Bases: luigi.format.InputPipeProcessWrapper
class luigi.contrib.hdfs.format.HdfsAtomicWritePipe(path)
Bases: luigi.format.OutputPipeProcessWrapper
File like object for writing to HDFS
The referenced file is first written to a temporary location and then renamed to final location on close(). If close()
isn’t called the temporary file will be cleaned up when this object is garbage collected
TODO: if this is buggy, change it so it first writes to a local temporary file and then uploads it on completion
abort()
close()
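The write-to-temporary-then-rename pattern described for this class can be illustrated on a local file system. The sketch below is an analogue using tempfile and os.replace, not the HDFS implementation; AtomicLocalWriteSketch and the file names are hypothetical.

```python
import os
import tempfile


class AtomicLocalWriteSketch:
    """Write to a temporary file and publish it on close() (illustrative)."""

    def __init__(self, path):
        self.path = path
        directory = os.path.dirname(os.path.abspath(path))
        # Keep the temporary file in the target directory so that the final
        # rename stays on one file system.
        fd, self._tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        self._file = os.fdopen(fd, "w")

    def write(self, data):
        self._file.write(data)

    def close(self):
        self._file.close()
        os.replace(self._tmp_path, self.path)  # the final path appears here

    def abort(self):
        self._file.close()
        os.remove(self._tmp_path)  # nothing is ever published


target_dir = tempfile.mkdtemp()
final_path = os.path.join(target_dir, "words.txt")
writer = AtomicLocalWriteSketch(final_path)
writer.write("hello\n")
visible_before_close = os.path.exists(final_path)
writer.close()
visible_after_close = os.path.exists(final_path)
print(visible_before_close, visible_after_close)
```

A reader that checks for the final path therefore never observes a half-written file, which is the property the atomic pipe is meant to provide.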
class luigi.contrib.hdfs.format.HdfsAtomicWriteDirPipe(path, data_extension='')
Bases: luigi.format.OutputPipeProcessWrapper
Writes a data<data_extension> file to a directory at <path>.
abort()
close()
class luigi.contrib.hdfs.format.PlainFormat
Bases: luigi.format.Format
input = 'bytes'
output = 'hdfs'
hdfs_writer(path)
hdfs_reader(path)
pipe_reader(path)
pipe_writer(output_pipe)
class luigi.contrib.hdfs.format.PlainDirFormat
Bases: luigi.format.Format
input = 'bytes'
output = 'hdfs'
hdfs_writer(path)
hdfs_reader(path)
pipe_reader(path)
pipe_writer(path)
class luigi.contrib.hdfs.format.CompatibleHdfsFormat(writer, reader, input=None)
Bases: luigi.format.Format
output = 'hdfs'
pipe_writer(output)
pipe_reader(input)
hdfs_writer(output)
hdfs_reader(input)
luigi.contrib.hdfs.hadoopcli_clients module
The implementations of the hdfs clients. The hadoop cli client and the snakebite client.
luigi.contrib.hdfs.hadoopcli_clients.create_hadoopcli_client()
Given that we want one of the hadoop cli clients (unlike snakebite), this one will return the right one.
class luigi.contrib.hdfs.hadoopcli_clients.HdfsClient
Bases: luigi.contrib.hdfs.abstract_client.HdfsFileSystem
This client uses Apache 2.x syntax for file system commands, which also matched CDH4.
recursive_listdir_cmd = ['-ls', '-R']
static call_check(command)
exists(path)
Use hadoop fs -stat to check file existence.
move(path, dest)
Move a file, as one would expect.
remove(path, recursive=True, skip_trash=False)
Remove file or directory at location path
Parameters
• path (str) – a path within the FileSystem to remove.
• recursive (bool) – if the path is a directory, recursively remove the directory and all
of its descendants. Defaults to True.
chmod(path, permissions, recursive=False)
chown(path, owner, group, recursive=False)
count(path)
Count contents in a directory
copy(path, destination)
Copy a file or a directory with contents. Currently, LocalFileSystem and MockFileSystem support only
single file copying but S3Client copies either a file or a directory as required.
put(local_path, destination)
get(path, local_destination)
getmerge(path, local_destination, new_line=False)
luigi.contrib.hdfs.snakebite_client module
An HDFS client using Snakebite. Since Snakebite has a Python API, it is about 100 times faster than the hadoop
cli client, which shells out to a Java program on each file system operation.
static list_path(path)
get_bite()
If Luigi has forked, we have a different PID, and need to reconnect.
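Reconnecting after a fork usually comes down to remembering which process created the connection. The sketch below shows that idea in isolation; PidAwareConnectionSketch and the connect callback are hypothetical and no Snakebite API is involved.

```python
import os


class PidAwareConnectionSketch:
    """Cache a connection, but rebuild it when the process ID changes."""

    def __init__(self, connect):
        self._connect = connect  # zero-argument factory for a new connection
        self._pid = None
        self._connection = None

    def get(self):
        pid = os.getpid()
        if self._connection is None or self._pid != pid:
            # First use, or we are now running in a forked child process.
            self._connection = self._connect()
            self._pid = pid
        return self._connection


connections = []


def connect():
    connections.append(object())  # placeholder for a real client object
    return connections[-1]


client = PidAwareConnectionSketch(connect)
first = client.get()
second = client.get()
print(first is second, len(connections))
```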
exists(path)
Use snakebite.test to check file existence.
Parameters path (string) – path to test
Returns boolean, True if path exists in HDFS
move(path, dest)
Use snakebite.rename, if available.
Parameters
• path (either a string or sequence of strings) – source file(s)
• dest (string) – destination file (single input) or directory (multiple)
Returns list of renamed items
rename_dont_move(path, dest)
Use snakebite.rename_dont_move, if available.
Parameters
• path (string) – source path (single input)
• dest (string) – destination path
Returns True if succeeded
Raises snakebite.errors.FileAlreadyExistsException
remove(path, recursive=True, skip_trash=False)
Use snakebite.delete, if available.
Parameters
• path (either a string or a sequence of strings) – delete-able file(s)
or directory(ies)
• recursive (boolean, default is True) – delete directories trees like *nix: rm
-r
• skip_trash (boolean, default is False (use trash)) – do or don’t
move deleted items into the trash first
Returns list of deleted items
chmod(path, permissions, recursive=False)
Use snakebite.chmod, if available.
Parameters
• path (either a string or sequence of strings) – update-able file(s)
• permissions (octal) – *nix style permission number
• recursive (boolean, default is False) – change just listed entry(ies) or all
in directories
Returns list of all changed items
luigi.contrib.hdfs.target module
copy(dst_dir)
Copy to destination directory.
is_writable()
Currently only works with hadoopcli
class luigi.contrib.hdfs.target.HdfsFlagTarget(path, format=None, client=None,
flag=’_SUCCESS’)
Bases: luigi.contrib.hdfs.target.HdfsTarget
Defines a target directory with a flag-file (defaults to _SUCCESS) used to signify job success.
This checks for two things:
• the path exists (just like the HdfsTarget)
• the _SUCCESS file exists within the directory.
Because Hadoop outputs into a directory and not a single file, the path is assumed to be a directory.
Initializes a HdfsFlagTarget.
Parameters
• path (str) – the directory where the files are stored.
• client –
• flag (str) –
exists()
Returns True if the path for this FileSystemTarget exists; False otherwise.
This method is implemented by using fs.
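The two-part existence check can be mirrored on a local directory. The sketch below assumes nothing about the HDFS client; flag_target_exists is a hypothetical helper that applies the same rule with os.path.

```python
import os
import tempfile


def flag_target_exists(directory, flag="_SUCCESS"):
    """True only if the directory exists and contains the flag file."""
    return os.path.exists(directory) and os.path.exists(
        os.path.join(directory, flag)
    )


job_output = tempfile.mkdtemp()
before_flag = flag_target_exists(job_output)

# Simulate a finished job by creating the empty flag file.
with open(os.path.join(job_output, "_SUCCESS"), "w"):
    pass
after_flag = flag_target_exists(job_output)
print(before_flag, after_flag)
```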
luigi.contrib.hdfs.webhdfs_client module
A luigi file system client that wraps around the hdfs-library (a webhdfs client)
This is a sensible, fast alternative to snakebite, in particular for Python 3 users, where snakebite is not supported at
the time of writing (Dec 2015).
Note: This wrapper client is not feature complete yet. As with most software, the authors only implemented the features
they needed. If you need to wrap more of the file system operations, please do so and contribute back.
class luigi.contrib.hdfs.webhdfs_client.webhdfs(*args, **kwargs)
Bases: luigi.task.Config
port = IntParameter (defaults to 50070): Port for webhdfs
user = Parameter (defaults to ): Defaults to $USER envvar
client_type = ChoiceParameter (defaults to insecure): Type of hdfs client to use. Choi
class luigi.contrib.hdfs.webhdfs_client.WebHdfsClient(host=None, port=None,
user=None,
client_type=None)
Bases: luigi.contrib.hdfs.abstract_client.HdfsFileSystem
A WebHDFS client that tries to conform to Luigi’s interface for file existence.
The library is using this api.
url
client
walk(path, depth=1)
exists(path)
Returns true if the path exists and false otherwise.
upload(hdfs_path, local_path, overwrite=False)
download(hdfs_path, local_path, overwrite=False, n_threads=-1)
remove(hdfs_path, recursive=True, skip_trash=False)
Remove file or directory at location path
Parameters
• path (str) – a path within the FileSystem to remove.
• recursive (bool) – if the path is a directory, recursively remove the directory and all
of its descendants. Defaults to True.
read(hdfs_path, offset=0, length=None, buffer_size=None, chunk_size=1024, buffer_char=None)
move(path, dest)
Move a file, as one would expect.
mkdir(path, parents=True, mode=493, raise_if_exists=False)
Has no return value (just like WebHDFS)
chmod(path, permissions, recursive=False)
Raise a NotImplementedError exception.
chown(path, owner, group, recursive=False)
Raise a NotImplementedError exception.
count(path)
Raise a NotImplementedError exception.
copy(path, destination)
Raise a NotImplementedError exception.
put(local_path, destination)
Restricted version of upload
get(path, local_destination)
Restricted version of download
listdir(path, ignore_directories=False, ignore_files=False, include_size=False, include_type=False,
include_time=False, recursive=False)
Return a list of files rooted in path.
This returns an iterable of the files rooted at path. This is intended to be a recursive listing.
Parameters path (str) – a path within the FileSystem to list.
Note: This method is optional; not all FileSystem subclasses implement it.
touchz(path)
Implements touchz using the WebHDFS “write” command.
Module contents
Provides access to HDFS using the HdfsTarget, a subclass of Target. You can choose which client to use by setting the
“client” config under the “hdfs” section in the configuration, or by using the --hdfs-client command line option.
“hadoopcli” is the slowest, but should work out of the box. “snakebite” is the fastest, but requires Snakebite to be
installed.
Since the hdfs functionality is quite big in luigi, it’s split into smaller files under luigi/contrib/hdfs/*.py.
But for the sake of convenience and API stability, everything is reexported under luigi.contrib.hdfs.
Submodules
luigi.contrib.azureblob module
luigi.contrib.batch module
luigi.contrib.beam_dataflow module
class luigi.contrib.beam_dataflow.DataflowParamKeys
Bases: object
Defines the naming conventions for Dataflow execution params. For example, the Java API expects param
names in lower camel case, whereas the Python implementation expects snake case.
runner
project
zone
region
staging_location
temp_location
gcp_temp_location
num_workers
autoscaling_algorithm
max_num_workers
disk_size_gb
worker_machine_type
worker_disk_type
job_name
service_account
network
subnetwork
labels
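The difference between the two naming conventions can be shown with a small conversion helper. This is only an illustration of snake case versus lower camel case: to_lower_camel_case is a hypothetical function, and the resulting names are not checked against any particular SDK.

```python
def to_lower_camel_case(snake_name):
    """Convert a snake_case name such as temp_location to tempLocation."""
    head, *rest = snake_name.split("_")
    return head + "".join(part.capitalize() for part in rest)


# A few of the attribute names listed above, in snake case.
PYTHON_STYLE_KEYS = [
    "temp_location",
    "max_num_workers",
    "worker_machine_type",
    "runner",
]

java_style_keys = {key: to_lower_camel_case(key) for key in PYTHON_STYLE_KEYS}
print(java_style_keys)
```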
class luigi.contrib.beam_dataflow.BeamDataflowJobTask
Bases: luigi.task.MixinNaiveBulkComplete, luigi.task.Task
Luigi wrapper for a Dataflow job. Must be overridden for each Beam SDK with that SDK’s
dataflow_executable().
For more documentation, see: https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
The following required Dataflow properties must be set:
project # GCP project ID
temp_location # Cloud storage path for temporary files
The following optional Dataflow properties can be set:
runner # PipelineRunner implementation for your Beam job.
    Default: DirectRunner
num_workers # The number of workers to start the task with.
    Default: Determined by Dataflow service
autoscaling_algorithm # The Autoscaling mode for the Dataflow job.
    Default: THROUGHPUT_BASED
max_num_workers # Used if the autoscaling is enabled.
    Default: Determined by Dataflow service
network # Network in GCE to be used for launching workers.
    Default: a network named “default”
subnetwork # Subnetwork in GCE to be used for launching workers.
    Default: Determined by Dataflow service
disk_size_gb # Remote worker disk size. Minimum value is 30GB.
    Default: set to 0 to use GCP project default
worker_machine_type # Machine type to create Dataflow worker VMs.
    Default: Determined by Dataflow service
job_name # Custom job name, must be unique across project’s active jobs
worker_disk_type # Specify SSD for local disk or defaults to hard disk as a full URL of disk type resource.
    Default: Determined by Dataflow service.
service_account # Service account of Dataflow VMs/workers.
    Default: active GCE service account
region # Region to deploy Dataflow job to.
    Default: us-central1
zone # Availability zone for launching workers instances.
    Default: an available zone in the specified region
staging_location # Cloud Storage bucket for Dataflow to stage binary files.
    Default: the value of temp_location
gcp_temp_location # Cloud Storage path for Dataflow to stage temporary files.
    Default: the value of temp_location
labels # Custom GCP labels attached to the Dataflow job.
    Default: nothing
project = None
runner = None
temp_location = None
staging_location = None
gcp_temp_location = None
num_workers = None
autoscaling_algorithm = None
max_num_workers = None
network = None
subnetwork = None
disk_size_gb = None
worker_machine_type = None
job_name = None
worker_disk_type = None
service_account = None
zone = None
region = None
labels = {}
cmd_line_runner
alias of _CmdLineRunner
dataflow_params = None
dataflow_executable()
Command representing the Dataflow executable to be run. For example:
return ['java', 'com.spotify.luigi.MyClass', '-Xmx256m']
args()
Extra String arguments that will be passed to your Dataflow job. For example:
return ['--setup_file=setup.py']
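To make the roles of dataflow_executable() and args() concrete, the sketch below assembles a command line from an executable, a set of properties, and extra arguments. The build_command helper, the --key=value flag format, and the project and bucket names are assumptions for illustration; the actual command built by the task may differ.

```python
def build_command(executable, params, extra_args):
    """Join executable, --key=value flags for set params, and extra args."""
    flags = [
        "--{}={}".format(key, value)
        for key, value in params.items()
        if value is not None  # unset properties are left out
    ]
    return list(executable) + flags + list(extra_args)


command = build_command(
    ["java", "com.spotify.luigi.MyClass", "-Xmx256m"],
    {
        "project": "my-project",                # hypothetical project ID
        "temp_location": "gs://my-bucket/tmp",  # hypothetical bucket
        "runner": None,
    },
    ["--setup_file=setup.py"],
)
print(command)
```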
before_run()
Hook that gets called right before the Dataflow job is launched. Can be used to setup any temporary
files/tables, validate input, etc.
on_successful_run()
Callback that gets called right after the Dataflow job has finished successfully but before validate_output
is run.
validate_output()
Callback that can be used to validate your output before it is moved to its final location. Returning false
here will cause the job to fail, and output to be removed instead of published.
file_pattern()
If one or more of the input target files do not follow the part-* pattern, add the key of the affected target
and the correct file pattern to append on the command line here. If an input target key is not
found in this dict, the file pattern is assumed to be part-* for that target.
Returns a dictionary of overriding file patterns (other than part-*) for the inputs
on_successful_output_validation()
Callback that gets called after the Dataflow job has finished successfully if validate_output returns True.
cleanup_on_error(error)
Callback that gets called after the Dataflow job has finished unsuccessfully, or validate_output returns
False.
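The order in which the hooks fire, as described in the docstrings above, can be sketched with a stand-in class that only records calls. HookOrderSketch and its launch_job method are hypothetical; no Dataflow job is started.

```python
class HookOrderSketch:
    """Records the hook order described above (illustrative only)."""

    def __init__(self, job_succeeds=True, output_valid=True):
        self.job_succeeds = job_succeeds
        self.output_valid = output_valid
        self.calls = []

    def before_run(self):
        self.calls.append("before_run")

    def launch_job(self):  # stand-in for running the Dataflow executable
        self.calls.append("launch_job")
        if not self.job_succeeds:
            raise RuntimeError("job failed")

    def on_successful_run(self):
        self.calls.append("on_successful_run")

    def validate_output(self):
        self.calls.append("validate_output")
        return self.output_valid

    def on_successful_output_validation(self):
        self.calls.append("on_successful_output_validation")

    def cleanup_on_error(self, error):
        self.calls.append("cleanup_on_error")

    def run(self):
        self.before_run()
        try:
            self.launch_job()
        except RuntimeError as error:
            self.cleanup_on_error(error)
            return
        self.on_successful_run()
        if self.validate_output():
            self.on_successful_output_validation()
        else:
            self.cleanup_on_error(ValueError("output validation failed"))


succeeded = HookOrderSketch()
succeeded.run()
print(succeeded.calls)
```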
run()
The task run method, to be overridden in a subclass.
See Task.run
static get_target_path(target)
Given a luigi Target, determine a stringly typed path to pass as a Dataflow job argument.
luigi.contrib.bigquery module
class luigi.contrib.bigquery.CreateDisposition
Bases: object
CREATE_IF_NEEDED = 'CREATE_IF_NEEDED'
CREATE_NEVER = 'CREATE_NEVER'
class luigi.contrib.bigquery.WriteDisposition
Bases: object
WRITE_TRUNCATE = 'WRITE_TRUNCATE'
WRITE_APPEND = 'WRITE_APPEND'
WRITE_EMPTY = 'WRITE_EMPTY'
class luigi.contrib.bigquery.QueryMode
Bases: object
INTERACTIVE = 'INTERACTIVE'
BATCH = 'BATCH'
class luigi.contrib.bigquery.SourceFormat
Bases: object
AVRO = 'AVRO'
CSV = 'CSV'
DATASTORE_BACKUP = 'DATASTORE_BACKUP'
NEWLINE_DELIMITED_JSON = 'NEWLINE_DELIMITED_JSON'
class luigi.contrib.bigquery.FieldDelimiter
Bases: object
The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use
a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to
ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state.
BigQuery also supports the escape sequence “\t” to specify a tab separator. The default value is a comma (‘,’).
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load
COMMA = ','
TAB = '\t'
PIPE = '|'
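The "first byte of the ISO-8859-1 encoded string" rule quoted above can be reproduced locally. The sketch below only illustrates that rule on a raw byte string; effective_delimiter is a hypothetical helper and no BigQuery call is made.

```python
def effective_delimiter(delimiter):
    """Encode to ISO-8859-1 and keep only the first byte."""
    return delimiter.encode("iso-8859-1")[:1]


raw_line = b"alice|30|berlin"
fields = raw_line.split(effective_delimiter("|"))
print(fields)

# A character in the range 128-255 still maps to a single byte.
print(effective_delimiter("\u00e9"))
```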
class luigi.contrib.bigquery.PrintHeader
Bases: object
TRUE = True
FALSE = False
class luigi.contrib.bigquery.DestinationFormat
Bases: object
AVRO = 'AVRO'
CSV = 'CSV'
NEWLINE_DELIMITED_JSON = 'NEWLINE_DELIMITED_JSON'
class luigi.contrib.bigquery.Compression
Bases: object
GZIP = 'GZIP'
NONE = 'NONE'
class luigi.contrib.bigquery.Encoding
Bases: object
[Optional] The character encoding of the data. The supported values are UTF-8 or ISO-8859-1. The default
value is UTF-8.
BigQuery decodes the data after the raw, binary data has been split using the values of the quote and
fieldDelimiter properties.
UTF_8 = 'UTF-8'
ISO_8859_1 = 'ISO-8859-1'
class luigi.contrib.bigquery.BQDataset(project_id, dataset_id, location)
Bases: tuple
Create new instance of BQDataset(project_id, dataset_id, location)
dataset_id
Alias for field number 1
location
Alias for field number 2
project_id
Alias for field number 0
class luigi.contrib.bigquery.BQTable
Bases: luigi.contrib.bigquery.BQTable
Create new instance of BQTable(project_id, dataset_id, table_id, location)
dataset
uri
If the output table exists, it is replaced with the supplied view query. Otherwise a new table is created with
this view.
Parameters
• table (BQTable) – The table to contain the view.
• view (str) – The SQL query for the view.
run_job(project_id, body, dataset=None)
Runs a BigQuery “job”. See the documentation for the format of body.
Note: You probably don’t need to use this directly. Use the tasks defined below.
encoding
The encoding of the data that is going to be loaded (see Encoding).
write_disposition
What to do if the table already exists. By default this will fail the job.
See WriteDisposition
schema
Schema in the format defined at
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.schema.
If the value is falsy, it is omitted and inferred by BigQuery.
max_bad_records
The maximum number of bad records that BigQuery can ignore when reading data.
If the number of bad records exceeds this value, an invalid error is returned in the job result.
field_delimiter
The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character.
source_uris()
The fully-qualified URIs that point to your data in Google Cloud Storage.
Each URI can contain one ‘*’ wildcard character and it must come after the ‘bucket’ name.
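The wildcard rule stated above (at most one ‘*’, and only after the bucket name) can be checked before a job is submitted. The sketch below is a hypothetical pre-flight validator using urllib.parse; is_valid_source_uri and the bucket names are not part of Luigi or BigQuery.

```python
from urllib.parse import urlsplit


def is_valid_source_uri(uri):
    """Check the gs:// scheme and the single-wildcard-after-bucket rule."""
    parts = urlsplit(uri)
    if parts.scheme != "gs" or not parts.netloc:
        return False
    if "*" in parts.netloc:  # wildcard inside the bucket name
        return False
    return parts.path.count("*") <= 1


print(is_valid_source_uri("gs://my-bucket/data/part-*.csv"))
print(is_valid_source_uri("gs://my-*/data.csv"))
```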
skip_leading_rows
The number of rows at the top of a CSV file that BigQuery will skip when loading the data.
The default value is 0. This property is useful if you have header rows in the file that should be skipped.
allow_jagged_rows
Accept rows that are missing trailing optional columns. The missing values are treated as nulls.
If false, records with missing trailing columns are treated as bad records, and if there are too many bad
records, an invalid error is returned in the job result. The default value is false. Only applicable to CSV,
ignored for other formats.
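How allow_jagged_rows interacts with max_bad_records can be simulated on a small CSV string. The sketch below is a local model of the documented behaviour, not BigQuery's loader; load_rows is a hypothetical helper and ValueError stands in for the invalid error in the job result.

```python
import csv
import io


def load_rows(text, column_count, allow_jagged_rows=False, max_bad_records=0):
    """Pad short rows with None when allowed, else count them as bad."""
    good, bad = [], 0
    for row in csv.reader(io.StringIO(text)):
        if len(row) == column_count:
            good.append(row)
        elif len(row) < column_count and allow_jagged_rows:
            good.append(row + [None] * (column_count - len(row)))
        else:
            bad += 1
    if bad > max_bad_records:
        raise ValueError(
            "invalid: {} bad records exceed limit {}".format(bad, max_bad_records)
        )
    return good


data = "a,1,x\nb,2\nc,3,z\n"
padded = load_rows(data, column_count=3, allow_jagged_rows=True)
print(padded)
```

With allow_jagged_rows left at False, the short row counts as a bad record and the load fails unless max_bad_records is at least 1.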
ignore_unknown_values
Indicates if BigQuery should allow extra values that are not represented in the table schema.
If true, the extra values are ignored. If false, records with extra columns are treated as bad records,
and if there are too many bad records, an invalid error is returned in the job result. The default value is
false.
The sourceFormat property determines what BigQuery treats as an extra value:
CSV: Trailing columns
JSON: Named values that don’t match any column names
allow_quoted_new_lines
Indicates if BigQuery should allow quoted data sections that contain newline characters in a CSV file. The
default value is false.
run()
The task run method, to be overridden in a subclass.
See Task.run
class luigi.contrib.bigquery.BigQueryRunQueryTask(*args, **kwargs)
Bases: luigi.contrib.bigquery.MixinBigQueryBulkComplete, luigi.task.Task
write_disposition
What to do if the table already exists. By default this will fail the job.
See WriteDisposition
create_disposition
Whether to create the table or not. See CreateDisposition
flatten_results
Flattens all nested and repeated fields in the query results. allowLargeResults must be true if this is set to
False.
query
The query, in text form.
query_mode
The query mode. See QueryMode.
udf_resource_uris
Iterator of code resource to load from a Google Cloud Storage URI (gs://bucket/path).
use_legacy_sql
Whether to use legacy SQL
run()
The task run method, to be overridden in a subclass.
See Task.run
class luigi.contrib.bigquery.BigQueryCreateViewTask(*args, **kwargs)
Bases: luigi.task.Task
Creates (or updates) a view in BigQuery.
The output of this task needs to be a BigQueryTarget. Instances of this class should specify the view SQL in the
view property.
If a view already exists in BigQuery at output(), it will be updated.
view
The SQL query for the view, in text form.
complete()
If the task has any outputs, return True if all outputs exist. Otherwise, return False.
However, you may freely override this method with custom logic.
run()
The task run method, to be overridden in a subclass.
See Task.run
class luigi.contrib.bigquery.ExternalBigQueryTask(*args, **kwargs)
Bases: luigi.contrib.bigquery.MixinBigQueryBulkComplete, luigi.task.ExternalTask
An external task for a BigQuery target.
class luigi.contrib.bigquery.BigQueryExtractTask(*args, **kwargs)
Bases: luigi.task.Task
Extracts (unloads) a table from BigQuery to GCS.
This task requires the input to be exactly one BigQueryTarget, while the output should be one or more GCSTargets
from luigi.contrib.gcs, depending on the use of the destinationUris property.
destination_uris
The fully-qualified URIs that point to your data in Google Cloud Storage. Each URI can contain one ‘*’
wildcard character and it must come after the ‘bucket’ name.
Wildcarded destinationUris in GCSQueryTarget might not be resolved correctly and can result in incomplete
data. If a GCSQueryTarget is used to pass wildcarded destinationUris, be sure to override this property to
suppress the warning.
print_header
Whether to print the header or not.
field_delimiter
The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character.
destination_format
The destination format to use (see DestinationFormat).
compression
Whether to use compression.
run()
The task run method, to be overridden in a subclass.
See Task.run
luigi.contrib.bigquery.BigqueryClient
alias of luigi.contrib.bigquery.BigQueryClient
luigi.contrib.bigquery.BigqueryTarget
alias of luigi.contrib.bigquery.BigQueryTarget
luigi.contrib.bigquery.MixinBigqueryBulkComplete
alias of luigi.contrib.bigquery.MixinBigQueryBulkComplete
luigi.contrib.bigquery.BigqueryLoadTask
alias of luigi.contrib.bigquery.BigQueryLoadTask
luigi.contrib.bigquery.BigqueryRunQueryTask
alias of luigi.contrib.bigquery.BigQueryRunQueryTask
luigi.contrib.bigquery.BigqueryCreateViewTask
alias of luigi.contrib.bigquery.BigQueryCreateViewTask
luigi.contrib.bigquery.ExternalBigqueryTask
alias of luigi.contrib.bigquery.ExternalBigQueryTask
luigi.contrib.bigquery_avro module
source_format = 'AVRO'
source_uris()
The fully-qualified URIs that point to your data in Google Cloud Storage.
Each URI can contain one ‘*’ wildcard character and it must come after the ‘bucket’ name.
run()
The task run method, to be overridden in a subclass.
See Task.run
luigi.contrib.datadog_metric module
luigi.contrib.dataproc module
wait_for_job()
class luigi.contrib.dataproc.DataprocSparkTask(*args, **kwargs)
Bases: luigi.contrib.dataproc.DataprocBaseTask
Runs a Spark job on your Dataproc cluster
main_class = Parameter
jars = Parameter (defaults to )
job_args = Parameter (defaults to )
run()
The task run method, to be overridden in a subclass.
See Task.run
class luigi.contrib.dataproc.DataprocPysparkTask(*args, **kwargs)
Bases: luigi.contrib.dataproc.DataprocBaseTask
Runs a PySpark job on your Dataproc cluster
job_file = Parameter
extra_files = Parameter (defaults to )
job_args = Parameter (defaults to )
run()
The task run method, to be overridden in a subclass.
See Task.run
class luigi.contrib.dataproc.CreateDataprocClusterTask(*args, **kwargs)
Bases: luigi.contrib.dataproc._DataprocBaseTask
Task for creating a Dataproc cluster.
gcloud_zone = Parameter (defaults to europe-west1-c)
gcloud_network = Parameter (defaults to default)
master_node_type = Parameter (defaults to n1-standard-2)
master_disk_size = Parameter (defaults to 100)
worker_node_type = Parameter (defaults to n1-standard-2)
worker_disk_size = Parameter (defaults to 100)
worker_normal_count = Parameter (defaults to 2)
worker_preemptible_count = Parameter (defaults to 0)
image_version = Parameter (defaults to )
complete()
If the task has any outputs, return True if all outputs exist. Otherwise, return False.
However, you may freely override this method with custom logic.
run()
The task run method, to be overridden in a subclass.
See Task.run
luigi.contrib.docker_runner module
mount_tmp
run()
The task run method, to be overridden in a subclass.
See Task.run
luigi.contrib.dropbox module
luigi.contrib.dropbox.accept_trailing_slash_in_existing_dirpaths(func)
luigi.contrib.dropbox.accept_trailing_slash(func)
class luigi.contrib.dropbox.DropboxClient(token, user_agent=’Luigi’)
Bases: luigi.target.FileSystem
Dropbox client for authentication, designed to be used by the DropboxTarget class.
Parameters token (str) – Dropbox Oauth2 Token. See DropboxTarget for more information
about generating a token
exists(path, *args, **kwargs)
Return True if file or directory at path exists, False otherwise
Parameters path (str) – a path within the FileSystem to check for existence.
remove(path, *args, **kwargs)
Remove file or directory at location path
Parameters
• path (str) – a path within the FileSystem to remove.
• recursive (bool) – if the path is a directory, recursively remove the directory and all
of its descendants. Defaults to True.
mkdir(path, *args, **kwargs)
Create directory at location path
Creates the directory at path and implicitly create parent directories if they do not already exist.
Parameters
• path (str) – a path within the FileSystem to create as a directory.
• parents (bool) – Create parent directories when necessary. When parents=False and
the parent directory doesn’t exist, raise luigi.target.MissingParentDirectory
• raise_if_exists (bool) – raise luigi.target.FileAlreadyExists if the folder already
exists.
isdir(path, *args, **kwargs)
Return True if the location at path is a directory. If not, return False.
Parameters path (str) – a path within the FileSystem to check as a directory.
Note: This method is optional, not all FileSystem subclasses implement it.
listdir(path, *args, **kwargs)
Return a list of files rooted in path.
This returns an iterable of the files rooted at path. This is intended to be a recursive listing.
Parameters path (str) – a path within the FileSystem to list.
Note: This method is optional, not all FileSystem subclasses implement it.
move(path, *args, **kwargs)
Move a file, as one would expect.
copy(path, *args, **kwargs)
Copy a file or a directory with contents. Currently, LocalFileSystem and MockFileSystem support only
single file copying but S3Client copies either a file or a directory as required.
download_as_bytes(path)
upload(tmp_path, dest_path)
class luigi.contrib.dropbox.ReadableDropboxFile(path, client)
Bases: object
Represents a file inside the Dropbox cloud which will be read
Parameters
• path (str) – Dropbox path of the file to be read (always starting with /)
• client (DropboxClient) – a DropboxClient object (initialized with a valid token)
read()
close()
readable()
writable()
seekable()
class luigi.contrib.dropbox.AtomicWritableDropboxFile(path, client)
Bases: luigi.target.AtomicLocalFile
Represents a file that will be created inside the Dropbox cloud
Parameters
• path (str) – Destination path inside Dropbox
• client (DropboxClient) – a DropboxClient object (initialized with a valid token, for
the desired account)
move_to_final_destination()
After editing the file locally, this function uploads it to the Dropbox cloud
class luigi.contrib.dropbox.DropboxTarget(path, token, format=None, user_agent=’Luigi’)
Bases: luigi.target.FileSystemTarget
A Dropbox filesystem target.
Create a Dropbox target for storing data in a dropbox.com account
About the path parameter
The path must start with ‘/’ and should not end with ‘/’ (even if it is a directory). The path must not contain
adjacent slashes (‘/files//img.jpg’ is an invalid path)
If the app has ‘App folder’ access, then / will refer to this app folder (which means that there is no need to
prepend the name of the app to the path). Otherwise, if the app has ‘full access’, then / will refer to the root of
the Dropbox folder.
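The path rules above (leading slash, no trailing slash, no adjacent slashes) are easy to check before constructing a target. This is a minimal sketch; is_valid_dropbox_path is a hypothetical helper and is not provided by luigi.contrib.dropbox.

```python
def is_valid_dropbox_path(path):
    """Apply the three path rules described above (hypothetical helper)."""
    return (
        path.startswith("/")        # must start with '/'
        and not path.endswith("/")  # should not end with '/', even for directories
        and "//" not in path        # must not contain adjacent slashes
    )
```

Note that under these rules the bare root path "/" is rejected as well, because it ends with a slash.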
About the token parameter:
The Dropbox target requires a valid OAuth2 token as a parameter (which means that a Dropbox API app must
be created. This app can have ‘App folder’ access or ‘Full Dropbox’, as desired).
Information about generating the token can be read here:
• https://dropbox-sdk-python.readthedocs.io/en/latest/api/oauth.html#dropbox.oauth.DropboxOAuth2Flow
• https://blogs.dropbox.com/developers/2014/05/generate-an-access-token-for-your-own-account/
Parameters
• path (str) – Remote path in Dropbox (starting with ‘/’).
• token (str) – a valid OAuth2 Dropbox token.
• format (luigi.Format) – the luigi format to use (e.g. luigi.format.Nop)
fs
temporary_path(**kwds)
A context manager that enables a reasonably short, general and magic-less way to solve the Atomic Writes
Problem.
• On entering, it will create the parent directories so the temporary_path is writeable right away. This
step uses FileSystem.mkdir().
• On exiting, it will move the temporary file if there was no exception thrown. This step uses
FileSystem.rename_dont_move()
The file system operations will be carried out by calling them on fs.
The typical use case looks like this:
class MyTask(luigi.Task):
    def output(self):
        return MyFileSystemTarget(...)

    def run(self):
        with self.output().temporary_path() as self.temp_output_path:
            run_some_external_command(output_path=self.temp_output_path)
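The enter/exit behaviour described for temporary_path can be illustrated with the standard library alone. The sketch below is not Luigi's implementation: it works on the local filesystem, uses os.replace in place of the FileSystem rename call, and the function name and the ".luigi-tmp-" prefix are assumptions made for the example.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_path(final_path):
    """Yield a temporary path; move it to final_path only if no exception occurs."""
    parent = os.path.dirname(final_path) or "."
    os.makedirs(parent, exist_ok=True)  # on entering: create parent directories
    fd, tmp_path = tempfile.mkstemp(dir=parent, prefix=".luigi-tmp-")
    os.close(fd)
    try:
        yield tmp_path
    except BaseException:
        # On failure: discard the partial output and leave final_path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    else:
        # On success: rename into place (atomic within one filesystem).
        os.replace(tmp_path, final_path)
```

Because the temporary file is created next to the final path, the rename never crosses filesystems, which is what keeps the final step atomic.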
open(mode)
Open the FileSystem target.
This method returns a file-like object which can either be read from or written to depending on the specified
mode.
Parameters mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas
w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
Using b is not supported; initialize with format=Nop instead.
luigi.contrib.ecs module
To use ECS, you create a taskDefinition JSON that defines the docker run command for one or more containers in a
task or service, and then submit this JSON to the API to run the task.
This boto3-powered wrapper allows you to create Luigi Tasks to submit ECS taskDefinition s. You can either
pass a dict (mapping directly to the taskDefinition JSON) OR an Amazon Resource Name (arn) for a previously
registered taskDefinition.
Requires:
• boto3 package
• Amazon AWS credentials discoverable by boto3 (e.g., by using aws configure from awscli)
• A running ECS cluster (see ECS Get Started)
Written and maintained by Jake Feala (@jfeala) for Outlier Bio (@outlierbio)
class luigi.contrib.ecs.ECSTask(*args, **kwargs)
Bases: luigi.task.Task
Base class for an Amazon EC2 Container Service Task
Amazon ECS requires you to register “tasks”, which are JSON descriptions for how to issue the docker run
command. This Luigi Task can either run a pre-registered ECS taskDefinition, OR register the task on the fly
from a Python dict.
Parameters
• task_def_arn – pre-registered task definition ARN (Amazon Resource Name), of the
form:
arn:aws:ecs:<region>:<user_id>:task-definition/<family>:<tag>
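• task_def – dict describing the task in taskDefinition JSON format (used to register the task on the
fly), for example:
task_def = {
    'family': 'hello-world',
    'volumes': [],
    'containerDefinitions': [
        {
            'memory': 1,
            'essential': True,
            'name': 'hello-world',
            'image': 'ubuntu',
            'command': ['/bin/echo', 'hello world']
        }
    ]
}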
• cluster – str defining the ECS cluster to use. When this is not defined it will use the
default one.
task_def_arn = OptionalParameter (defaults to None)
task_def = OptionalParameter (defaults to None)
cluster = Parameter (defaults to default)
ecs_task_ids
Expose the ECS task ID
command
Command passed to the containers
Override to return list of dicts with keys ‘name’ and ‘command’, describing the container names and
commands to pass to the container. Directly corresponds to the overrides parameter of runTask API. For
example:
[
    {
        'name': 'myContainer',
        'command': ['/bin/sleep', '60']
    }
]
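To make the mapping to the runTask overrides parameter concrete, the sketch below wraps such a list in the containerOverrides structure that the ECS API expects. build_overrides is a hypothetical helper written for this illustration; it does not call boto3 and is not part of luigi.contrib.ecs.

```python
def build_overrides(commands):
    """Wrap a list of {'name', 'command'} dicts in the runTask overrides shape."""
    for entry in commands:
        missing = {"name", "command"} - set(entry)
        if missing:
            raise ValueError("container override is missing keys: %s" % sorted(missing))
    # Copy each entry so later changes to the input do not leak into the request.
    return {"containerOverrides": [dict(entry) for entry in commands]}


overrides = build_overrides([
    {"name": "myContainer", "command": ["/bin/sleep", "60"]},
])
```

The resulting dict is the kind of value one would pass as the overrides argument when submitting the task.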
run()
The task run method, to be overridden in a subclass.
See Task.run
luigi.contrib.esindex module
import luigi
from luigi.contrib.esindex import CopyToIndex

class ExampleIndex(CopyToIndex):
    index = 'example'

    def docs(self):
        return [{'_id': 1, 'title': 'An example document.'}]

if __name__ == '__main__':
    task = ExampleIndex()
    luigi.build([task], local_scheduler=True)
All options:
class ExampleIndex(CopyToIndex):
    host = 'localhost'
    port = 9200
    index = 'example'
    doc_type = 'default'
    purge_existing_index = True
    marker_index_hist_size = 1

    def docs(self):
        return [{'_id': 1, 'title': 'An example document.'}]

if __name__ == '__main__':
    task = ExampleIndex()
    luigi.build([task], local_scheduler=True)
[elasticsearch]
marker-index = update_log
marker-doc-type = entry
Usage:
1. Subclass and override the required index attribute.
2. Implement a custom docs method, that returns an iterable over the documents. A document can be a JSON
string, e.g. from a newline-delimited JSON (ldj) file (default implementation) or some dictionary.
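As an illustration of step 2, a docs method that reads newline-delimited JSON could delegate to a generator like the one below. This is a stand-alone sketch: docs_from_ldj and the in-memory sample are made up for the example and are not part of luigi.contrib.esindex.

```python
import io
import json


def docs_from_ldj(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# An in-memory stand-in for an ldj file.
sample = io.StringIO('{"_id": 1, "title": "first"}\n\n{"_id": 2, "title": "second"}\n')
documents = list(docs_from_ldj(sample))
```

Yielding documents lazily keeps memory use flat when the input file is large.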
Optional attributes:
• doc_type (default),
• host (localhost),
• port (9200),
• settings ({‘settings’: {}})
• mapping (None),
• chunk_size (2000),
• raise_on_error (True),
• purge_existing_index (False),
• marker_index_hist_size (0)
If settings are defined, they are only applied at index creation time.
host
ES hostname.
port
ES port.
http_auth
ES optional http auth information as either ‘:’ separated string or a tuple, e.g. (‘user’, ‘pass’) or
“user:pass”.
index
The target index.
May exist or not.
doc_type
The target doc_type.
mapping
Dictionary with custom mapping or None.
settings
Settings to be used at index creation time.
chunk_size
Single API call for this number of docs.
raise_on_error
Whether to fail fast.
purge_existing_index
Whether to delete the index completely before any indexing.
marker_index_hist_size
Number of event log entries in the marker index. 0: unlimited.
timeout
Timeout.
extra_elasticsearch_args
Extra arguments to pass to the Elasticsearch constructor
docs()
Return the documents to be indexed.
Beside the user defined fields, the document may contain an _index, _type and _id.
create_index()
Override to provide code for creating the target index.
By default it will be created without any special settings or mappings.
delete_index()
Delete the index, if it exists.
update_id()
This id will be a unique identifier for this indexing task.
output()
Returns an ElasticsearchTarget representing the inserted dataset.
Normally you don’t override this.
run()
Run task, namely:
• purge existing index, if requested (purge_existing_index),
• create the index, if missing,
• apply mappings, if given,
• set refresh interval to -1 (disable) for performance reasons,
• bulk index in batches of size chunk_size (2000),
• set refresh interval to 1s,
• refresh Elasticsearch,
• create entry in marker index.
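The batching step in the list above amounts to splitting the iterable returned by docs() into groups of chunk_size. The helper below is a generic sketch of that idea using only the standard library; it is not the code Luigi uses internally.

```python
from itertools import islice


def chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items from iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, chunk_size))
        if not batch:
            return
        yield batch
```

Each yielded batch would correspond to one bulk API call; the last batch may be shorter than chunk_size.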
luigi.contrib.external_daily_snapshot module
class luigi.contrib.external_daily_snapshot.ExternalDailySnapshot(*args, **kwargs)
Bases: luigi.task.ExternalTask
Abstract class containing a helper method to fetch the latest snapshot.
Example:
class MyTask(luigi.Task):
    def requires(self):
        return PlaylistContent.latest()
ServiceLogs.latest(service="radio", lookback=21)
date = DateParameter
classmethod latest(*args, **kwargs)
This is cached so that requires() is deterministic.
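The idea behind latest() with a lookback window can be sketched as a search backwards from today for the first date that has a snapshot. The function below is an illustration under stated assumptions: latest_snapshot, the set of available dates, and the default window of 14 days are all hypothetical and stand in for the real completeness checks.

```python
import datetime


def latest_snapshot(available_dates, today, lookback=14):
    """Return the most recent date within `lookback` days of `today` that has a snapshot."""
    for days_back in range(lookback):
        candidate = today - datetime.timedelta(days=days_back)
        if candidate in available_dates:
            return candidate
    return None  # nothing found inside the lookback window


available = {datetime.date(2020, 1, 1), datetime.date(2020, 1, 5)}
found = latest_snapshot(available, today=datetime.date(2020, 1, 7))
```

Caching the result, as the method description notes, matters because the answer would otherwise change as new snapshots appear between calls.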
luigi.contrib.external_program module
build_tracking_url(logs_output)
This method is intended for transforming a pattern match in the logs into a URL.
Parameters logs_output – Found match of self.tracking_url_pattern
Returns a tracking URL for the task
run()
The task run method, to be overridden in a subclass.
See Task.run
class luigi.contrib.external_program.ExternalProgramRunContext(proc)
Bases: object
kill_job(captured_signal=None, stack_frame=None)
exception luigi.contrib.external_program.ExternalProgramRunError(message, args, env=None,
stdout=None, stderr=None)
Bases: exceptions.RuntimeError
class luigi.contrib.external_program.ExternalPythonProgramTask(*args, **kwargs)
Bases: luigi.contrib.external_program.ExternalProgramTask
Template task for running an external Python program in a subprocess
Simple extension of ExternalProgramTask, adding two luigi.parameter.Parameter s for setting
a virtualenv and for extending the PYTHONPATH.
virtualenv = Parameter (defaults to None): path to the virtualenv directory to use. It
extra_pythonpath = Parameter (defaults to None): extend the search path for modules by
program_environment()
Override this method to control environment variables for the program
Returns dict mapping environment variable names to values
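For illustration, an environment that activates a virtualenv and extends PYTHONPATH might be assembled as follows. This is a sketch, not the library's implementation: build_program_environment is a hypothetical function, and the choice to set VIRTUAL_ENV and prepend the virtualenv's bin directory to PATH is an assumption about what "setting a virtualenv" means on a POSIX system.

```python
import os


def _prepend(value, existing):
    """Put value in front of an os.pathsep-separated list, if there is one."""
    return value + os.pathsep + existing if existing else value


def build_program_environment(base_env, virtualenv=None, extra_pythonpath=None):
    """Return a copy of base_env adjusted for a virtualenv and extra module paths."""
    env = dict(base_env)
    if extra_pythonpath:
        env["PYTHONPATH"] = _prepend(extra_pythonpath, env.get("PYTHONPATH"))
    if virtualenv:
        env["VIRTUAL_ENV"] = virtualenv
        env["PATH"] = _prepend(os.path.join(virtualenv, "bin"), env.get("PATH"))
    return env
```

An override of program_environment() would return a dict of this kind, typically starting from os.environ.copy().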
luigi.contrib.ftp module
This library is a wrapper of ftplib or pysftp. It is convenient to move data from/to (S)FTP servers.
There is an example of how to use it (example/ftp_experiment_outputs.py).
You can also find unit tests for each class.
Be aware that normal ftp does not provide secure communication.
class luigi.contrib.ftp.RemoteFileSystem(host, username=None, password=None,
port=None, tls=False, timeout=60, sftp=False,
pysftp_conn_kwargs=None)
Bases: luigi.target.FileSystem
exists(path, mtime=None)
Return True if file or directory at path exists, False otherwise.
Additional check on modified time when mtime is passed in.
Return False if the file’s modified time is older than mtime.
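The mtime check can be pictured with a small stand-alone function. In this sketch the remote server is replaced by a dict mapping paths to modification times; exists_newer_than is a hypothetical name, and treating a timestamp equal to mtime as fresh is an assumption of the example.

```python
import datetime


def exists_newer_than(modified_times, path, mtime=None):
    """Mimic exists(path, mtime) against a dict of path -> modification time."""
    if path not in modified_times:
        return False
    if mtime is None:
        return True  # plain existence check
    # Assumption: a file modified exactly at mtime still counts as existing.
    return modified_times[path] >= mtime


listing = {"/data/report.csv": datetime.datetime(2020, 1, 5, 12, 0)}
```

With this listing, asking for a file newer than 1 January 2020 succeeds, while asking for one newer than 1 February 2020 does not.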
remove(path, recursive=True)
Remove file or directory at location path.
Parameters
• path (str) – a path within the FileSystem to remove.
• recursive (bool) – if the path is a directory, recursively remove the directory and all
of its descendants. Defaults to True.
put(local_path, path, atomic=True)
Put file from local filesystem to (s)FTP.
get(path, local_path)
Download file from (s)FTP to local filesystem.
listdir(path=’.’)
Gets a list of the contents of path in (s)FTP
class luigi.contrib.ftp.AtomicFtpFile(fs, path)
Bases: luigi.target.AtomicLocalFile
Simple class that writes to a temp file and uploads to FTP on close().
Also cleans up the temp file if close is not invoked.
Initializes an AtomicFtpFile instance.
Parameters
• fs –
• path (str) –
move_to_final_destination()
fs
class luigi.contrib.ftp.RemoteTarget(path, host, format=None, username=None, password=None,
port=None, mtime=None, tls=False, timeout=60, sftp=False, pysftp_conn_kwargs=None)
Bases: luigi.target.FileSystemTarget
Target used for reading from remote files.
The target is implemented using intermediate files on the local system. On Python2, these files may not be
cleaned up.
fs
open(mode)
Open the FileSystem target.
This method returns a file-like object which can either be read from or written to depending on the specified
mode.
Parameters mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas
w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
exists()
Returns True if the path for this FileSystemTarget exists; False otherwise.
This method is implemented by using fs.
put(local_path, atomic=True)
get(local_path)
luigi.contrib.gcp module
luigi.contrib.gcp.get_authenticate_kwargs(oauth_credentials=None, http_=None)
Returns a dictionary with keyword arguments for use with discovery
Prioritizes oauth_credentials or an HTTP client provided by the user. If none is provided, falls back to default credentials
provided by Google’s command line utilities. If that also fails, tries using httplib2.Http().
Used by gcs.GCSClient and bigquery.BigQueryClient to initiate the API Client
luigi.contrib.gcs module
credentials = google.auth.jwt.Credentials.from_service_account_info(
    '012345678912-ThisIsARandomServiceAccountEmail@developer.gserviceaccount.com',
    'These are the contents of the p12 file that came with the service account',
    scope='https://www.googleapis.com/auth/devstorage.read_write')
client = GCSClient(oauth_credentials=credentials)
or uploading files.
Warning: By default this class will use “automated service discovery” which will require a connection to
the web. The google api client downloads a JSON file to “create” the library interface on the fly. If you want
a more hermetic build, you can pass the contents of this file (currently found at
https://www.googleapis.com/discovery/v1/apis/storage/v1/rest ) as the descriptor argument.
exists(path)
Return True if file or directory at path exists, False otherwise
Parameters path (str) – a path within the FileSystem to check for existence.
isdir(path)
Return True if the location at path is a directory. If not, return False.
Parameters path (str) – a path within the FileSystem to check as a directory.
Note: This method is optional, not all FileSystem subclasses implement it.
remove(path, recursive=True)
Remove file or directory at location path
Parameters
This method returns a file-like object which can either be read from or written to depending on the specified
mode.
Parameters mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas
w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
Using b is not supported; initialize with format=Nop instead.
class luigi.contrib.gcs.GCSFlagTarget(path, format=None, client=None, flag=’_SUCCESS’)
Bases: luigi.contrib.gcs.GCSTarget
Defines a target directory with a flag-file (defaults to _SUCCESS) used to signify job success.
This checks for two things:
• the path exists (just like the GCSTarget)
• the _SUCCESS file exists within the directory.
Because Hadoop outputs into a directory and not a single file, the path is assumed to be a directory.
This is meant to be a handy alternative to AtomicGCSFile.
The AtomicFile approach can be burdensome for GCS since there are no directories, per se.
If we have 1,000,000 output files, then we have to rename 1,000,000 objects.
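The two-part existence check can be sketched against a plain set of object names standing in for a bucket listing. flag_target_exists and the sample names are hypothetical; the real target performs these checks through its GCS client.

```python
def flag_target_exists(object_names, path, flag="_SUCCESS"):
    """Return True if objects exist under path and the flag object is among them."""
    directory = path if path.endswith("/") else path + "/"
    # Check 1: the "directory" exists, i.e. some object name starts with it.
    has_directory = any(name.startswith(directory) for name in object_names)
    # Check 2: the flag object sits directly inside that directory.
    return has_directory and (directory + flag) in object_names


objects = {
    "gs://bucket/out/part-00000",
    "gs://bucket/out/_SUCCESS",
    "gs://bucket/partial/part-00000",
}
```

Writing the single flag object last is what replaces the mass rename: readers ignore the directory until the flag appears.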
Initializes a GCSFlagTarget.
Parameters
• path (str) – the directory where the files are stored.
• client –
• flag (str) –
fs = None
exists()
Returns True if the path for this FileSystemTarget exists; False otherwise.
This method is implemented by using fs.
luigi.contrib.hadoop module
Run Hadoop MapReduce jobs using Hadoop Streaming. To run a job, you need to subclass luigi.contrib.hadoop.JobTask
and implement the mapper and reducer methods. See Example – Top Artists for an example
of how to run a Hadoop job.
class luigi.contrib.hadoop.hadoop(*args, **kwargs)
Bases: luigi.task.Config
pool = OptionalParameter (defaults to None): Hadoop pool to use for Hadoop tasks. To s
luigi.contrib.hadoop.attach(*packages)
Attach a python package to hadoop map reduce tarballs to make those packages available on the hadoop cluster.
luigi.contrib.hadoop.dereference(f )
luigi.contrib.hadoop.get_extra_files(extra_files)
luigi.contrib.hadoop.create_packages_archive(packages, filename)
Create a tar archive which will contain the files for the packages listed in packages.
luigi.contrib.hadoop.flatten(sequence)
A simple generator which flattens a sequence.
Only one level is flattened.
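One-level flattening means that nested containers inside the inner sequences are left untouched. The generator below sketches that behaviour; it is a hypothetical re-implementation for illustration, and which container types it treats as flattenable (list, tuple, set) is an assumption.

```python
def flatten_one_level(sequence):
    """Yield items of sequence, expanding list/tuple/set members by one level only."""
    for item in sequence:
        if isinstance(item, (list, tuple, set)):
            yield from item  # expand one level; deeper nesting is kept as-is
        else:
            yield item
```

Strings are yielded whole rather than split into characters, which is usually what is wanted for file paths and task names.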
class luigi.contrib.hadoop.HadoopRunContext
Bases: object
kill_job(captured_signal=None, stack_frame=None)
exception luigi.contrib.hadoop.HadoopJobError(message, out=None, err=None)
Bases: exceptions.RuntimeError
luigi.contrib.hadoop.run_and_track_hadoop_job(arglist, tracking_url_callback=None,
env=None)
Runs the job by invoking the command from the given arglist. Finds tracking urls from the output and attempts to
fetch errors using those urls if the job fails. Throws HadoopJobError with information about the error (including
stdout and stderr from the process) on failure and returns normally otherwise.
Parameters
• arglist –
• tracking_url_callback –
• env –
Returns
luigi.contrib.hadoop.fetch_task_failures(tracking_url)
Uses mechanize to fetch the actual task logs from the task tracker.
This is highly opportunistic, and we might not succeed. So we set a low timeout and hope it works. If it does
not, it’s not the end of the world.
TODO: Yarn has a REST API that we should probably use instead:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html
class luigi.contrib.hadoop.JobRunner
Bases: object
run_job = NotImplemented
class luigi.contrib.hadoop.HadoopJobRunner(streaming_jar, modules=None, streaming_args=None,
libjars=None, libjars_in_hdfs=None, jobconfs=None, input_format=None, output_format=None,
end_job_with_atomic_move_dir=True, archives=None)
Bases: luigi.contrib.hadoop.JobRunner
Takes care of uploading & executing a Hadoop job using Hadoop streaming.
TODO: add code to support Elastic Mapreduce (using boto) and local execution.
run_job(job, tracking_url_callback=None)
finish()
class luigi.contrib.hadoop.DefaultHadoopJobRunner
Bases: luigi.contrib.hadoop.HadoopJobRunner
The default job runner just reads from config and sets stuff.
class luigi.contrib.hadoop.LocalJobRunner(samplelines=None)
Bases: luigi.contrib.hadoop.JobRunner
Will run the job locally.
This is useful for debugging and also unit testing. Tries to mimic Hadoop Streaming.
TODO: integrate with JobTask
sample(input_stream, n, output)
group(input_stream)
run_job(job)
class luigi.contrib.hadoop.BaseHadoopJobTask(*args, **kwargs)
Bases: luigi.task.Task
pool = Insignificant OptionalParameter (defaults to None)
batch_counter_default = 1
final_mapper = NotImplemented
final_combiner = NotImplemented
final_reducer = NotImplemented
mr_priority = NotImplemented
package_binary = None
task_id = None
job_runner()
jobconfs()
init_local()
Implement any work needed to set up internal data structures here.
You can add extra input using the requires_local/input_local methods.
Anything you set on the object will be pickled and available on the Hadoop nodes.
init_hadoop()
data_interchange_format = 'python'
run()
The task run method, to be overridden in a subclass.
See Task.run
requires_local()
Default impl - override this method if you need any local input to be accessible in init().
requires_hadoop()
input_local()
input_hadoop()
deps()
Internal method used by the scheduler.
Returns the flattened list of requires.
on_failure(exception)
Override for custom error handling.
This method gets called if an exception is raised in run(). The returned value of this method is json
encoded and sent to the scheduler as the expl argument. Its string representation will be used as the body
of the error email sent out if any.
Default behavior is to return a string representation of the stack trace.
class luigi.contrib.hadoop.JobTask(*args, **kwargs)
Bases: luigi.contrib.hadoop.BaseHadoopJobTask
jobconf_truncate = 20000
n_reduce_tasks = 25
reducer = NotImplemented
jobconfs()
init_mapper()
init_combiner()
init_reducer()
job_runner()
Get the MapReduce runner for this job.
If all outputs are HdfsTargets, the DefaultHadoopJobRunner will be used. Otherwise, the LocalJobRunner
which streams all data through the local machine will be used (great for testing).
reader(input_stream)
Reader is a method which iterates over input lines and outputs records.
The default implementation yields one argument containing the line for each line in the input.
writer(outputs, stdout, stderr=<open file ’<stderr>’, mode ’w’>)
Writer format is a method which iterates over the output records from the reducer and formats them for
output.
The default implementation outputs tab separated items.
mapper(item)
Re-define to process an input item (usually a line of input data).
Defaults to identity mapper that sends all lines to the same reducer.
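To show how a mapper and a reducer cooperate, the following stand-alone sketch simulates the map, sort, and reduce phases in plain Python for a word count. It does not use luigi.contrib.hadoop; the function names and the run_locally driver are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word, 1


def reducer(key, values):
    """Sum all counts emitted for one key."""
    yield key, sum(values)


def run_locally(lines):
    """Map every line, sort by key as Hadoop does, then reduce each key group."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results
```

In a JobTask subclass the mapper and reducer would be methods with the same shape: the mapper yields key/value pairs and the reducer receives a key together with an iterable of its values.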
combiner = NotImplemented
incr_counter(*args, **kwargs)
Increments a Hadoop counter.
Since counters can be a bit slow to update, this batches the updates.
extra_modules()
extra_files()
Can be overridden in a subclass.
Each element is either a string, or a pair of two strings (src, dst).
• src can be a directory (in which case everything will be copied recursively).
• dst can include subdirectories (foo/bar/baz.txt etc)
Uses Hadoop’s -files option so that the same file is reused across tasks.
extra_streaming_arguments()
Extra arguments to Hadoop command line. Return here a list of (parameter, value) tuples.
extra_archives()
List of paths to archives
add_link(src, dst)
dump(directory=’’)
Dump instance to file.
run_mapper(stdin=<open file ’<stdin>’, mode ’r’>, stdout=<open file ’<stdout>’, mode ’w’>)
Run the mapper on the hadoop node.
run_reducer(stdin=<open file ’<stdin>’, mode ’r’>, stdout=<open file ’<stdout>’, mode ’w’>)
Run the reducer on the hadoop node.
run_combiner(stdin=<open file ’<stdin>’, mode ’r’>, stdout=<open file ’<stdout>’, mode ’w’>)
internal_reader(input_stream)
Reader which uses python eval on each part of a tab separated string. Yields a tuple of python objects.
internal_writer(outputs, stdout)
Writer which outputs the python repr for each item.
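The repr-based interchange format can be demonstrated with a round trip through an in-memory stream. This sketch deliberately swaps eval for ast.literal_eval, which only accepts Python literals; write_records and read_records are hypothetical names, not the methods above.

```python
import ast
import io


def write_records(records, stdout):
    """Write each record as tab-separated repr() strings, one record per line."""
    for record in records:
        stdout.write("\t".join(repr(item) for item in record) + "\n")


def read_records(input_stream):
    """Parse lines written by write_records back into tuples of Python objects."""
    for line in input_stream:
        parts = line.rstrip("\n").split("\t")
        yield tuple(ast.literal_eval(part) for part in parts)


buffer = io.StringIO()
write_records([("artist", 3), ("x y", [1, 2])], buffer)
buffer.seek(0)
round_tripped = list(read_records(buffer))
```

Because repr() escapes tab and newline characters inside strings, the separators stay unambiguous.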
luigi.contrib.hadoop_jar module
ssh()
Set this to run hadoop command remotely via ssh. It needs to be a dict that looks like {“host”: “myhost”,
“key_file”: None, “username”: None, [“no_host_key_check”: False]}
args()
Returns an array of args to pass to the job (after hadoop jar <jar> <main>).
luigi.contrib.hive module
class luigi.contrib.hive.ApacheHiveCommandClient
Bases: luigi.contrib.hive.HiveCommandClient
A subclass for the HiveCommandClient to (in some cases) ignore the return code from the hive command so
that we can just parse the output.
table_schema(table, database=’default’)
Returns list of [(name, type)] for each column in database.table.
class luigi.contrib.hive.MetastoreClient
Bases: luigi.contrib.hive.HiveClient
table_location(table, database=’default’, partition=None)
Returns location of db.table (or db.table.partition). partition is a dict of partition key to value.
table_exists(table, database=’default’, partition=None)
Returns true if db.table (or db.table.partition) exists. partition is a dict of partition key to value.
table_schema(table, database=’default’)
Returns list of [(name, type)] for each column in database.table.
partition_spec(partition)
Turn a dict into a string partition specification
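As an illustration, a partition dict can be rendered in the key=value/key=value form that Hive uses for partition directory names. The exact string produced by Luigi's clients is not shown in this reference, so the format, the sorting by key, and the function below are assumptions.

```python
def partition_spec(partition):
    """Render {'date': '2020-01-01', 'hour': 3} as 'date=2020-01-01/hour=3'."""
    return "/".join(
        "%s=%s" % (key, value) for key, value in sorted(partition.items())
    )
```

Sorting by key makes the result deterministic regardless of the order in which the dict was built.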
class luigi.contrib.hive.HiveThriftContext
Bases: object
Context manager for hive metastore client.
class luigi.contrib.hive.WarehouseHiveClient(hdfs_client=None, warehouse_location=None)
Bases: luigi.contrib.hive.HiveClient
Client for managed tables that makes decisions based on the presence of a directory in HDFS
table_schema(table, database=’default’)
Returns list of [(name, type)] for each column in database.table.
table_location(table, database=’default’, partition=None)
Returns location of db.table (or db.table.partition). partition is a dict of partition key to value.
table_exists(table, database=’default’, partition=None)
The table/partition is considered to exist if the corresponding path in HDFS exists and contains at least one file
other than those matching the patterns set in ignored_file_masks
partition_spec(partition)
Turn a dict into a string partition specification
luigi.contrib.hive.get_default_client()
class luigi.contrib.hive.HiveQueryTask(*args, **kwargs)
Bases: luigi.contrib.hadoop.BaseHadoopJobTask
Task to run a hive query.
n_reduce_tasks = None
bytes_per_reducer = None
reducers_max = None
query()
Text of query to run in hive
hiverc()
Location of an rc file to run before the query. If the hiverc-location key is specified in luigi.cfg, defaults to
the value there; otherwise returns None.
Returning a list of rc files will load all of them in order.
hivevars()
Returns a dict of key=value settings to be passed along to the hive command line via --hivevar. This option
can be used as a separate namespace for script-local variables. See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
hiveconfs()
Returns a dict of key=value settings to be passed along to the hive command line via --hiveconf. By default,
sets mapred.job.name to task_id and if not None, sets:
• mapred.reduce.tasks (n_reduce_tasks)
• mapred.fairscheduler.pool (pool) or mapred.job.queue.name (pool)
• hive.exec.reducers.bytes.per.reducer (bytes_per_reducer)
• hive.exec.reducers.max (reducers_max)
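The defaults listed above can be expressed as a small function that skips every setting whose value is None. This is a sketch rather than the method's source: build_hiveconfs, to_cli_args, and the pool_key argument (used here to choose between the two pool-related keys) are hypothetical.

```python
def build_hiveconfs(task_id, n_reduce_tasks=None, pool=None,
                    bytes_per_reducer=None, reducers_max=None,
                    pool_key="mapred.fairscheduler.pool"):
    """Collect --hiveconf settings, skipping every option that is None."""
    confs = {"mapred.job.name": task_id}
    optional = {
        "mapred.reduce.tasks": n_reduce_tasks,
        pool_key: pool,  # or "mapred.job.queue.name", depending on the scheduler
        "hive.exec.reducers.bytes.per.reducer": bytes_per_reducer,
        "hive.exec.reducers.max": reducers_max,
    }
    confs.update({key: value for key, value in optional.items() if value is not None})
    return confs


def to_cli_args(confs):
    """Turn the dict into repeated '--hiveconf key=value' arguments."""
    args = []
    for key, value in sorted(confs.items()):
        args += ["--hiveconf", "%s=%s" % (key, value)]
    return args
```

An override of hiveconfs() would typically call the parent implementation and add or replace individual keys in the returned dict.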
job_runner()
class luigi.contrib.hive.HiveQueryRunner
Bases: luigi.contrib.hadoop.JobRunner
Runs a HiveQueryTask by shelling out to hive.
prepare_outputs(job)
Called before job is started.
If output is a FileSystemTarget, create parent directories so the hive command won’t fail
get_arglist(f_name, job)
run_job(job, tracking_url_callback=None)
class luigi.contrib.hive.HivePartitionTarget(table, partition, database=’default’,
fail_missing_table=True, client=None)
Bases: luigi.target.Target
Target representing Hive table or Hive partition
Parameters
• table (str) – Table name
• partition – partition specification in the form of a dict of {“partition_column_1”: “partition_value_1”,
“partition_column_2”: “partition_value_2”, . . . }. If partition is None or {}, the target is a
non-partitioned Hive table
• database – Database name
• fail_missing_table – flag to ignore errors raised due to table nonexistence
• client – HiveCommandClient instance. A default client is used if client is None
exists()
returns True if the partition/table exists
path
Returns the path for this HiveTablePartitionTarget’s data.
class luigi.contrib.hive.HiveTableTarget(table, database=’default’, client=None)
Bases: luigi.contrib.hive.HivePartitionTarget
Target representing non-partitioned table
class luigi.contrib.hive.ExternalHiveTask(*args, **kwargs)
Bases: luigi.task.ExternalTask
External task that depends on a Hive table/partition.
luigi.contrib.kubernetes module
{
"containers": [{
"name": "pi",
"image": "perl",
"command": ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
}],
"restartPolicy": "Never"
}
restartPolicy
• If restartPolicy is not defined, it will be set to “Never” by default.
• Warning: restartPolicy=OnFailure will bypass max_retrials, and restart the container until success,
with the risk of blocking the Luigi task.
For more information, please refer to: http://kubernetes.io/docs/user-guide/pods/multi-container/#the-spec-schema
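The defaulting rule for restartPolicy can be pictured with a small standalone sketch. It uses plain dictionaries only; the function name with_default_restart_policy is hypothetical and is not part of the module.

```python
import copy


def with_default_restart_policy(spec_schema):
    """Return a copy of the pod spec with restartPolicy defaulted to "Never"."""
    spec = copy.deepcopy(spec_schema)
    # Only fill in the value when the caller did not define one.
    spec.setdefault("restartPolicy", "Never")
    return spec


spec_schema = {
    "containers": [{
        "name": "pi",
        "image": "perl",
        "command": ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"],
    }],
}
print(with_default_restart_policy(spec_schema)["restartPolicy"])
```

An explicitly set value such as "OnFailure" is left untouched by this sketch, which is why the warning above still applies.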
max_retrials
Maximum number of retrials in case of failure.
backoff_limit
Maximum number of retries before considering the job as failed. See: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy
delete_on_success
Delete the Kubernetes workload if the job has ended successfully.
print_pod_logs_on_exit
Fetch and print the pod logs once the job is completed.
active_deadline_seconds
Time allowed to successfully schedule pods. See: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#job-termination-and-cleanup
kubernetes_config
poll_interval
How often to poll Kubernetes for job status, in seconds.
pod_creation_wait_interal
Delay, in seconds, for initial pod creation of a just-submitted job.
signal_complete()
Signal job completion for scheduler and dependent tasks.
Touching a system file is an easy way to signal completion. Example:
with self.output().open('w') as output_file:
    output_file.write('')
run()
The task run method, to be overridden in a subclass.
See Task.run
output()
An output target is necessary for checking job completion unless an alternative complete method is defined.
Example:
luigi.contrib.lsf module
luigi.contrib.lsf.track_job(job_id)
Tracking is done by requesting each job and then checking whether the job has one of the following states (based on the LSF documentation):
• “RUN”
• “PEND”
• “SSUSP”
• “EXIT”
luigi.contrib.lsf.kill_job(job_id)
Kill a running LSF job
class luigi.contrib.lsf.LSFJobTask(*args, **kwargs)
Bases: luigi.task.Task
Takes care of uploading and executing an LSF job
n_cpu_flag = Insignificant IntParameter (defaults to 2)
shared_tmp_dir = Insignificant Parameter (defaults to /tmp)
resource_flag = Insignificant Parameter (defaults to mem=8192)
memory_flag = Insignificant Parameter (defaults to 8192)
queue_flag = Insignificant Parameter (defaults to queue_name)
runtime_flag = IntParameter (defaults to 60)
job_name_flag = Parameter (defaults to )
poll_time = Insignificant FloatParameter (defaults to 5): specify the wait time to pol
save_job_info = BoolParameter (defaults to False)
output = Parameter (defaults to )
extra_bsub_args = Parameter (defaults to )
job_status = None
fetch_task_failures()
Read in the error file from bsub
fetch_task_output()
Read in the output file
init_local()
Implement any work to set up internal data structures etc. here. You can add extra input using the requires_local/input_local methods. Anything you set on the object will be pickled and available on the
compute nodes.
run()
The procedure:
• Pickle the class
• Tarball the dependencies
• Construct a bsub argument that runs a generic runner function with the path to the pickled class
• Runner function loads the class from pickle
• Runner class untars the dependencies
• Runner function hits the button on the class’s work() method
work()
Subclass this for where you’re doing your actual work.
Why not run(), like other tasks? Because we need run to always be something that the Worker can call,
and that’s the real logical place to do LSF scheduling. So, the work will happen in work().
class luigi.contrib.lsf.LocalLSFJobTask(*args, **kwargs)
Bases: luigi.contrib.lsf.LSFJobTask
A local version of JobTask, for easier debugging.
run()
The procedure:
• Pickle the class
• Tarball the dependencies
• Construct a bsub argument that runs a generic runner function with the path to the pickled class
• Runner function loads the class from pickle
• Runner class untars the dependencies
• Runner function hits the button on the class’s work() method
luigi.contrib.lsf_runner module
luigi.contrib.lsf_runner.do_work_on_compute_node(work_dir)
luigi.contrib.lsf_runner.extract_packages_archive(work_dir)
luigi.contrib.lsf_runner.main(args=sys.argv)
Run the work() method from the class instance in the file “job-instance.pickle”.
luigi.contrib.mongodb module
exists()
Test if the target has been run. The target is considered run if the number of items in the target matches the value
of self._target_count.
read()
Uses the aggregate method to avoid an inaccurate count when using a sharded cluster. See: https://docs.mongodb.com/manual/reference/method/db.collection.count/#behavior
luigi.contrib.mrrunner module
Since Luigi 2.5.0, this is a private module to Luigi. Luigi users should not rely on being able to import this module.
Furthermore, “luigi mr streaming” has been largely superseded by technologies like Spark, Hive, etc.
The hadoop runner.
This module contains the main() method which will be used to run the mapper, combiner, or reducer on the Hadoop
nodes.
class luigi.contrib.mrrunner.Runner(job=None)
Bases: object
Run the mapper, combiner, or reducer on hadoop nodes.
run(kind, stdin=sys.stdin, stdout=sys.stdout)
extract_packages_archive()
luigi.contrib.mrrunner.print_exception(exc)
luigi.contrib.mrrunner.main(args=None, stdin=sys.stdin, stdout=sys.stdout, print_exception=print_exception)
Run either the mapper, combiner, or reducer from the class instance in the file “job-instance.pickle”.
Arguments:
kind – is either map, combiner, or reduce
luigi.contrib.mssqldb module
touch(connection=None)
Mark this update as complete.
IMPORTANT: If the marker table doesn’t exist, the connection transaction will be aborted and the connection
reset. Then the marker table will be created.
exists(connection=None)
Returns True if the Target exists and False otherwise.
connect()
Create a SQL Server connection and return a connection object
create_marker_table()
Create marker table if it doesn’t exist. Use a separate connection since the transaction might have to be
reset.
luigi.contrib.mysqldb module
To customize how to access data from an input task, override the rows method with a generator that yields each
row as a tuple with fields ordered according to columns.
rows()
Return/yield tuples or lists corresponding to each row to be inserted.
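The following standalone sketch shows the shape of such a rows generator: each yielded tuple follows the order given in columns. The class TsvRowSource, its column names, and the tab-separated input are hypothetical and do not subclass any Luigi class.

```python
class TsvRowSource(object):
    """Hypothetical stand-in for a copy task; not a Luigi class."""

    # Field order that every yielded tuple must follow.
    columns = ["artist", "plays"]

    def __init__(self, lines):
        self.lines = lines

    def rows(self):
        """Yield one tuple per non-empty input line, ordered as in `columns`."""
        for line in self.lines:
            line = line.rstrip("\n")
            if not line:
                continue
            artist, plays = line.split("\t")
            yield (artist, int(plays))


source = TsvRowSource(["Abba\t12\n", "\n", "Zappa\t7\n"])
print(list(source.rows()))
```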
output()
Returns a MySqlTarget representing the inserted dataset.
Normally you don’t override this.
copy(cursor, file=None)
run()
Inserts data generated by rows() into target table.
If the target table doesn’t exist, self.create_table will be called to attempt to create the table.
Normally you don’t want to override this.
bulk_size
luigi.contrib.opener module
OpenerTarget support. Allows easier testing and configuration by abstracting out the LocalTarget, S3Target, and MockTarget types.
Example:
OpenerTarget('/local/path.txt')
OpenerTarget('s3://zefr/remote/path.txt')
exception luigi.contrib.opener.OpenerError
Bases: luigi.target.FileSystemException
The base exception thrown by openers
exception luigi.contrib.opener.NoOpenerError
Bases: luigi.contrib.opener.OpenerError
Thrown when there is no opener for the given protocol
exception luigi.contrib.opener.InvalidQuery
Bases: luigi.contrib.opener.OpenerError
Thrown when an opener is passed unexpected arguments
class luigi.contrib.opener.OpenerRegistry(openers=None)
Bases: object
An opener registry that stores a number of opener objects used to parse Target URIs
Parameters openers (list) – A list of objects inherited from the Opener class.
get_opener(name)
Retrieve an opener for the given protocol
Parameters name (string) – name of the opener to open
Raises NoOpenerError – if no opener has been registered of that name
add(opener)
Adds an opener to the registry
Parameters opener (Opener inherited object) – Opener object
open(target_uri, **kwargs)
Open target uri.
Parameters target_uri (string) – Uri to open
Returns Target object
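A toy registry along these lines can be sketched with the standard library. All names here (SketchRegistry, NoOpenerForScheme, the lambda openers) are hypothetical, and the fallback to the file scheme is an assumption based on the description of the local opener as the default.

```python
from urllib.parse import urlsplit


class NoOpenerForScheme(Exception):
    """Hypothetical error raised when no opener handles a scheme."""


class SketchRegistry(object):
    """Toy registry mapping URI schemes to callables that build targets."""

    def __init__(self):
        self._openers = {}

    def add(self, names, opener):
        # One opener may serve several scheme names.
        for name in names:
            self._openers[name] = opener

    def get_opener(self, name):
        if name not in self._openers:
            raise NoOpenerForScheme("No opener for %s" % name)
        return self._openers[name]

    def open(self, target_uri):
        parts = urlsplit(target_uri)
        # Fall back to the 'file' scheme when the URI names none.
        scheme = parts.scheme or "file"
        return self.get_opener(scheme)(parts)


registry = SketchRegistry()
registry.add(["file"], lambda parts: ("local", parts.netloc + parts.path))
registry.add(["s3"], lambda parts: ("s3", parts.netloc, parts.path))
print(registry.open("s3://bucket/foo/bar.txt"))
print(registry.open("foo/bar.baz"))
```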
class luigi.contrib.opener.Opener
Bases: object
Base class for Opener objects.
allowed_kwargs = {}
filter_kwargs = True
classmethod conform_query(query)
Converts the query string from a target uri, uses cls.allowed_kwargs, and cls.filter_kwargs to drive logic.
Parameters query (urllib.parse.urlsplit(uri).query) – Unparsed query string
Returns Dictionary of parsed values. Every key in cls.allowed_kwargs whose value is set to True is parsed as a JSON string.
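The parsing rule can be sketched as follows. This is an illustration under assumptions, not the library code: the function name conform_query_sketch is hypothetical, and the way filter_kwargs drops undeclared keys is a guess at the intended behaviour.

```python
import json
from urllib.parse import parse_qs


def conform_query_sketch(query, allowed_kwargs, filter_kwargs=True):
    """Parse a URI query string in the spirit of the description above."""
    # parse_qs returns lists; keep the first value for each key.
    parsed = {key: values[0] for key, values in parse_qs(query).items()}
    if filter_kwargs:
        # Drop anything the opener does not declare.
        parsed = {k: v for k, v in parsed.items() if k in allowed_kwargs}
    for key, value in parsed.items():
        if allowed_kwargs.get(key) is True:
            # Keys flagged True carry JSON-encoded values.
            parsed[key] = json.loads(value)
    return parsed


allowed = {"format": False, "is_tmp": True}
print(conform_query_sketch("is_tmp=true&format=gzip&other=1", allowed))
```

Here is_tmp is decoded from JSON into a boolean, while format stays a plain string because its flag is False.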
classmethod get_target(scheme, path, fragment, username, password, hostname, port, query,
**kwargs)
Override this method to use values from the parsed uri to initialize the expected target.
class luigi.contrib.opener.MockOpener
Bases: luigi.contrib.opener.Opener
Mock target opener, works like LocalTarget but files are all in memory.
Example:
• mock://foo/bar.txt
names = ['mock']
allowed_kwargs = {'format': False, 'is_tmp': True, 'mirror_on_stderr': True}
classmethod get_target(scheme, path, fragment, username, password, hostname, port, query,
**kwargs)
Override this method to use values from the parsed uri to initialize the expected target.
class luigi.contrib.opener.LocalOpener
Bases: luigi.contrib.opener.Opener
Local filesystem opener, works with any valid system path. This is the default opener and will be used if you
don’t indicate which opener.
Examples:
• file://relative/foo/bar/baz.txt (opens a relative file)
• file:///home/user (opens a directory from an absolute path)
• foo/bar.baz (file:// is the default opener)
names = ['file']
allowed_kwargs = {'format': False, 'is_tmp': True}
classmethod get_target(scheme, path, fragment, username, password, hostname, port, query,
**kwargs)
Override this method to use values from the parsed uri to initialize the expected target.
class luigi.contrib.opener.S3Opener
Bases: luigi.contrib.opener.Opener
Opens a target stored on Amazon S3 storage
Examples:
• s3://bucket/foo/bar.txt
• s3://bucket/foo/bar.txt?aws_access_key_id=xxx&aws_secret_access_key=yyy
luigi.contrib.pai module
{
"jobName": String,
"image": String,
"authFile": String,
"dataDir": String,
"outputDir": String,
"codeDir": String,
"virtualCluster": String,
"taskRoles": [
{
"name": String,
"taskNumber": Integer,
"cpuNumber": Integer,
"memoryMB": Integer,
"shmMB": Integer,
"gpuNumber": Integer,
"portList": [
{
"label": String,
"beginAt": Integer,
"portNumber": Integer
• name – Name for the task role; must be unique among the roles, required
• command – Executable command for tasks in the task role; cannot be empty, required
• taskNumber – Number of tasks for the task role, no less than 1, required
• cpuNumber – CPU number for one task in the task role, no less than 1, required
• shmMB – Shared memory for one task in the task role, no more than memory size, required
• memoryMB – Memory for one task in the task role, no less than 100, required
• gpuNumber – GPU number for one task in the task role, no less than 0, required
• portList – List of portType to use, optional
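The constraints listed above can be checked with a small standalone validator. The function validate_task_role and its messages are hypothetical; the sketch works on a plain dict and does not use the module's own classes.

```python
def validate_task_role(role):
    """Return a list of constraint violations for a task role dict (illustrative)."""
    errors = []
    for field in ("name", "command", "taskNumber", "cpuNumber",
                  "memoryMB", "shmMB", "gpuNumber"):
        if field not in role:
            errors.append("%s is required" % field)
    if errors:
        # Skip range checks when required fields are missing.
        return errors
    if not role["command"]:
        errors.append("command cannot be empty")
    if role["taskNumber"] < 1:
        errors.append("taskNumber must be at least 1")
    if role["cpuNumber"] < 1:
        errors.append("cpuNumber must be at least 1")
    if role["memoryMB"] < 100:
        errors.append("memoryMB must be at least 100")
    if role["shmMB"] > role["memoryMB"]:
        errors.append("shmMB must not exceed memoryMB")
    if role["gpuNumber"] < 0:
        errors.append("gpuNumber must be at least 0")
    return errors


role = {"name": "worker", "command": "python train.py", "taskNumber": 1,
        "cpuNumber": 2, "memoryMB": 4096, "shmMB": 64, "gpuNumber": 0}
print(validate_task_role(role))
```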
name
command
taskNumber
cpuNumber
memoryMB
shmMB
gpuNumber
portList
minFailedTaskCount
minSucceededTaskCount
class luigi.contrib.pai.OpenPai(*args, **kwargs)
Bases: luigi.task.Config
pai_url = Parameter (defaults to http://127.0.0.1:9186): rest server url, default is h
username = Parameter (defaults to admin): your username
password = Parameter (defaults to None): your password
expiration = IntParameter (defaults to 3600): expiration time in seconds
class luigi.contrib.pai.PaiTask(*args, **kwargs)
Bases: luigi.task.Task
Parameters
• pai_url – The REST server URL of the PAI cluster; the default is ‘http://127.0.0.1:9186’.
• token – The token used to auth the rest server of PAI.
name
Name for the job; must be unique, required
image
URL pointing to the Docker image for all tasks in the job, required
tasks
List of taskRole, one task role at least, required
auth_file_path
Docker registry authentication file existing on HDFS, optional
data_dir
Data directory existing on HDFS, optional
code_dir
Code directory existing on HDFS, should not contain any data and should be less than 200MB, optional
output_dir
Output directory on HDFS; $PAI_DEFAULT_FS_URI/$jobName/output will be used if not specified, optional
virtual_cluster
The virtual cluster the job runs on. If omitted, the job will run on the default virtual cluster, optional
gpu_type
Specify the GPU type to be used in the tasks. If omitted, the job will run on any GPU type, optional
retry_count
Job retry count, no less than 0, optional
run()
The task run method, to be overridden in a subclass.
See Task.run
output()
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs
all exist. Subclasses should override this method to return a single Target or a list of Target instances.
Implementation note If running multiple workers, the output must be a resource that is accessible by all
workers, such as a DFS or database. Otherwise, workers might compute the same output since they
don’t see the work done by other workers.
See Task.output
complete()
If the task has any outputs, return True if all outputs exist. Otherwise, return False.
However, you may freely override this method with custom logic.
luigi.contrib.pig module
[pig]
# pig home directory
home: /usr/share/pig
pig_properties()
Dictionary of properties that should be set when running Pig.
Example:
return { 'pig.additional.jars':'/path/to/your/jar' }
pig_parameters()
Dictionary of parameters that should be set for the Pig job.
Example:
pig_options()
List of options that will be appended to the Pig command.
Example:
output()
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs
all exist. Subclasses should override this method to return a single Target or a list of Target instances.
Implementation note If running multiple workers, the output must be a resource that is accessible by all
workers, such as a DFS or database. Otherwise, workers might compute the same output since they
don’t see the work done by other workers.
See Task.output
pig_script_path()
Return the path to the Pig script to be run.
run()
The task run method, to be overridden in a subclass.
See Task.run
track_and_progress(cmd)
class luigi.contrib.pig.PigRunContext
Bases: object
kill_job(captured_signal=None, stack_frame=None)
exception luigi.contrib.pig.PigJobError(message, out=None, err=None)
Bases: exceptions.RuntimeError
luigi.contrib.postgres module
Implements a subclass of Target that writes data to Postgres. Also provides a helper task to copy data into a Postgres
table.
class luigi.contrib.postgres.MultiReplacer(replace_pairs)
Bases: object
Object for one-pass replace of multiple words
Substituted parts will not be matched against other replace patterns, as opposed to when using multipass replace.
The order of the items in the replace_pairs input will dictate replacement precedence.
Constructor arguments: replace_pairs – list of 2-tuples which hold strings to be replaced and replace string
Usage:
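The one-pass behaviour can be sketched with the standard re module. This is a standalone illustration rather than the MultiReplacer implementation; the function name one_pass_replace is hypothetical.

```python
import re


def one_pass_replace(replace_pairs, text):
    """Replace all search strings in a single left-to-right pass."""
    if not replace_pairs:
        return text
    replacements = dict(replace_pairs)
    # Alternatives are tried in list order, so earlier pairs take precedence.
    pattern = re.compile("|".join(re.escape(old) for old, _ in replace_pairs))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


print(one_pass_replace([("a", "b"), ("b", "c")], "abcd"))
print(one_pass_replace([("ab", "x"), ("a", "x")], "ab"))
print(one_pass_replace([("a", "x"), ("ab", "x")], "ab"))
```

In the first call the "b" produced from "a" is not replaced again, which is the difference from a multipass replace. The last two calls show how the order of the pairs decides which pattern wins.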
To customize how to access data from an input task, override the rows method with a generator that yields each
row as a tuple with fields ordered according to columns.
rows()
Return/yield tuples or lists corresponding to each row to be inserted.
map_column(value)
Applied to each column of every row returned by rows.
Default behaviour is to escape special characters and identify any self.null_values.
output()
Returns a PostgresTarget representing the inserted dataset.
Normally you don’t override this.
copy(cursor, file)
run()
Inserts data generated by rows() into target table.
If the target table doesn’t exist, self.create_table will be called to attempt to create the table.
Normally you don’t want to override this.
class luigi.contrib.postgres.PostgresQuery(*args, **kwargs)
Bases: luigi.contrib.rdbms.Query
Template task for querying a Postgres compatible database
Usage: Subclass and override the required host, database, user, password, table, and query attributes. Optionally, one can override the autocommit attribute to put the connection for the query in autocommit mode.
Override the run method if your use case requires some action with the query result.
Task instances require a dynamic update_id, e.g. via parameter(s); otherwise the query will only execute once.
To customize the query signature as recorded in the database marker table, override the update_id property.
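Why a dynamic update_id matters can be seen in a small simulation of a marker table. It uses an in-memory SQLite database purely as a stand-in; the table name marker, its schema, and the function run_once are assumptions for this sketch and not the library's own.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Hypothetical marker table; the real table name and schema may differ.
connection.execute("CREATE TABLE marker (update_id TEXT PRIMARY KEY)")
executed = []


def run_once(update_id, query):
    """Run `query` only if `update_id` has not been recorded yet."""
    seen = connection.execute(
        "SELECT 1 FROM marker WHERE update_id = ?", (update_id,)
    ).fetchone()
    if seen:
        return False
    executed.append(query)  # stand-in for actually executing the query
    connection.execute("INSERT INTO marker (update_id) VALUES (?)", (update_id,))
    connection.commit()
    return True


print(run_once("DailyReport(date=2020-01-01)", "SELECT 1"))
print(run_once("DailyReport(date=2020-01-01)", "SELECT 1"))
print(run_once("DailyReport(date=2020-01-02)", "SELECT 1"))
```

The second call is skipped because its update_id is already recorded, while the third runs because a parameter changed the update_id.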
run()
The task run method, to be overridden in a subclass.
See Task.run
output()
Returns a PostgresTarget representing the executed query.
Normally you don’t override this.
luigi.contrib.presto module
poll_interval = FloatParameter (defaults to 1.0): how often to ask the Presto REST int
class luigi.contrib.presto.PrestoClient(connection, sleep_time=1)
Helper class wrapping pyhive.presto.Connection for executing presto queries and tracking progress
percentage_progress
Returns percentage of query overall progress
info_uri
Returns query UI link
execute(query, parameters=None, mode=None)
Parameters
• query – query to run
• parameters – parameters to be injected into the query
• mode – “fetch” - yields rows, “watch” - yields log entries
Returns
class luigi.contrib.presto.WithPrestoClient
Bases: luigi.task_register.Register
A metaclass for injecting PrestoClient as a _client field into a new instance of class T. Presto connection options
are taken from T-instance fields. Fields should have the same names as in pyhive.presto.Cursor.
class luigi.contrib.presto.PrestoTarget(client, catalog, database, table, partition=None)
Bases: luigi.target.Target
Target for presto-accessible tables
count()
exists()
Returns True if the given table exists and there are any rows in the given partition; False if there are no rows in
the partition or the table is absent.
class luigi.contrib.presto.PrestoTask(*args, **kwargs)
Bases: luigi.contrib.rdbms.Query
Task for executing Presto queries. During execution, the tracking URL and percentage progress are set.
host
port
user
username
schema
password
catalog
poll_interval
source
partition
protocol
session_props
requests_session
requests_kwargs
query = None
run()
The task run method, to be overridden in a subclass.
See Task.run
output()
Override with an RDBMS Target (e.g. PostgresTarget or RedshiftTarget) to record execution in a marker
table
luigi.contrib.prometheus_metric module
luigi.contrib.pyspark_runner module
luigi.contrib.rdbms module
• host,
• database,
• user,
• password,
• table
• columns
• port
host
database
user
password
table
port
columns = []
null_values = (None,)
column_separator = '\t'
create_table(connection)
Override to provide code for creating the target table.
By default it will be created using types (optionally) specified in columns.
If overridden, use the provided connection object for setting up the table in order to create the table and
insert data using the same transaction.
update_id
This update id will be a unique identifier for this insert on this table.
output()
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs
all exist. Subclasses should override this method to return a single Target or a list of Target instances.
Implementation note If running multiple workers, the output must be a resource that is accessible by all
workers, such as a DFS or database. Otherwise, workers might compute the same output since they
don’t see the work done by other workers.
See Task.output
init_copy(connection)
Override to perform custom queries.
Any code here will be performed in the same transaction as the main copy, just prior to copying data. Example
use cases include truncating the table or removing all data older than X in the database to keep a rolling
window of data available in the table.
post_copy(connection)
Override to perform custom queries.
Any code here will be performed in the same transaction as the main copy, just after copying data. Example
use cases include cleansing data in temp table prior to insertion into real table.
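The idea of running a pre-copy step and the copy in one transaction can be sketched with SQLite standing in for the RDBMS. The events table, the cutoff value, and the helper copy_rows are hypothetical; only the names init_copy and the rolling-window use case come from the description above.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE events (day TEXT, value INTEGER)")
connection.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2020-01-01", 1), ("2020-01-09", 2)],
)
connection.commit()


def init_copy(conn, cutoff):
    # Pre-copy step: keep a rolling window by deleting rows older than cutoff.
    conn.execute("DELETE FROM events WHERE day < ?", (cutoff,))


def copy_rows(conn, rows):
    # Main copy step.
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)


# `with connection` commits on success and rolls back on an exception,
# so the pre-copy delete and the copy succeed or fail together.
with connection:
    init_copy(connection, "2020-01-05")
    copy_rows(connection, [("2020-01-10", 3)])

result = connection.execute("SELECT day, value FROM events ORDER BY day").fetchall()
print(result)
```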
copy(cursor, file)
class luigi.contrib.rdbms.Query(*args, **kwargs)
Bases: luigi.task.MixinNaiveBulkComplete, luigi.task.Task
An abstract task for executing an RDBMS query.
Usage:
Subclass and override the following attributes:
• host,
• database,
• user,
• password,
• table,
• query
Optionally override:
• port,
• autocommit
• update_id
Subclass and override the following methods:
• run
• output
host
Host of the RDBMS. Implementation should support hostname:port to encode port.
port
Override to specify port separately from host.
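A minimal sketch of the hostname:port convention follows. The helper split_host_port is hypothetical, and it does not handle IPv6 literals, which contain colons themselves.

```python
def split_host_port(host, default_port=None):
    """Split 'hostname:port' into (hostname, port); port may be absent."""
    if ":" in host:
        hostname, port = host.rsplit(":", 1)
        return hostname, int(port)
    return host, default_port


print(split_host_port("db.example.com:5439"))
print(split_host_port("db.example.com", default_port=5432))
```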
database
user
password
table
query
autocommit
update_id
Override to create a custom marker table ‘update_id’ signature for Query subclass task instances
run()
The task run method, to be overridden in a subclass.
See Task.run
output()
Override with an RDBMS Target (e.g. PostgresTarget or RedshiftTarget) to record execution in a marker
table
luigi.contrib.redis_store module
luigi.contrib.redshift module
variables are overridden, an exception is raised to remind the user to implement all or none. Prune (data
newer than prune_date deleted) before copying new data in.
table_type
Return table type (i.e. ‘temp’).
queries
Override to return a list of queries to be executed in order.
truncate_table(connection)
prune(connection)
create_schema(connection)
Will create the schema in the database
create_table(connection)
Override to provide code for creating the target table.
By default it will be created using types (optionally) specified in columns.
If overridden, use the provided connection object for setting up the table in order to create the table and
insert data using the same transaction.
run()
If the target table doesn’t exist, self.create_table will be called to attempt to create the table.
copy(cursor, f )
Defines copying from s3 into redshift.
If both key-based and role-based credentials are provided, role-based will be used.
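The precedence rule can be sketched as below. The function credentials_clause and its argument names are hypothetical, and the two string layouts follow the credential string forms that Redshift COPY accepts; the string the task actually builds may differ.

```python
def credentials_clause(access_key_id=None, secret_access_key=None,
                       account_id=None, role_name=None):
    """Build a credentials string, preferring role-based credentials."""
    if account_id and role_name:
        # Role-based credentials win when both kinds are supplied.
        return "aws_iam_role=arn:aws:iam::{0}:role/{1}".format(account_id, role_name)
    if access_key_id and secret_access_key:
        return "aws_access_key_id={0};aws_secret_access_key={1}".format(
            access_key_id, secret_access_key)
    raise ValueError("Provide either key-based or role-based credentials")


print(credentials_clause(account_id="123456789012", role_name="loader"))
```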
output()
Returns a RedshiftTarget representing the inserted dataset.
Normally you don’t override this.
does_schema_exist(connection)
Determine whether the schema already exists.
does_table_exist(connection)
Determine whether the table already exists.
init_copy(connection)
Perform pre-copy sql - such as creating table, truncating, or removing data older than x.
post_copy(cursor)
Performs post-copy sql - such as cleansing data, inserting into production table (if copied to temp table),
etc.
post_copy_metacolums(cursor)
Performs post-copy to fill metadata columns.
class luigi.contrib.redshift.S3CopyJSONToTable(*args, **kwargs)
Bases: luigi.contrib.redshift.S3CopyToTable, luigi.contrib.redshift._CredentialsMixin
Template task for inserting a JSON data set into Redshift from s3.
Usage:
• Subclass and override the required attributes:
– host,
– database,
– user,
– password,
– table,
– columns,
– s3_load_path,
– jsonpath,
– copy_json_options.
• You can also override the attributes provided by the CredentialsMixin if they are not supplied by your
configuration or environment variables.
jsonpath
Override the jsonpath schema location for the table.
copy_json_options
Add extra copy options, for example:
• GZIP
• LZOP
copy(cursor, f )
Defines copying JSON from s3 into redshift.
class luigi.contrib.redshift.RedshiftManifestTask(*args, **kwargs)
Bases: luigi.contrib.s3.S3PathTask
Generic task to generate a manifest file that can be used in S3CopyToTable in order to copy multiple files from
your s3 folder into a redshift table at once.
For a full description of how to use the manifest file, see http://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
Usage:
• requires parameters
– path - s3 path to the generated manifest file, including the name of the generated file to be
copied into a redshift table
– folder_paths - s3 paths to the folders containing files you wish to be copied
Output:
• generated manifest file
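For orientation, a manifest of the kind described in the linked Redshift documentation is a JSON object with an entries list. The sketch below builds one from a list of file URLs; the function build_manifest and the bucket paths are hypothetical, and the task's own output may contain different fields.

```python
import json


def build_manifest(file_urls):
    """Return manifest JSON text listing each S3 file as a mandatory entry."""
    entries = [{"url": url, "mandatory": True} for url in file_urls]
    return json.dumps({"entries": entries}, indent=2)


manifest = build_manifest([
    "s3://example-bucket/folder/part-00000",
    "s3://example-bucket/folder/part-00001",
])
print(manifest)
```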
folder_paths = Parameter
text_target = True
run()
The task run method, to be overridden in a subclass.
See Task.run
class luigi.contrib.redshift.KillOpenRedshiftSessions(*args, **kwargs)
Bases: luigi.task.Task
A task for killing any open Redshift sessions in a given database. This is necessary to prevent open user
sessions with transactions against the table from blocking drop or truncate table commands.
Usage:
Subclass and override the required host, database, user, and password attributes.
connection_reset_wait_seconds = IntParameter (defaults to 60)
host
database
user
password
update_id
This update id will be a unique identifier for this insert on this table.
output()
Returns a RedshiftTarget representing the inserted dataset.
Normally you don’t override this.
run()
Kill any open Redshift sessions for the given database.
class luigi.contrib.redshift.RedshiftQuery(*args, **kwargs)
Bases: luigi.contrib.postgres.PostgresQuery
Template task for querying an Amazon Redshift database
Usage: Subclass and override the required host, database, user, password, table, and query attributes.
Override the run method if your use case requires some action with the query result.
Task instances require a dynamic update_id, e.g. via parameter(s); otherwise the query will only execute once.
To customize the query signature as recorded in the database marker table, override the update_id property.
output()
Returns a RedshiftTarget representing the executed query.
Normally you don’t override this.
class luigi.contrib.redshift.RedshiftUnloadTask(*args, **kwargs)
Bases: luigi.contrib.postgres.PostgresQuery, luigi.contrib.redshift._CredentialsMixin
Template task for running UNLOAD on an Amazon Redshift database
Usage: Subclass and override the required host, database, user, password, table, and query attributes. Optionally, override the autocommit attribute to run the query in autocommit mode; this is necessary to run VACUUM, for example.
Override the run method if your use case requires some action with the query result.
Task instances require a dynamic update_id, e.g. via parameter(s); otherwise the query will only execute once.
To customize the query signature as recorded in the database marker table, override the update_id property.
You can also override the attributes provided by the CredentialsMixin if they are not supplied by your configuration or environment variables.
s3_unload_path
Override to return the load path.
unload_options
Add extra or override default unload options:
unload_query
Default UNLOAD command
run()
The task run method, to be overridden in a subclass.
See Task.run
output()
Returns a RedshiftTarget representing the executed query.
Normally you don’t override this.
luigi.contrib.s3 module
Implementation of Simple Storage Service support. S3Target is a subclass of the Target class to support S3 file
system operations. The boto3 library is required to use S3 targets.
exception luigi.contrib.s3.InvalidDeleteException
Bases: luigi.target.FileSystemException
exception luigi.contrib.s3.FileNotFoundException
Bases: luigi.target.FileSystemException
exception luigi.contrib.s3.DeprecatedBotoClientException
Bases: exceptions.Exception
class luigi.contrib.s3.S3Client(aws_access_key_id=None, aws_secret_access_key=None,
aws_session_token=None, **kwargs)
Bases: luigi.target.FileSystem
boto3-powered S3 client.
DEFAULT_PART_SIZE = 8388608
DEFAULT_THREADS = 100
s3
exists(path)
Does provided path exist on S3?
remove(path, recursive=True)
Remove a file or directory from S3.
Parameters
• path – File or directory to remove
• recursive – Boolean indicator to remove object and children
Returns Boolean indicator denoting success of the removal of 1 or more files
move(source_path, destination_path, **kwargs)
Rename/move an object from one S3 location to another.
Parameters
• source_path – The s3:// path of the directory or key to copy from
• destination_path – The s3:// path of the directory or key to copy to
• kwargs – Keyword arguments are passed to the boto3 function copy
get_key(path)
Returns the object summary at the path
put(local_path, destination_s3_path, **kwargs)
Put an object stored locally to an S3 path.
Parameters
• local_path – Path to source local file
• destination_s3_path – URL for target S3 location
• kwargs – Keyword arguments are passed to the boto function put_object
put_string(content, destination_s3_path, **kwargs)
Put a string to an S3 path.
Parameters
• content – Data str
• destination_s3_path – URL for target S3 location
• kwargs – Keyword arguments are passed to the boto3 function put_object
datetime before end_time
• return_key – Optional argument; when set to True, returns boto3’s ObjectSummary (instead of the filename)
list(path, start_time=None, end_time=None, return_key=False)
class luigi.contrib.s3.AtomicS3File(path, s3_client, **kwargs)
Bases: luigi.target.AtomicLocalFile
An S3 file that writes to a temp file and puts to S3 on close.
Parameters kwargs – Keyword arguments are passed to the boto function initiate_multipart_upload
move_to_final_destination()
class luigi.contrib.s3.ReadableS3File(s3_key)
Bases: object
read(size=None)
close()
readable()
writable()
seekable()
class luigi.contrib.s3.S3Target(path, format=None, client=None, **kwargs)
Bases: luigi.target.FileSystemTarget
Target S3 file object
Parameters kwargs – Keyword arguments are passed to the boto function initiate_multipart_upload
fs = None
open(mode=’r’)
Open the FileSystem target.
This method returns a file-like object which can either be read from or written to depending on the specified
mode.
Parameters mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas
w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
Using b is not supported; initialize with format=Nop instead.
class luigi.contrib.s3.S3FlagTarget(path, format=None, client=None, flag=’_SUCCESS’)
Bases: luigi.contrib.s3.S3Target
Defines a target directory with a flag-file (defaults to _SUCCESS) used to signify job success.
This checks for two things:
• the path exists (just like the S3Target)
• the _SUCCESS file exists within the directory.
Because Hadoop outputs into a directory and not a single file, the path is assumed to be a directory.
This is meant to be a handy alternative to AtomicS3File.
The AtomicFile approach can be burdensome for S3 since there are no directories, per se.
If we have 1,000,000 output files, then we have to rename 1,000,000 objects.
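The flag-file check can be pictured without any S3 access. In this sketch a set of key names stands in for the bucket listing; the function flag_target_exists and the example keys are hypothetical.

```python
def flag_target_exists(existing_keys, path, flag="_SUCCESS"):
    """Return True if the flag object is present under the directory path."""
    if not path.endswith("/"):
        # Treat the path as a directory prefix.
        path += "/"
    return (path + flag) in existing_keys


keys = {
    "s3://example-bucket/out/part-00000",
    "s3://example-bucket/out/_SUCCESS",
}
print(flag_target_exists(keys, "s3://example-bucket/out"))
print(flag_target_exists(keys, "s3://example-bucket/other"))
```

Only one small object has to appear for the whole directory to count as complete, which avoids renaming every output object.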
Initializes a S3FlagTarget.
Parameters
• path (str) – the directory where the files are stored.
• client –
• flag (str) –
fs = None
exists()
Returns True if the path for this FileSystemTarget exists; False otherwise.
This method is implemented by using fs.
class luigi.contrib.s3.S3EmrTarget(*args, **kwargs)
Bases: luigi.contrib.s3.S3FlagTarget
Deprecated. Use S3FlagTarget
class luigi.contrib.s3.S3PathTask(*args, **kwargs)
Bases: luigi.task.ExternalTask
An external task that requires the existence of a path in S3.
path = Parameter
output()
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs
all exist. Subclasses should override this method to return a single Target or a list of Target instances.
Implementation note If running multiple workers, the output must be a resource that is accessible by all
workers, such as a DFS or database. Otherwise, workers might compute the same output since they
don’t see the work done by other workers.
See Task.output
class luigi.contrib.s3.S3EmrTask(*args, **kwargs)
Bases: luigi.task.ExternalTask
An external task that requires the existence of EMR output in S3.
path = Parameter
output()
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs
all exist. Subclasses should override this method to return a single Target or a list of Target instances.
Implementation note If running multiple workers, the output must be a resource that is accessible by all
workers, such as a DFS or database. Otherwise, workers might compute the same output since they
don’t see the work done by other workers.
See Task.output
class luigi.contrib.s3.S3FlagTask(*args, **kwargs)
Bases: luigi.task.ExternalTask
An external task that requires the existence of EMR output in S3.
path = Parameter
flag = OptionalParameter (defaults to None)
output()
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs
all exist. Subclasses should override this method to return a single Target or a list of Target instances.
Implementation note If running multiple workers, the output must be a resource that is accessible by all
workers, such as a DFS or database. Otherwise, workers might compute the same output since they
don’t see the work done by other workers.
See Task.output
luigi.contrib.salesforce module
luigi.contrib.salesforce.get_soql_fields(soql)
Gets queried column names.
luigi.contrib.salesforce.ensure_utf(value)
luigi.contrib.salesforce.parse_results(fields, data)
Traverses ordered dictionary, calls _traverse_results() to recursively read into the dictionary depth of data
class luigi.contrib.salesforce.salesforce(*args, **kwargs)
Bases: luigi.task.Config
Config system to get config vars from ‘salesforce’ section in configuration file.
Did not include sandbox_name here, as the user may have multiple sandboxes.
username = Parameter (defaults to )
password = Parameter (defaults to )
security_token = Parameter (defaults to )
sb_security_token = Parameter (defaults to )
class luigi.contrib.salesforce.QuerySalesforce(*args, **kwargs)
Bases: luigi.task.Task
object_name
Override to return the SF object we are querying. Must have the SF “__c” suffix if it is a custom object.
use_sandbox
Override to specify use of SF sandbox. True iff we should be uploading to a sandbox environment instead
of the production organization.
sandbox_name
Override to specify the sandbox name if it is intended to be used.
soql
Override to return the raw string SOQL or the path to it.
is_soql_file
Override to True if soql property is a file path.
content_type
Override to use a different content type. Salesforce allows XML, CSV, ZIP_CSV, or ZIP_XML. Defaults
to CSV.
run()
The task run method, to be overridden in a subclass.
See Task.run
merge_batch_results(result_ids)
Merges the resulting files of a multi-result batch bulk query.
class luigi.contrib.salesforce.SalesforceAPI(username, password, security_token,
sb_token=None, sandbox_name=None)
Bases: object
Class used to interact with the SalesforceAPI. Currently provides only the methods necessary for performing a
bulk upload operation.
API_VERSION = 34.0
SOAP_NS = '{urn:partner.soap.sforce.com}'
API_NS = '{http://www.force.com/2009/06/asyncapi/dataload}'
start_session()
Starts a Salesforce session and determines which SF instance to use for future requests.
has_active_session()
query(query, **kwargs)
Return the result of a Salesforce SOQL query as a dict decoded from the Salesforce response JSON payload.
Parameters query – the SOQL query to send to Salesforce, e.g. “SELECT id from Lead
WHERE email = ‘[email protected]’”
query_more(next_records_identifier, identifier_is_url=False, **kwargs)
Retrieves more results from a query that returned more results than the batch maximum. Returns a dict
decoded from the Salesforce response JSON payload.
Parameters
• next_records_identifier – either the Id of the next Salesforce object in the result,
or a URL to the next record in the result.
• identifier_is_url – True if next_records_identifier should be treated as a URL,
False if next_records_identifier should be treated as an Id.
query_all(query, **kwargs)
Returns the full set of results for the query. This is a convenience wrapper around query(. . . ) and
query_more(. . . ). The returned dict is the decoded JSON payload from the final call to Salesforce, but
with the totalSize field representing the full number of results retrieved and the records list representing
the full list of records retrieved.
Parameters query – the SOQL query to send to Salesforce, e.g. SELECT Id FROM Lead
WHERE Email = “[email protected]”
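The way a convenience wrapper can fold query() and query_more() pages into one result is sketched below. FakeClient and collect_all are hypothetical names, and the done and nextRecordsUrl payload fields are assumed for the example; only totalSize and records are named in the description above.

```python
class FakeClient:
    """Hypothetical stand-in for an API client; serves canned pages."""

    def __init__(self, pages):
        self._pages = pages

    def query(self, query):
        return self._pages[0]

    def query_more(self, next_records_identifier, identifier_is_url=False):
        # In this sketch the identifier is simply an index into the canned pages.
        return self._pages[int(next_records_identifier)]


def collect_all(client, query):
    """Collect every page of results into a single payload-shaped dict."""
    response = client.query(query)
    records = list(response["records"])
    # 'done' and 'nextRecordsUrl' are assumed field names for this sketch.
    while not response["done"]:
        response = client.query_more(response["nextRecordsUrl"], identifier_is_url=True)
        records.extend(response["records"])
    # Start from the final payload, then overwrite the aggregate fields.
    result = dict(response)
    result["records"] = records
    result["totalSize"] = len(records)
    return result


pages = [
    {"done": False, "nextRecordsUrl": "1", "totalSize": 3,
     "records": [{"Id": "a"}, {"Id": "b"}]},
    {"done": True, "totalSize": 3, "records": [{"Id": "c"}]},
]
result = collect_all(FakeClient(pages), "SELECT Id FROM Lead")
```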
restful(path, params)
Allows you to make a direct REST call if you know the path.
Parameters
• path – the path of the request, e.g. sobjects/User/ABC123/password
• params – dict of parameters to pass to the path
create_operation_job(operation, obj, external_id_field_name=None, content_type=None)
Creates a new SF job for performing any operation (insert, upsert, update, delete, query)
Parameters
• operation – delete, insert, query, upsert, update, hardDelete. Must be lowercase.
• obj – Parent SF object
• external_id_field_name – Optional.
get_job_details(job_id)
Gets all details for existing job
Parameters job_id – job_id as returned by ‘create_operation_job(. . . )’
Returns job info as xml
abort_job(job_id)
Abort an existing job. When a job is aborted, no more records are processed. Changes to data may already
have been committed and aren’t rolled back.
Parameters job_id – job_id as returned by ‘create_operation_job(. . . )’
Returns abort response as xml
close_job(job_id)
Closes job
Parameters job_id – job_id as returned by ‘create_operation_job(. . . )’
Returns close response as xml
create_batch(job_id, data, file_type)
Creates a batch with either a string of data or a file containing data.
If a file is provided, this will pull the contents of the file_target into memory when running. That shouldn’t
be a problem for any files that meet the Salesforce single batch upload size limit (10MB) and is done to
ensure compressed files can be uploaded properly.
Parameters
• job_id – job_id as returned by ‘create_operation_job(. . . )’
• data –
Returns batch_id
block_on_batch(job_id, batch_id, sleep_time_seconds=5, max_wait_time_seconds=-1)
Blocks until @batch_id is completed or failed.
Parameters
• job_id –
• batch_id –
• sleep_time_seconds –
• max_wait_time_seconds –
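A polling loop of the kind block_on_batch describes might look like the sketch below. It is not the luigi implementation: wait_for_batch and get_status are hypothetical names, the terminal status strings are assumed, and a negative max_wait_time_seconds is treated as "no time limit", which is an assumption about the -1 default.

```python
import time


def wait_for_batch(get_status, sleep_time_seconds=5, max_wait_time_seconds=-1,
                   sleep=time.sleep):
    """Poll get_status() until it reports a terminal state.

    get_status: hypothetical callable returning a status string.
    A negative max_wait_time_seconds means "no time limit" (assumption).
    """
    waited = 0
    while True:
        status = get_status()
        if status in ("Completed", "Failed"):  # assumed terminal states
            return status
        if 0 <= max_wait_time_seconds <= waited:
            raise RuntimeError(
                "batch did not finish within %d seconds" % max_wait_time_seconds)
        sleep(sleep_time_seconds)
        waited += sleep_time_seconds


# Simulate a batch that finishes on the third poll; skip real sleeping.
statuses = iter(["Queued", "InProgress", "Completed"])
final = wait_for_batch(lambda: next(statuses), sleep=lambda seconds: None)
```

Passing the sleep function in as an argument keeps the loop testable without waiting in real time.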
get_batch_results(job_id, batch_id)
DEPRECATED: Use get_batch_result_ids
get_batch_result_ids(job_id, batch_id)
Get result IDs of a batch that has completed processing.
Parameters
• job_id – job_id as returned by ‘create_operation_job(. . . )’
• batch_id – batch_id as returned by ‘create_batch(. . . )’
Returns list of batch result IDs to be used in ‘get_batch_result(. . . )’
get_batch_result(job_id, batch_id, result_id)
Gets result back from Salesforce as whatever type was originally sent in create_batch (xml or csv).
Parameters
• job_id –
• batch_id –
• result_id –
luigi.contrib.scalding module
# provided dependencies, e.g. jars required for compiling but not executing
# scalding jobs. Currently required jars:
# org.apache.hadoop/hadoop-core/0.20.2
# org.slf4j/slf4j-log4j12/1.6.6
# log4j/log4j/1.2.15
# commons-httpclient/commons-httpclient/3.1
# commons-cli/commons-cli/1.2
# org.apache.zookeeper/zookeeper/3.3.4
scalding-provided: /usr/share/scalding/provided
class luigi.contrib.scalding.ScaldingJobRunner
Bases: luigi.contrib.hadoop.JobRunner
JobRunner for pyscald commands. Used to run a ScaldingJobTask.
get_scala_jars(include_compiler=False)
get_scalding_jars()
get_scalding_core()
get_provided_jars()
get_libjars()
get_tmp_job_jar(source)
get_build_dir(source)
get_job_class(source)
build_job_jar(job)
run_job(job, tracking_url_callback=None)
class luigi.contrib.scalding.ScaldingJobTask(*args, **kwargs)
Bases: luigi.contrib.hadoop.BaseHadoopJobTask
A job task for Scalding that defines a Scala source and (optional) main method.
requires() should return a dictionary where the keys are Scalding argument names and values are sub tasks or
lists of subtasks.
For example:
{'input1': A, 'input2': C} => --input1 <Aoutput> --input2 <Coutput>
{'input1': [A, B], 'input2': [C]} => --input1 <Aoutput> <Boutput> --input2 <Coutput>
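The mapping from a requires() dictionary to job arguments shown above can be sketched in a few lines. The function build_job_args and the output_path callable are hypothetical and only illustrate the mapping; they are not part of luigi.

```python
def build_job_args(requires, output_path):
    """Turn a requires()-style dict into a flat list of command-line arguments.

    output_path: hypothetical callable mapping a task to its output location.
    """
    args = []
    for name, tasks in requires.items():
        # A single task and a list of tasks are handled the same way.
        if not isinstance(tasks, (list, tuple)):
            tasks = [tasks]
        args.append("--" + name)
        args.extend(output_path(task) for task in tasks)
    return args


requires = {"input1": ["A", "B"], "input2": "C"}
args = build_job_args(requires, lambda task: "<%soutput>" % task)
```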
relpath(current_file, rel_path)
Compute path given current file and relative path.
source()
Path to the scala source for this Scalding Job
Either one of source() or jar() must be specified.
jar()
Path to the jar file for this Scalding Job
Either one of source() or jar() must be specified.
extra_jars()
Extra jars for building and running this Scalding Job.
job_class()
Optional main job class for this Scalding Job.
job_runner()
atomic_output()
If True, then rewrite output arguments to be temp locations and atomically move them into place after the
job finishes.
requires()
The Tasks that this Task depends on.
A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any
other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method
to return a single Task, a list of Task instances, or a dict whose values are Task instances.
See Task.requires
job_args()
Extra arguments to pass to the Scalding job.
args()
Returns an array of args to pass to the job.
luigi.contrib.sge module
import logging
import luigi
import os
from luigi.contrib.sge import SGEJobTask

logger = logging.getLogger('luigi-interface')

class TestJobTask(SGEJobTask):
    # Task parameter used to give each job its own output file
    i = luigi.Parameter()

    def work(self):
        # work() holds the code that runs on the cluster node
        logger.info('Running test job...')
        with open(self.output().path, 'w') as f:
            f.write('this is a test')

    def output(self):
        return luigi.LocalTarget(os.path.join('/home', 'testfile_' + str(self.i)))

if __name__ == '__main__':
    # Each task requests a different number of CPU slots via n_cpu
    tasks = [TestJobTask(i=str(i), n_cpu=i+1) for i in range(3)]
    luigi.build(tasks, local_scheduler=True, workers=3)
The n-cpu parameter allows you to define different compute resource requirements (or slots, in SGE terms) for each
task. In this example, the third Task asks for 3 CPU slots. If your cluster only contains nodes with 2 CPUs, this
task will hang indefinitely in the queue. See the docs for luigi.contrib.sge.SGEJobTask for other SGE
parameters. As for any task, you can also set these in your luigi configuration file as shown below. The default values
below were matched to the values used by MIT StarCluster, an open-source SGE cluster manager for use with Amazon
EC2:
[SGEJobTask]
shared-tmp-dir = /home
parallel-env = orte
n-cpu = 2
• no_tarball: Don’t create a tarball of the luigi project directory. Can be useful to reduce I/O requirements
when the luigi directory is accessible from cluster nodes already.
n_cpu = Insignificant IntParameter (defaults to 2)
shared_tmp_dir = Insignificant Parameter (defaults to /home)
parallel_env = Insignificant Parameter (defaults to orte)
job_name_format = Insignificant Parameter (defaults to None): A string that can be for
run_locally = Insignificant BoolParameter (defaults to False): run locally instead of
poll_time = Insignificant IntParameter (defaults to 5): specify the wait time to poll
dont_remove_tmp_dir = Insignificant BoolParameter (defaults to False): don't delete th
no_tarball = Insignificant BoolParameter (defaults to False): don't tarball (and extra
job_name = Insignificant Parameter (defaults to None): Explicit job name given via qsu
run()
The task run method, to be overridden in a subclass.
See Task.run
work()
Override this method, rather than run(), for your actual work.
class luigi.contrib.sge.LocalSGEJobTask(*args, **kwargs)
Bases: luigi.contrib.sge.SGEJobTask
A local version of SGEJobTask, for easier debugging.
This version skips the qsub steps and simply runs work() on the local node, so you don’t need to be on an
SGE cluster to use your Task in a test workflow.
run()
The task run method, to be overridden in a subclass.
See Task.run
luigi.contrib.sge_runner module
luigi.contrib.simulate module
luigi.contrib.spark module
master
deploy_mode
jars
packages
py_files
files
conf
properties_file
driver_memory
driver_java_options
driver_library_path
driver_class_path
executor_memory
driver_cores
supervise
total_executor_cores
executor_cores
queue
num_executors
archives
hadoop_conf_dir
get_environment()
program_environment()
Override this method to control environment variables for the program
Returns dict mapping environment variable names to values
program_args()
Override this method to map your task parameters to the program arguments
Returns list to pass as args to subprocess.Popen
spark_command()
app_command()
class luigi.contrib.spark.PySparkTask(*args, **kwargs)
Bases: luigi.contrib.spark.SparkSubmitTask
Template task for running an inline PySpark job
Simply implement the main method in your subclass
You can optionally define package names to be distributed to the cluster with py_packages (uses luigi’s
global py-packages configuration by default)
app = '/home/docs/checkouts/readthedocs.org/user_builds/luigi/envs/stable/lib/python2.7
name
py_packages
files
setup(conf)
Called by the pyspark_runner with a SparkConf instance that will be used to instantiate the SparkContext
Parameters conf – SparkConf
setup_remote(sc)
main(sc, *args)
Called by the pyspark_runner with a SparkContext and any arguments returned by app_options()
Parameters
• sc – SparkContext
• args – arguments list
app_command()
run()
The task run method, to be overridden in a subclass.
See Task.run
luigi.contrib.sparkey module
luigi.contrib.sqla module
Support for SQLAlchemy. Provides SQLAlchemyTarget for storing in databases supported by SQLAlchemy. The
user is responsible for installing the required database driver to connect using SQLAlchemy.
A minimal example of a job that copies data to a database using SQLAlchemy is shown below:
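from sqlalchemy import String
import luigi
from luigi.contrib import sqla

class SQLATask(sqla.CopyToTable):
    # columns defines the table schema, with each element corresponding
    # to a column in the format (args, kwargs) which will be sent to
    # the sqlalchemy.Column(*args, **kwargs)
    columns = [
        (["item", String(64)], {"primary_key": True}),
        (["property", String(64)], {})
    ]
    connection_string = "sqlite://"  # in memory SQLite database
    table = "item_property"  # name of the table to store data

    def rows(self):
        for row in [("item1", "property1"), ("item2", "property2")]:
            yield row

if __name__ == '__main__':
    task = SQLATask()
    luigi.build([task], local_scheduler=True)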
If the target table where the data needs to be copied already exists, then the column schema definition can be skipped
and instead the reflect flag can be set as True. Here is a modified version of the above example:
import luigi
from luigi.contrib import sqla

class SQLATask(sqla.CopyToTable):
    # If database table is already created, then the schema can be loaded
    # by setting the reflect flag to True
    reflect = True
    connection_string = "sqlite://"  # in memory SQLite database
    table = "item_property"  # name of the table to store data

    def rows(self):
        for row in [("item1", "property1"), ("item2", "property2")]:
            yield row

if __name__ == '__main__':
    task = SQLATask()
    luigi.build([task], local_scheduler=True)
In the above examples, the data that needs to be copied was directly provided by overriding the rows method. Alternately,
if the data comes from another task, the modified example would look as shown below:
from sqlalchemy import String
import luigi
from luigi.contrib import sqla
from luigi.mock import MockTarget

class BaseTask(luigi.Task):
    def output(self):
        return MockTarget("BaseTask")

    def run(self):
        out = self.output().open("w")
        TASK_LIST = ["item%d\tproperty%d\n" % (i, i) for i in range(10)]
        for task in TASK_LIST:
            # Write one tab-separated line per row
            out.write(task)
        out.close()

class SQLATask(sqla.CopyToTable):
    # columns defines the table schema, with each element corresponding
    # to a column in the format (args, kwargs) which will be sent to
    # the sqlalchemy.Column(*args, **kwargs)
    columns = [
        (["item", String(64)], {"primary_key": True}),
        (["property", String(64)], {})
    ]
    connection_string = "sqlite://"  # in memory SQLite database
    table = "item_property"  # name of the table to store data

    def requires(self):
        return BaseTask()

if __name__ == '__main__':
    task1, task2 = SQLATask(), BaseTask()
    luigi.build([task1, task2], local_scheduler=True)
In the above example, the output from BaseTask is copied into the database. Here we did not have to implement
the rows method because the default rows implementation assumes every line is a row with column values separated
by a tab. Define the column_separator option for the task if the values are, say, comma separated instead of tab
separated.
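The default row handling described above (one row per line, columns split on a separator) can be sketched as a standalone function. This is an illustration under that description, not the CopyToTable.rows implementation; split_rows is a hypothetical name.

```python
def split_rows(lines, column_separator="\t"):
    """Yield one tuple of column values per input line."""
    for line in lines:
        # Drop the trailing newline before splitting on the separator.
        yield tuple(line.rstrip("\n").split(column_separator))


tab_rows = list(split_rows(["item1\tproperty1\n", "item2\tproperty2\n"]))
csv_rows = list(split_rows(["item1,property1\n"], column_separator=","))
```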
You can pass in database specific connection arguments by setting the connect_args dictionary. The options will be
passed directly to the DBAPI’s connect method as keyword arguments.
Another sqla.CopyToTable option that can help with performance is chunk_size. The default is
5000. This is the number of rows that will be inserted in a transaction at a time. Depending on the size of the inserts,
this value can be tuned for performance.
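To make the effect of chunk_size concrete, the sketch below splits an iterable of rows into batches of at most chunk_size items. The helper name chunks is hypothetical, and the sketch says nothing about how luigi itself batches rows internally.

```python
from itertools import islice


def chunks(rows, chunk_size=5000):
    """Yield lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 12 rows with chunk_size=5 give batches of 5, 5 and 2 rows.
sizes = [len(chunk) for chunk in chunks(range(12), chunk_size=5)]
```

Each batch would correspond to one insert transaction, so a larger chunk_size means fewer transactions but more rows held in memory at once.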
See here for a tutorial on building task pipelines using luigi and using SQLAlchemy in workflow pipelines.
Author: Gouthaman Balaraman Date: 01/02/2015
class luigi.contrib.sqla.SQLAlchemyTarget(connection_string, target_table, update_id,
echo=False, connect_args=None)
Bases: luigi.target.Target
Database target using SQLAlchemy.
This will rarely have to be directly instantiated by the user.
Typical usage would be to override luigi.contrib.sqla.CopyToTable class to create a task to write to the database.
Constructor for the SQLAlchemyTarget.
Parameters
• connection_string (str) – SQLAlchemy connection string
• target_table (str) – The table name for the data
• update_id (str) – An identifier for this data set
• echo (bool) – Flag to setup SQLAlchemy logging
• connect_args (dict) – A dictionary of connection arguments
Returns
marker_table = None
class Connection(engine, pid)
Bases: tuple
Create new instance of Connection(engine, pid)
engine
Alias for field number 0
pid
Alias for field number 1
engine
Return an engine instance, creating it if it doesn’t exist.
Recreate the engine connection if it wasn’t originally created by the current process.
touch()
Mark this update as complete.
exists()
Returns True if the Target exists and False otherwise.
create_marker_table()
Create marker table if it doesn’t exist.
Using a separate connection since the transaction might have to be reset.
open(mode)
class luigi.contrib.sqla.CopyToTable(*args, **kwargs)
Bases: luigi.task.Task
An abstract task for inserting a data set into SQLAlchemy RDBMS
Usage:
• subclass and override the required connection_string, table and columns attributes.
• optionally override the schema attribute to use a different schema for the target table.
echo = False
connect_args = {}
connection_string
table
columns = []
schema = ''
column_separator = '\t'
chunk_size = 5000
reflect = False
create_table(engine)
Override to provide code for creating the target table.
By default it will be created using types specified in columns. If the table exists, then it binds to the existing
table.
If overridden, use the provided connection object for setting up the table in order to create the table and
insert data using the same transaction.
Parameters engine (object) – the sqlalchemy engine instance
update_id()
This update id will be a unique identifier for this insert on this table.
output()
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs
all exist. Subclasses should override this method to return a single Target or a list of Target instances.
Implementation note If running multiple workers, the output must be a resource that is accessible by all
workers, such as a DFS or database. Otherwise, workers might compute the same output since they
don’t see the work done by other workers.
See Task.output
rows()
Return/yield tuples or lists corresponding to each row to be inserted.
This method can be overridden for custom file types or formats.
run()
The task run method, to be overridden in a subclass.
See Task.run
copy(conn, ins_rows, table_bound)
This method does the actual insertion of the rows of data given by ins_rows into the database. A task
that needs row updates instead of insertions should override this method.
Parameters
• conn – the sqlalchemy connection object
• ins_rows – the dictionary of rows with the keys in the format _<column_name>. For example, if you
have a table with a column named “property”, then the key in the dictionary would be “_property”.
This format is consistent with the bindparam usage in sqlalchemy.
• table_bound – the object referring to the table
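The _<column_name> key format can be illustrated with a small helper that turns row tuples into dictionaries. The name to_ins_rows is hypothetical; this only shows the shape of the data, not how luigi builds it.

```python
def to_ins_rows(rows, column_names):
    """Map each row to a dict keyed by '_<column_name>'."""
    return [
        {"_" + name: value for name, value in zip(column_names, row)}
        for row in rows
    ]


ins_rows = to_ins_rows([("item1", "property1")], ["item", "property"])
```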
luigi.contrib.ssh module
check_output(cmd)
Execute a shell command remotely and return the output.
Simplified version of Popen when you only want the output as a string and detect any errors.
tunnel(**kwds)
Open a tunnel between localhost:local_port and remote_host:remote_port via the host specified by this
context.
Remember to close() the returned “tunnel” object in order to clean up after yourself when you are done
with the tunnel.
class luigi.contrib.ssh.RemoteFileSystem(host, **kwargs)
Bases: luigi.target.FileSystem
exists(path)
Return True if file or directory at path exists, False otherwise.
listdir(path)
Return a list of files rooted in path.
This returns an iterable of the files rooted at path. This is intended to be a recursive listing.
Parameters path (str) – a path within the FileSystem to list.
Note: This method is optional, not all FileSystem subclasses implement it.
isdir(path)
Return True if directory at path exists, False otherwise.
remove(path, recursive=True)
Remove file or directory at location path.
mkdir(path, parents=True, raise_if_exists=False)
Create directory at location path
Creates the directory at path and implicitly create parent directories if they do not already exist.
Parameters
• path (str) – a path within the FileSystem to create as a directory.
• parents (bool) – Create parent directories when necessary. When parents=False and
the parent directory doesn’t exist, raise luigi.target.MissingParentDirectory
• raise_if_exists (bool) – raise luigi.target.FileAlreadyExists if the folder already
exists.
put(local_path, path)
get(path, local_path)
class luigi.contrib.ssh.AtomicRemoteFileWriter(fs, path)
Bases: luigi.format.OutputPipeProcessWrapper
close()
tmp_path
fs
class luigi.contrib.ssh.RemoteTarget(path, host, format=None, **kwargs)
Bases: luigi.target.FileSystemTarget
Target used for reading from remote files.
The target is implemented using ssh commands streaming data over the network.
fs
open(mode='r')
Open the FileSystem target.
This method returns a file-like object which can either be read from or written to depending on the specified
mode.
Parameters mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas
w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
Using b is not supported; initialize with format=Nop instead.
put(local_path)
get(local_path)
luigi.contrib.target module
luigi.contrib.webhdfs module
move_to_final_destination()
Module contents
luigi.tools package
Submodules
luigi.tools.deps module
luigi.tools.deps.get_task_requires(task)
luigi.tools.deps.dfs_paths(start_task, goal_task_family, path=None)
class luigi.tools.deps.upstream(*args, **kwargs)
Bases: luigi.task.Config
Used to provide the parameter upstream-family
family = OptionalParameter (defaults to None)
luigi.tools.deps.find_deps(task, upstream_task_family)
Finds all dependencies that start with the given task and have a path to upstream_task_family
Returns all deps on all paths between task and upstream
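The idea of collecting every task on any path between a starting task and an upstream task can be sketched with a depth-first search over a plain dictionary. The graph representation (task name mapped to the names it requires) and the function names are assumptions for the example, and the sketch assumes the graph has no cycles.

```python
def all_paths(graph, start, goal, path=None):
    """Yield every path from start to goal in a dependency graph.

    graph: hypothetical dict mapping a task name to the names it requires.
    """
    path = (path or []) + [start]
    if start == goal:
        yield path
    for dependency in graph.get(start, []):
        yield from all_paths(graph, dependency, goal, path)


def deps_between(graph, task, upstream):
    """Return the set of tasks lying on any path from task to upstream."""
    return {name for path in all_paths(graph, task, upstream) for name in path}


graph = {"Report": ["Aggregate", "Lookup"], "Aggregate": ["Extract"], "Lookup": []}
deps = deps_between(graph, "Report", "Extract")
```

Lookup is left out of the result because no path from it reaches Extract.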
luigi.tools.deps.find_deps_cli()
Finds all tasks on all paths from provided CLI task
luigi.tools.deps.get_task_output_description(task_output)
Returns a task’s output as a string
luigi.tools.deps.main()
luigi.tools.deps_tree module
This module parses commands exactly the same as the luigi task runner. You must specify the module, the task and
task parameters. Instead of executing a task, this module prints the significant parameters and state of the task and its
dependencies in a tree format. Use this to visualize the execution plan in the terminal.
class luigi.tools.deps_tree.bcolors
colored output for task status
OKBLUE = '\x1b[94m'
OKGREEN = '\x1b[92m'
ENDC = '\x1b[0m'
luigi.tools.deps_tree.print_tree(task, indent='', last=True)
Return a string representation of the tasks, their statuses/parameters in a dependency tree format
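A recursive renderer with indent and last arguments, similar in spirit to the signature above, could be sketched as follows. The function render_tree, the dictionary-based graph and status inputs, and the exact branch characters are all assumptions made for illustration.

```python
def render_tree(name, graph, status, indent="", last=True):
    """Return a text tree of a task and its dependencies.

    graph and status are hypothetical dicts: task name -> required task
    names, and task name -> status string.
    """
    result = "\n" + indent
    if last:
        # The last child closes the branch, so deeper levels get blank padding.
        result += "`--"
        indent += "   "
    else:
        # Earlier children keep a vertical bar so later siblings stay connected.
        result += "|--"
        indent += "|  "
    result += "[%s-%s]" % (name, status[name])
    children = graph.get(name, [])
    for index, child in enumerate(children):
        is_last = index == len(children) - 1
        result += render_tree(child, graph, status, indent, is_last)
    return result


graph = {"Report": ["Aggregate", "Lookup"], "Aggregate": ["Extract"]}
status = {"Report": "PENDING", "Aggregate": "PENDING",
          "Lookup": "COMPLETE", "Extract": "COMPLETE"}
tree = render_tree("Report", graph, status)
```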
luigi.tools.deps_tree.main()
luigi.tools.luigi_grep module
luigi.tools.range module
DELAY = 'event.tools.range.delay'
class luigi.tools.range.RangeBase(*args, **kwargs)
Bases: luigi.task.WrapperTask
Produces a contiguous completed range of a recurring task.
Made for the common use case where a task is parameterized by e.g. DateParameter, and assurance is
needed that any gaps arising from downtime are eventually filled.
Emits events that one can use to monitor gaps and delays.
At least one of start and stop needs to be specified.
(This is quite an abstract base class for subclasses with different datetime parameter classes, e.g.
DateParameter, DateHourParameter, . . . , and different parameter naming, e.g. days_back/forward,
hours_back/forward, . . . , as well as different documentation wording, to improve user experience.)
Subclasses will need to use the of parameter when overriding methods.
of = TaskParameter: task name to be completed. The task must take a single datetime pa
of_params = DictParameter (defaults to {}): Arguments to be provided to the 'of' class
start = Parameter
stop = Parameter
reverse = BoolParameter (defaults to False): specifies the preferred order for catchin
task_limit = IntParameter (defaults to 50): how many of 'of' tasks to require. Guards
now = IntParameter (defaults to None): set to override current time. In seconds since
param_name = Parameter (defaults to None): parameter name used to pass in parameterize
of_cls
DONT USE. Will be deleted soon. Use self.of!
datetime_to_parameter(dt)
parameter_to_datetime(p)
datetime_to_parameters(dt)
Given a date-time, will produce a dictionary of of-params combined with the ranged task parameter
parameters_to_datetime(p)
Given a dictionary of parameters, will extract the ranged task parameter value
moving_start(now)
Returns a datetime from which to ensure contiguousness in the case when start is None or unfeasibly far
back.
moving_stop(now)
Returns a datetime till which to ensure contiguousness in the case when stop is None or unfeasibly far
forward.
finite_datetimes(finite_start, finite_stop)
Returns the individual datetimes in interval [finite_start, finite_stop) for which task completeness should
be required, as a sorted list.
requires()
The Tasks that this Task depends on.
A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any
other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method
to return a single Task, a list of Task instances, or a dict whose values are Task instances.
See Task.requires
missing_datetimes(finite_datetimes)
Override in subclasses to do bulk checks.
Returns a sorted list.
This is a conservative base implementation that brutally checks completeness, instance by instance.
Inadvisable as it may be slow.
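The instance-by-instance check described above amounts to one completeness test per datetime. In the sketch below, is_complete is a hypothetical callable standing in for building the task for a datetime and asking whether it is complete; the function is a standalone illustration, not the RangeBase method.

```python
from datetime import datetime


def find_missing(finite_datetimes, is_complete):
    """Return, in sorted order, the datetimes whose task is not complete."""
    # One is_complete call per datetime: simple, but slow for long ranges.
    return sorted(dt for dt in finite_datetimes if not is_complete(dt))


finished = {datetime(2015, 1, 1), datetime(2015, 1, 3)}
days = [datetime(2015, 1, day) for day in (4, 3, 2, 1)]
missing = find_missing(days, lambda dt: dt in finished)
```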
class luigi.tools.range.RangeDailyBase(*args, **kwargs)
Bases: luigi.tools.range.RangeBase
Produces a contiguous completed range of a daily recurring task.
start = DateParameter (defaults to None): beginning date, inclusive. Default: None -
stop = DateParameter (defaults to None): ending date, exclusive. Default: None - work
days_back = IntParameter (defaults to 100): extent to which contiguousness is to be as
days_forward = IntParameter (defaults to 0): extent to which contiguousness is to be a
datetime_to_parameter(dt)
parameter_to_datetime(p)
datetime_to_parameters(dt)
Given a date-time, will produce a dictionary of of-params combined with the ranged task parameter
parameters_to_datetime(p)
Given a dictionary of parameters, will extract the ranged task parameter value
moving_start(now)
Returns a datetime from which to ensure contiguousness in the case when start is None or unfeasibly far
back.
moving_stop(now)
Returns a datetime till which to ensure contiguousness in the case when stop is None or unfeasibly far
forward.
finite_datetimes(finite_start, finite_stop)
Simply returns the points in time that correspond to turn of day.
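How the moving window and the daily points fit together can be sketched with standalone functions. These mirror the descriptions above under the assumption that days_back and days_forward are simple offsets from now; they are not the RangeDailyBase methods themselves.

```python
from datetime import datetime, timedelta


def moving_start(now, days_back=100):
    # Earliest point to check when no explicit start is given.
    return now - timedelta(days=days_back)


def moving_stop(now, days_forward=0):
    # Latest point to check when no explicit stop is given.
    return now + timedelta(days=days_forward)


def finite_datetimes(finite_start, finite_stop):
    """Return each midnight in the half-open interval [finite_start, finite_stop)."""
    # Round the start up to the next midnight if it is not one already.
    day = datetime(finite_start.year, finite_start.month, finite_start.day)
    if day < finite_start:
        day += timedelta(days=1)
    result = []
    while day < finite_stop:
        result.append(day)
        day += timedelta(days=1)
    return result


now = datetime(2015, 1, 4, 12, 30)
days = finite_datetimes(moving_start(now, days_back=3), moving_stop(now))
```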
class luigi.tools.range.RangeHourlyBase(*args, **kwargs)
Bases: luigi.tools.range.RangeBase
Produces a contiguous completed range of an hourly recurring task.
start = DateHourParameter (defaults to None): beginning datehour, inclusive. Default:
stop = DateHourParameter (defaults to None): ending datehour, exclusive. Default: Non
hours_back = IntParameter (defaults to 2400): extent to which contiguousness is to be
hours_forward = IntParameter (defaults to 0): extent to which contiguousness is to be
datetime_to_parameter(dt)
parameter_to_datetime(p)
datetime_to_parameters(dt)
Given a date-time, will produce a dictionary of of-params combined with the ranged task parameter
parameters_to_datetime(p)
Given a dictionary of parameters, will extract the ranged task parameter value
moving_start(now)
Returns a datetime from which to ensure contiguousness in the case when start is None or unfeasibly far
back.
moving_stop(now)
Returns a datetime till which to ensure contiguousness in the case when stop is None or unfeasibly far
forward.
finite_datetimes(finite_start, finite_stop)
Simply returns the points in time that correspond to whole hours.
class luigi.tools.range.RangeByMinutesBase(*args, **kwargs)
Bases: luigi.tools.range.RangeBase
Produces a contiguous completed range of recurring tasks separated by a specified number of minutes.
start = DateMinuteParameter (defaults to None): beginning date-hour-minute, inclusive.
stop = DateMinuteParameter (defaults to None): ending date-hour-minute, exclusive. Def
minutes_back = IntParameter (defaults to 1440): extent to which contiguousness is to b
minutes_forward = IntParameter (defaults to 0): extent to which contiguousness is to b
minutes_interval = IntParameter (defaults to 1): separation between events in minutes.
datetime_to_parameter(dt)
parameter_to_datetime(p)
datetime_to_parameters(dt)
Given a date-time, will produce a dictionary of of-params combined with the ranged task parameter
parameters_to_datetime(p)
Given a dictionary of parameters, will extract the ranged task parameter value
moving_start(now)
Returns a datetime from which to ensure contiguousness in the case when start is None or unfeasibly far
back.
moving_stop(now)
Returns a datetime till which to ensure contiguousness in the case when stop is None or unfeasibly far
forward.
finite_datetimes(finite_start, finite_stop)
Simply returns the points in time that correspond to a whole number of minutes intervals.
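Aligning points to a minutes interval can be sketched as below. The standalone function and its restriction that the interval must divide an hour evenly are assumptions made to keep the alignment simple; treat it as an illustration rather than the RangeByMinutesBase method.

```python
from datetime import datetime, timedelta


def interval_datetimes(finite_start, finite_stop, minutes_interval=1):
    """Return the interval-aligned minutes in [finite_start, finite_stop)."""
    # Assumption for this sketch: the interval divides an hour evenly,
    # so alignment can be done on the minute field alone.
    if not 0 < minutes_interval < 60 or 60 % minutes_interval:
        raise ValueError("minutes_interval must divide 60")
    aligned_minute = finite_start.minute - finite_start.minute % minutes_interval
    t = finite_start.replace(minute=aligned_minute, second=0, microsecond=0)
    step = timedelta(minutes=minutes_interval)
    result = []
    while t < finite_stop:
        # Skip the aligned point if it falls before the requested start.
        if t >= finite_start:
            result.append(t)
        t += step
    return result


points = interval_datetimes(
    datetime(2015, 1, 1, 0, 7), datetime(2015, 1, 1, 0, 40), minutes_interval=15)
```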
luigi.tools.range.most_common(items)
Wanted functionality from Counters (new in Python 2.7).
luigi.tools.range.infer_bulk_complete_from_fs(datetimes, datetime_to_task, datetime_to_re)
Efficiently determines missing datetimes by filesystem listing.
The current implementation works for the common case of a task writing output to a FileSystemTarget
whose path is built using strftime with format like ‘. . . %Y. . . %m. . . %d. . . %H. . . ’, without custom
complete() or exists().
(Eventually Luigi could have ranges of completion as first-class citizens. Then this listing business could be
factored away/be provided for explicitly in target API or some kind of a history server.)
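The idea of inferring completeness from a filesystem listing can be sketched without any filesystem at all: build the expected path for each datetime with strftime and compare it against a listing obtained once. The function name, the path format and the listing below are hypothetical, and the real function takes different arguments, as its signature shows.

```python
from datetime import datetime


def infer_missing_from_listing(datetimes, path_format, listed_paths):
    """Return the datetimes whose strftime-built output path is not listed.

    path_format and listed_paths are hypothetical: a strftime pattern for the
    task's output path, and the result of one filesystem listing.
    """
    listed = set(listed_paths)
    # One set lookup per datetime replaces one exists() call per task.
    return [dt for dt in datetimes if dt.strftime(path_format) not in listed]


hours = [datetime(2015, 1, 1, hour) for hour in range(4)]
listing = ["/data/2015/01/01/00.tsv", "/data/2015/01/01/02.tsv"]
missing = infer_missing_from_listing(hours, "/data/%Y/%m/%d/%H.tsv", listing)
```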
class luigi.tools.range.RangeMonthly(*args, **kwargs)
Bases: luigi.tools.range.RangeBase
Produces a contiguous completed range of a monthly recurring task.
Unlike the Range* classes with shorter intervals, this class does not perform bulk optimisation. It is assumed
that the number of months is low enough not to motivate the increased complexity. Hence, there is no class
RangeMonthlyBase.
start = MonthParameter (defaults to None): beginning month, inclusive. Default: None
stop = MonthParameter (defaults to None): ending month, exclusive. Default: None - wo
months_back = IntParameter (defaults to 13): extent to which contiguousness is to be a
months_forward = IntParameter (defaults to 0): extent to which contiguousness is to be
datetime_to_parameter(dt)
parameter_to_datetime(p)
datetime_to_parameters(dt)
Given a date-time, will produce a dictionary of of-params combined with the ranged task parameter
parameters_to_datetime(p)
Given a dictionary of parameters, will extract the ranged task parameter value
moving_start(now)
Returns a datetime from which to ensure contiguousness in the case when start is None or unfeasibly far
back.
moving_stop(now)
Returns a datetime till which to ensure contiguousness in the case when stop is None or unfeasibly far
forward.
finite_datetimes(finite_start, finite_stop)
Simply returns the points in time that correspond to turn of month.
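A minimal sketch of what "points in time that correspond to turn of month" could look like, assuming an inclusive start and an exclusive stop as described for the start and stop parameters. The function name month_starts is hypothetical and this is not the library's implementation.

```python
import datetime


def month_starts(finite_start, finite_stop):
    """Return first-of-month datetimes in [finite_start, finite_stop)."""
    result = []
    current = datetime.datetime(finite_start.year, finite_start.month, 1)
    while current < finite_stop:
        if current >= finite_start:
            result.append(current)
        # Advance by one calendar month, rolling over the year in December.
        if current.month == 12:
            current = datetime.datetime(current.year + 1, 1, 1)
        else:
            current = datetime.datetime(current.year, current.month + 1, 1)
    return result


points = month_starts(datetime.datetime(2013, 11, 15),
                      datetime.datetime(2014, 3, 1))
print(points)
```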
class luigi.tools.range.RangeDaily(*args, **kwargs)
Bases: luigi.tools.range.RangeDailyBase
Efficiently produces a contiguous completed range of a daily recurring task that takes a single
DateParameter.
Falls back to infer it from output filesystem listing to facilitate the common case usage.
Convenient to use even from command line, like:
missing_datetimes(finite_datetimes)
Override in subclasses to do bulk checks.
Returns a sorted list.
This is a conservative base implementation that brutally checks completeness, instance by instance.
Inadvisable as it may be slow.
class luigi.tools.range.RangeHourly(*args, **kwargs)
Bases: luigi.tools.range.RangeHourlyBase
Efficiently produces a contiguous completed range of an hourly recurring task that takes a single
DateHourParameter.
Benefits from bulk_complete information to efficiently cover gaps.
Falls back to infer it from output filesystem listing to facilitate the common case usage.
Convenient to use even from command line, like:
missing_datetimes(finite_datetimes)
Override in subclasses to do bulk checks.
Returns a sorted list.
This is a conservative base implementation that brutally checks completeness, instance by instance.
Inadvisable as it may be slow.
class luigi.tools.range.RangeByMinutes(*args, **kwargs)
Bases: luigi.tools.range.RangeByMinutesBase
Efficiently produces a contiguous completed range of a recurring task every interval minutes that takes a single
DateMinuteParameter.
Benefits from bulk_complete information to efficiently cover gaps.
Falls back to infer it from output filesystem listing to facilitate the common case usage.
Convenient to use even from command line, like:
missing_datetimes(finite_datetimes)
Override in subclasses to do bulk checks.
Returns a sorted list.
This is a conservative base implementation that brutally checks completeness, instance by instance.
Inadvisable as it may be slow.
Module contents
Sort of a standard library for doing stuff with Tasks at a somewhat abstract level.
Submodule introduced to stop growing util.py unstructured.
9.1.2 Submodules
luigi.batch_notifier module
Library for sending batch notifications from the Luigi scheduler. This module is internal to Luigi and not designed for
use in other contexts.
class luigi.batch_notifier.batch_email(*args, **kwargs)
Bases: luigi.task.Config
email_interval = IntParameter (defaults to 60): Number of minutes between e-mail sends
luigi.cmdline module
luigi.cmdline_parser module
This module contains luigi internal parsing logic. Things exposed here should be considered internal to luigi.
class luigi.cmdline_parser.CmdlineParser(cmdline_args)
Bases: object
Helper for parsing command line arguments and used as part of the context when instantiating task objects.
Normal luigi users should just use luigi.run().
Initialize cmd line args
classmethod get_instance()
Singleton getter
classmethod global_instance(**kwds)
Meant to be used as a context manager.
get_task_obj()
Get the task object
luigi.date_interval module
luigi.date_interval provides convenient classes for date algebra. Everything uses ISO 8601 notation, i.e.
YYYY-MM-DD for dates, etc. There is a corresponding luigi.parameter.DateIntervalParameter that
you can use to parse date intervals.
Example:
class MyTask(luigi.Task):
    date_interval = luigi.DateIntervalParameter()
Now, you can launch this from the command line using --date-interval 2014-05-10 or
--date-interval 2014-W26 (using week notation) or --date-interval 2014 (for a year) and
some other notations.
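To make the three notations named above concrete, the following sketch tells them apart with regular expressions. It is only an illustration: the classify helper and its labels are hypothetical, it covers just the date, week and year forms mentioned here, and it does not reflect how DateIntervalParameter actually parses its input.

```python
import re

# Hypothetical labels for the notations mentioned in the text above.
PATTERNS = [
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),  # e.g. 2014-05-10
    ("week", re.compile(r"^\d{4}-W\d{2}$")),       # e.g. 2014-W26
    ("year", re.compile(r"^\d{4}$")),              # e.g. 2014
]


def classify(s):
    """Return the label of the first notation that matches s, or None."""
    for name, pattern in PATTERNS:
        if pattern.match(s):
            return name
    return None


labels = [classify(s) for s in ("2014-05-10", "2014-W26", "2014", "foo")]
print(labels)
```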
class luigi.date_interval.DateInterval(date_a, date_b)
Bases: object
The DateInterval is the base class with subclasses Date, Week, Month, Year, and Custom. Note that
the DateInterval is abstract and should not be used directly: use Custom for arbitrary date intervals. The
base class features a couple of convenience methods, such as next() which returns the next consecutive date
interval.
Example:
x = luigi.date_interval.Week(2013, 52)
print(x.prev())
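For intuition about what moving to the previous week means for Week(2013, 52), the standard library can compute the underlying dates. This sketch assumes ISO 8601 week numbering and Python 3.6 or later (for the %G, %V and %u directives); iso_week_start is a hypothetical helper, not part of luigi.date_interval.

```python
import datetime


def iso_week_start(year, week):
    """Return the Monday that starts ISO week `week` of ISO year `year`."""
    text = "{}-W{:02d}-1".format(year, week)  # trailing 1 = Monday
    return datetime.datetime.strptime(text, "%G-W%V-%u").date()


start = iso_week_start(2013, 52)
prev_start = start - datetime.timedelta(weeks=1)  # the preceding week
prev_year, prev_week = tuple(prev_start.isocalendar())[:2]
print(start, prev_start, prev_year, prev_week)
```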
Custom date interval (does not implement prev and next methods)
Actually, ISO 8601 specifies <start>/<end> as the time interval format. Not sure if this goes for date intervals
as well. In any case, slashes will most likely cause problems with paths etc.
to_string()
classmethod parse(s)
Abstract class method.
For instance, Year.parse("2014") returns a Year(2014).
luigi.db_task_history module
Provides a database backend to the central scheduler. This lets you see historical runs. See Enabling Task History for
information about how to turn on the task history feature.
class luigi.db_task_history.DbTaskHistory
Bases: luigi.task_history.TaskHistory
Task History that writes to a database using sqlalchemy. Also has methods for useful db queries.
CURRENT_SOURCE_VERSION = 1
task_scheduled(task)
task_finished(task, successful)
task_started(task, worker_host)
find_all_by_parameters(task_name, session=None, **task_params)
Find tasks with the given task_name and the same parameters as the kwargs.
find_all_by_name(task_name, session=None)
Find all tasks with the given task_name.
find_latest_runs(session=None)
Return tasks that have been updated in the past 24 hours.
find_all_runs(session=None)
Return all tasks that have been updated.
find_all_events(session=None)
Return all running/failed/done events.
find_task_by_id(id, session=None)
Find task with the given record ID.
class luigi.db_task_history.TaskParameter(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table to track luigi.Parameter()s of a Task.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any
mapped columns or relationships.
task_id
name
value
class luigi.db_task_history.TaskEvent(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table to track when a task is scheduled, starts, finishes, and fails.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any
mapped columns or relationships.
id
task_id
event_name
ts
class luigi.db_task_history.TaskRecord(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Base table to track information about a luigi.Task.
References to other tables are available through task.events, task.parameters, etc.
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any
mapped columns or relationships.
id
task_id
name
host
parameters
events
luigi.event module
Definitions needed for events. See Events and callbacks for info on how to use it.
class luigi.event.Event
Bases: object
DEPENDENCY_DISCOVERED = 'event.core.dependency.discovered'
DEPENDENCY_MISSING = 'event.core.dependency.missing'
DEPENDENCY_PRESENT = 'event.core.dependency.present'
BROKEN_TASK = 'event.core.task.broken'
START = 'event.core.start'
PROGRESS = 'event.core.progress'
This event can be fired by the task itself while running. The purpose is for the task to report progress,
metadata or any generic info so that event handler listening for this can keep track of the progress of
running task.
FAILURE = 'event.core.failure'
SUCCESS = 'event.core.success'
PROCESSING_TIME = 'event.core.processing_time'
TIMEOUT = 'event.core.timeout'
PROCESS_FAILURE = 'event.core.process_failure'
luigi.execution_summary module
This module provides the function summary() that is used for printing an execution summary at the end of luigi
invocations.
class luigi.execution_summary.execution_summary(*args, **kwargs)
Bases: luigi.task.Config
summary_length = IntParameter (defaults to 5)
class luigi.execution_summary.LuigiStatusCode
Bases: enum.Enum
All possible status codes for the attribute status in LuigiRunResult when the argument
detailed_summary=True in luigi.run() / luigi.build. Here are the codes and what they mean:
luigi.format module
class luigi.format.FileWrapper(file_object)
Bases: object
Wrap file in a “real” object so stuff can be added to it after creation.
class luigi.format.InputPipeProcessWrapper(command, input_pipe=None)
Bases: object
Initializes an InputPipeProcessWrapper instance.
Parameters command – a subprocess.Popen instance with stdin=input_pipe and
stdout=subprocess.PIPE. Alternatively, just its args argument as a convenience.
create_subprocess(command)
http://www.chiark.greenend.org.uk/ucgi/~cjwatson/blosxom/2009-07-02-python-sigpipe.html
close()
readable()
writable()
seekable()
class luigi.format.OutputPipeProcessWrapper(command, output_pipe=None)
Bases: object
WRITES_BEFORE_FLUSH = 10000
write(*args, **kwargs)
writeLine(line)
close()
abort()
readable()
writable()
seekable()
class luigi.format.BaseWrapper(stream, *args, **kwargs)
Bases: object
class luigi.format.NewlineWrapper(stream, newline=None)
Bases: luigi.format.BaseWrapper
read(n=-1)
writelines(lines)
write(b)
class luigi.format.MixedUnicodeBytesWrapper(stream, encoding=None)
Bases: luigi.format.BaseWrapper
write(b)
writelines(lines)
class luigi.format.Format
Bases: object
Interface for format specifications.
classmethod pipe_reader(input_pipe)
classmethod pipe_writer(output_pipe)
class luigi.format.ChainFormat(*args, **kwargs)
Bases: luigi.format.Format
pipe_reader(input_pipe)
pipe_writer(output_pipe)
class luigi.format.TextWrapper(stream, *args, **kwargs)
Bases: _io.TextIOWrapper
class luigi.format.NopFormat
Bases: luigi.format.Format
pipe_reader(input_pipe)
pipe_writer(output_pipe)
class luigi.format.WrappedFormat(*args, **kwargs)
Bases: luigi.format.Format
pipe_reader(input_pipe)
pipe_writer(output_pipe)
class luigi.format.TextFormat(*args, **kwargs)
Bases: luigi.format.WrappedFormat
input = 'unicode'
output = 'bytes'
wrapper_cls
alias of TextWrapper
class luigi.format.MixedUnicodeBytesFormat(*args, **kwargs)
Bases: luigi.format.WrappedFormat
output = 'bytes'
wrapper_cls
alias of MixedUnicodeBytesWrapper
class luigi.format.NewlineFormat(*args, **kwargs)
Bases: luigi.format.WrappedFormat
input = 'bytes'
output = 'bytes'
wrapper_cls
alias of NewlineWrapper
class luigi.format.GzipFormat(compression_level=None)
Bases: luigi.format.Format
input = 'bytes'
output = 'bytes'
pipe_reader(input_pipe)
pipe_writer(output_pipe)
class luigi.format.Bzip2Format
Bases: luigi.format.Format
input = 'bytes'
output = 'bytes'
pipe_reader(input_pipe)
pipe_writer(output_pipe)
luigi.format.get_default_format()
luigi.freezing module
luigi.interface module
This module contains the bindings for command line integration and dynamic loading of tasks.
If you don’t want to run luigi from the command line, you may use the methods defined in this module to run luigi
programmatically.
class luigi.interface.core(*args, **kwargs)
Bases: luigi.task.Config
Keeps track of a bunch of environment params.
Uses the internal luigi parameter mechanism. The nice thing is that we can instantiate this class and get an
object with all the environment variables set. This is arguably a bit of a hack.
use_cmdline_section = False
local_scheduler = BoolParameter (defaults to False): Use an in-memory central schedule
scheduler_host = Parameter (defaults to localhost): Hostname of machine running remote
scheduler_port = IntParameter (defaults to 8082): Port of remote scheduler api process
scheduler_url = Parameter (defaults to ): Full path to remote scheduler
lock_size = IntParameter (defaults to 1): Maximum number of workers running the same c
no_lock = BoolParameter (defaults to False): Ignore if similar process is already runn
lock_pid_dir = Parameter (defaults to /tmp/luigi): Directory to store the pid file
take_lock = BoolParameter (defaults to False): Signal other processes to stop getting
workers = IntParameter (defaults to 1): Maximum number of parallel tasks to run
logging_conf_file = Parameter (defaults to ): Configuration file for logging
log_level = ChoiceParameter (defaults to DEBUG): Default log level to use when logging_
module = Parameter (defaults to ): Used for dynamic loading of modules
parallel_scheduling = BoolParameter (defaults to False): Use multiprocessing to do sch
parallel_scheduling_processes = IntParameter (defaults to 0): The number of processes
assistant = BoolParameter (defaults to False): Run any task from the scheduler.
help = BoolParameter (defaults to False): Show most common flags and all task-specific
help_all = BoolParameter (defaults to False): Show all command line flags
exception luigi.interface.PidLockAlreadyTakenExit
Bases: exceptions.SystemExit
The exception thrown by luigi.run(), when the lock file is inaccessible
luigi.interface.run(*args, **kwargs)
Please don’t use. Instead use the luigi binary.
Run from cmdline using argparse.
Parameters use_dynamic_argparse – Deprecated and ignored
luigi.interface.build(tasks, worker_scheduler_factory=None, detailed_summary=False,
**env_params)
Run internally, bypassing the cmdline parsing.
Useful if you have some luigi code that you want to run internally. Example:
One notable difference is that build defaults to not using the identical process lock. Otherwise, build would only
be callable once from each process.
Parameters
• tasks –
• worker_scheduler_factory –
• env_params –
Returns True if there were no scheduling errors, even if tasks may fail.
luigi.local_target module
LocalTarget provides a concrete implementation of a Target class that uses files on the local file system
class luigi.local_target.atomic_file(path)
Bases: luigi.target.AtomicLocalFile
Simple class that writes to a temp file and moves it on close(). Also cleans up the temp file if close is not invoked.
move_to_final_destination()
generate_tmp_path(path)
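The write-to-temp-then-move pattern described for atomic_file can be sketched with the standard library alone. AtomicWriter below is a hypothetical stand-in written for this example, not luigi's class; it assumes the temp file lives in the destination directory so that os.replace stays on one filesystem.

```python
import os
import tempfile


class AtomicWriter:
    """Write to a temp file and move it into place on a successful close."""

    def __init__(self, path):
        self.path = path
        directory = os.path.dirname(os.path.abspath(path))
        fd, self.tmp_path = tempfile.mkstemp(dir=directory)
        self._file = os.fdopen(fd, "w")

    def write(self, data):
        self._file.write(data)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._file.close()
        if exc_type is None:
            os.replace(self.tmp_path, self.path)  # publish the finished file
        else:
            os.remove(self.tmp_path)  # clean up the partial output
        return False


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "out.txt")
with AtomicWriter(target) as f:
    f.write("hello")
    visible_during_write = os.path.exists(target)
with open(target) as fh:
    content = fh.read()

try:
    with AtomicWriter(os.path.join(workdir, "failed.txt")) as f:
        f.write("partial")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
leftovers = os.listdir(workdir)
```

Readers of the destination path therefore see either no file or a complete file, never a half-written one.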
class luigi.local_target.LocalFileSystem
Bases: luigi.target.FileSystem
Wrapper for access to file system operations.
Work in progress - add things as needed.
copy(old_path, new_path, raise_if_exists=False)
Copy a file or a directory with contents. Currently, LocalFileSystem and MockFileSystem support only
single file copying but S3Client copies either a file or a directory as required.
exists(path)
Return True if file or directory at path exists, False otherwise
Parameters path (str) – a path within the FileSystem to check for existence.
mkdir(path, parents=True, raise_if_exists=False)
Create directory at location path
Creates the directory at path and implicitly creates parent directories if they do not already exist.
Parameters
• path (str) – a path within the FileSystem to create as a directory.
• parents (bool) – Create parent directories when necessary. When parents=False and
the parent directory doesn’t exist, raise luigi.target.MissingParentDirectory
• raise_if_exists (bool) – raise luigi.target.FileAlreadyExists if the folder already
exists.
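The interaction of parents and raise_if_exists can be sketched on top of the os module. The two exception classes below are local stand-ins named after the luigi.target exceptions mentioned above, and the function is an illustration of the documented behaviour rather than the library's code.

```python
import os
import tempfile


class FileAlreadyExists(Exception):
    """Stand-in for luigi.target.FileAlreadyExists."""


class MissingParentDirectory(Exception):
    """Stand-in for luigi.target.MissingParentDirectory."""


def mkdir(path, parents=True, raise_if_exists=False):
    if os.path.exists(path):
        if raise_if_exists:
            raise FileAlreadyExists(path)
        return
    if parents:
        os.makedirs(path)  # creates intermediate directories as needed
    else:
        if not os.path.exists(os.path.dirname(path)):
            raise MissingParentDirectory(path)
        os.mkdir(path)


base = tempfile.mkdtemp()
nested = os.path.join(base, "a", "b")
try:
    mkdir(nested, parents=False)
    missing_parent_raised = False
except MissingParentDirectory:
    missing_parent_raised = True

mkdir(nested)  # parents=True creates "a" and "b"
try:
    mkdir(nested, raise_if_exists=True)
    already_exists_raised = False
except FileAlreadyExists:
    already_exists_raised = True
```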
isdir(path)
Return True if the location at path is a directory. If not, return False.
Parameters path (str) – a path within the FileSystem to check as a directory.
Note: This method is optional, not all FileSystem subclasses implement it.
listdir(path)
Return a list of files rooted in path.
This returns an iterable of the files rooted at path. This is intended to be a recursive listing.
Parameters path (str) – a path within the FileSystem to list.
Note: This method is optional, not all FileSystem subclasses implement it.
remove(path, recursive=True)
Remove file or directory at location path
Parameters
• path (str) – a path within the FileSystem to remove.
• recursive (bool) – if the path is a directory, recursively remove the directory and all
of its descendants. Defaults to True.
move(old_path, new_path, raise_if_exists=False)
Move file atomically. If source and destination are located on different filesystems, atomicity is
approximated but cannot be guaranteed.
rename_dont_move(path, dest)
Rename path to dest, but don’t move it into the dest folder (if it is a folder). This method is just a
wrapper around the move method of LocalTarget.
class luigi.local_target.LocalTarget(path=None, format=None, is_tmp=False)
Bases: luigi.target.FileSystemTarget
fs = <luigi.local_target.LocalFileSystem object>
makedirs()
Create all parent folders if they do not exist.
open(mode='r')
Open the FileSystem target.
This method returns a file-like object which can either be read from or written to depending on the specified
mode.
Parameters mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas
w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
Using b is not supported; initialize with format=Nop instead.
move(new_path, raise_if_exists=False)
move_dir(new_path)
remove()
Remove the resource at the path specified by this FileSystemTarget.
This method is implemented by using fs.
copy(new_path, raise_if_exists=False)
fn
luigi.lock module
Locking functionality when launching things from the command line. Uses a pidfile. This prevents multiple identical
workflows from being launched simultaneously.
luigi.lock.getpcmd(pid)
Returns command of process.
Parameters pid –
luigi.lock.get_info(pid_dir, my_pid=None)
luigi.lock.acquire_for(pid_dir, num_available=1, kill_signal=None)
Makes sure the process is only run once at the same time with the same name.
Notice that since we check the process name, different parameters to the same command can spawn multiple
processes at the same time, i.e. running “/usr/bin/my_process” does not prevent anyone from launching
“/usr/bin/my_process --foo bar”.
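One way to see why differing arguments do not collide is to derive the pid file name from the full command line. The sketch below does that with a hash; the helper pid_file_for, the use of SHA-256 and the file naming are all assumptions made for illustration and may differ from what luigi.lock actually does.

```python
import hashlib
import os


def pid_file_for(pid_dir, command):
    """Map a full command line to a pid file path (hypothetical scheme)."""
    digest = hashlib.sha256(command.encode("utf8")).hexdigest()
    return os.path.join(pid_dir, digest + ".pid")


plain = pid_file_for("/tmp/luigi", "/usr/bin/my_process")
with_args = pid_file_for("/tmp/luigi", "/usr/bin/my_process --foo bar")
same_again = pid_file_for("/tmp/luigi", "/usr/bin/my_process")
print(plain != with_args, plain == same_again)
```

Under this scheme, two identical invocations contend for the same pid file, while an invocation with different arguments gets its own.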
luigi.metrics module
class luigi.metrics.MetricsCollectors
Bases: enum.Enum
default = 1
none = 1
datadog = 2
prometheus = 3
get = <bound method EnumMeta.get of <enum 'MetricsCollectors'>>
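Because default and none share the value 1, standard Python enum semantics make the second name an alias of the first. The sketch below reproduces the member values listed above in a stand-alone enum (the class name Collectors is made up for this example) to show the consequence.

```python
import enum


class Collectors(enum.Enum):
    default = 1
    none = 1       # same value as default, so this is an alias
    datadog = 2
    prometheus = 3


# Iteration yields canonical members only; aliases are skipped.
members = [m.name for m in Collectors]
alias_is_same = Collectors.none is Collectors.default
print(members, alias_is_same)
```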
class luigi.metrics.MetricsCollector
Bases: object
Abstractable MetricsCollector base class that can be replaced by a tool-specific implementation.
handle_task_started(task)
handle_task_failed(task)
handle_task_disabled(task, config)
handle_task_done(task)
generate_latest()
configure_http_handler(http_handler)
class luigi.metrics.NoMetricsCollector
Bases: luigi.metrics.MetricsCollector
Empty MetricsCollector when no collector is being used
handle_task_started(task)
handle_task_failed(task)
handle_task_disabled(task, config)
handle_task_done(task)
luigi.mock module
This module provides a class MockTarget, an implementation of Target. MockTarget contains all data
in-memory. The main purpose is unit testing workflows without writing to disk.
class luigi.mock.MockFileSystem
Bases: luigi.target.FileSystem
MockFileSystem inspects/modifies _data to simulate file system operations.
copy(path, dest, raise_if_exists=False)
Copies the contents of a single file path to dest
get_all_data()
get_data(fn)
exists(path)
Return True if file or directory at path exists, False otherwise
Parameters path (str) – a path within the FileSystem to check for existence.
remove(path, recursive=True, skip_trash=True)
Removes the given mockfile. skip_trash doesn’t have any meaning.
move(path, dest, raise_if_exists=False)
Moves a single file from path to dest
listdir(path)
listdir does a prefix match of self.get_all_data(), but doesn’t yet support globs.
isdir(path)
Return True if the location at path is a directory. If not, return False.
Parameters path (str) – a path within the FileSystem to check as a directory.
Note: This method is optional, not all FileSystem subclasses implement it.
mkdir(path, parents=True, raise_if_exists=False)
mkdir is a noop.
clear()
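The behaviour described for MockFileSystem (a dictionary of paths, with listdir as a prefix match) can be sketched in a few lines. InMemoryFileSystem is a hypothetical stand-in for illustration; its put method and internal layout are assumptions, not the luigi.mock API.

```python
class InMemoryFileSystem:
    """Toy file system that keeps file contents in a dict keyed by path."""

    def __init__(self):
        self._data = {}

    def put(self, path, contents):
        self._data[path] = contents

    def exists(self, path):
        return path in self._data

    def listdir(self, path):
        # Prefix match over all known paths; no glob support.
        return sorted(p for p in self._data if p.startswith(path))

    def move(self, path, dest):
        self._data[dest] = self._data.pop(path)

    def remove(self, path):
        self._data.pop(path, None)

    def clear(self):
        self._data.clear()


fs = InMemoryFileSystem()
fs.put("/tmp/a/1.txt", b"one")
fs.put("/tmp/a/2.txt", b"two")
fs.put("/tmp/b/3.txt", b"three")
listed = fs.listdir("/tmp/a")
fs.move("/tmp/a/1.txt", "/tmp/c/1.txt")
moved = fs.exists("/tmp/c/1.txt") and not fs.exists("/tmp/a/1.txt")
fs.clear()
empty_after_clear = fs.listdir("/") == []
```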
class luigi.mock.MockTarget(fn, is_tmp=None, mirror_on_stderr=False, format=None)
Bases: luigi.target.FileSystemTarget
fs = <luigi.mock.MockFileSystem object>
exists()
Returns True if the path for this FileSystemTarget exists; False otherwise.
This method is implemented by using fs.
move(path, raise_if_exists=False)
Call MockFileSystem’s move command
rename(*args, **kwargs)
Call move to rename self
open(mode='r')
Open the FileSystem target.
This method returns a file-like object which can either be read from or written to depending on the specified
mode.
Parameters mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas
w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
Using b is not supported; initialize with format=Nop instead.
luigi.notifications module
[email]
[email protected]
And then check your email inbox to see if you got an error email or any other kind of notifications that you
expected.
raise_in_complete = BoolParameter (defaults to False): If true, fail in complete() ins
run()
The task run method, to be overridden in a subclass.
See Task.run
complete()
If the task has any outputs, return True if all outputs exist. Otherwise, return False.
However, you may freely override this method with custom logic.
class luigi.notifications.email(*args, **kwargs)
Bases: luigi.task.Config
force_send = BoolParameter (defaults to False): Send e-mail even from a tty
format = ChoiceParameter (defaults to plain): Format type for sent e-mails Choices: {
method = ChoiceParameter (defaults to smtp): Method for sending e-mail Choices: {ses,
prefix = Parameter (defaults to ): Prefix for subject lines of all e-mails
receiver = Parameter (defaults to ): Address to send error e-mails to
sender = Parameter (defaults to luigi-client@build-10934781-project-12134-luigi): Addr
class luigi.notifications.smtp(*args, **kwargs)
Bases: luigi.task.Config
host = Parameter (defaults to localhost): Hostname of smtp server
local_hostname = Parameter (defaults to None): If specified, local_hostname is used as
no_tls = BoolParameter (defaults to False): Do not use TLS in SMTP connections
password = Parameter (defaults to None): Password for the SMTP server login
port = IntParameter (defaults to 0): Port number for smtp server
ssl = BoolParameter (defaults to False): Use SSL for the SMTP connection.
timeout = FloatParameter (defaults to 10.0): Number of seconds before timing out the s
username = Parameter (defaults to None): Username used to log in to the SMTP host
class luigi.notifications.sendgrid(*args, **kwargs)
Bases: luigi.task.Config
apikey = Parameter: API key for SendGrid login
luigi.notifications.generate_email(sender, subject, message, recipients, image_png)
luigi.notifications.wrap_traceback(traceback)
For internal use only (until further notice)
luigi.notifications.send_email_smtp(sender, subject, message, recipients, image_png)
luigi.notifications.send_email_ses(sender, subject, message, recipients, image_png)
Sends notification through AWS SES.
Does not handle access keys. Use either 1/ configuration file 2/ EC2 instance profile
See also https://boto3.readthedocs.io/en/latest/guide/configuration.html.
luigi.notifications.send_email_sendgrid(sender, subject, message, recipients, image_png)
luigi.notifications.send_email_sns(sender, subject, message, topic_ARN, image_png)
Sends notification through AWS SNS. Takes Topic ARN from recipients.
Does not handle access keys. Use either 1/ configuration file 2/ EC2 instance profile
luigi.parameter module
Parameters are one of the core concepts of Luigi. All Parameters sit on Task classes. See Parameter for more info
on how to define parameters.
class luigi.parameter.ParameterVisibility
Bases: enum.IntEnum
Possible values for the parameter visibility option. Public is the default. See Parameters for more info.
PUBLIC = 0
HIDDEN = 1
PRIVATE = 2
has_value = <bound method EnumMeta.has_value of <enum 'ParameterVisibility'>>
serialize()
exception luigi.parameter.ParameterException
Bases: exceptions.Exception
Base exception.
exception luigi.parameter.MissingParameterException
Bases: luigi.parameter.ParameterException
Exception signifying that there was a missing Parameter.
exception luigi.parameter.UnknownParameterException
Bases: luigi.parameter.ParameterException
Exception signifying that an unknown Parameter was supplied.
exception luigi.parameter.DuplicateParameterException
Bases: luigi.parameter.ParameterException
Exception signifying that a Parameter was specified multiple times.
class MyTask(luigi.Task):
    foo = luigi.Parameter()

class RequiringTask(luigi.Task):
    def requires(self):
        return MyTask(foo="hello")

    def run(self):
        print(self.requires().foo)  # prints "hello"
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC
has_task_value(task_name, param_name)
task_value(task_name, param_name)
parse(x)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
Returns the parsed value.
serialize(x)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
normalize(x)
Given a parsed parameter value, normalizes it.
The value can either be the result of parse(), the default value or arguments passed into the task’s constructor
by instantiation.
This is very implementation defined, but can be used to validate/clamp valid values. For example, if you
wanted to only accept even integers, and “correct” odd values to the nearest even integer below, you can
implement normalize as x // 2 * 2.
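The even-integer example can be checked directly. The function below is a plain-Python sketch of such a normalize body (normalize_even is a name chosen for this example); note that floor division rounds towards negative infinity, so odd values move down to the next even integer.

```python
def normalize_even(x):
    """Clamp an integer to the even integer at or below it."""
    return x // 2 * 2


values = [normalize_even(x) for x in (4, 5, 7, -3)]
print(values)
```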
next_in_enumeration(_value)
If your Parameter type has an enumerable ordering of values, you can choose to override this method.
This method is used by the luigi.execution_summary module for pretty printing purposes, enabling it to
pretty print tasks like MyTask(num=1), MyTask(num=2), MyTask(num=3) as MyTask(num=1..3).
Parameters value – The value
Returns The next value, like “value + 1”. Or None if there’s no enumerable ordering.
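To show how a "next value" function enables the 1..3 style of summary, here is a small sketch that collapses consecutive integers into ranges. The summarize helper and its output format are hypothetical and only approximate what an execution summary might print.

```python
def next_in_enumeration(value):
    """Next value in the ordering; integers simply increment."""
    return value + 1


def summarize(values):
    """Collapse runs of consecutive values into 'a..b' strings."""
    values = sorted(values)
    if not values:
        return ""
    ranges = []
    start = prev = values[0]
    for v in values[1:]:
        if v == next_in_enumeration(prev):
            prev = v  # still inside the current run
            continue
        ranges.append((start, prev))
        start = prev = v
    ranges.append((start, prev))
    return ", ".join(str(a) if a == b else "{}..{}".format(a, b)
                     for a, b in ranges)


summary = summarize([1, 2, 3, 7])
print(summary)
```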
class luigi.parameter.OptionalParameter(default=<object object>, is_global=False,
significant=True, description=None,
config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
A Parameter that treats empty string as None
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, i.e. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a config
file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC
serialize(x)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
parse(x)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
Returns the parsed value.
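The "empty string as None" behaviour can be pictured as a pair of small functions. These are stand-alone sketches with made-up names, intended to illustrate the round trip rather than to reproduce OptionalParameter's methods.

```python
def parse_optional(x):
    """Treat an empty string as None; pass other strings through."""
    return x or None


def serialize_optional(x):
    """Opposite of parse_optional: None becomes the empty string."""
    return "" if x is None else str(x)


parsed_empty = parse_optional("")
parsed_text = parse_optional("abc")
round_trip = parse_optional(serialize_optional(None))
print(parsed_empty, parsed_text, round_trip)
```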
class luigi.parameter.DateParameter(interval=1, start=None, **kwargs)
Bases: luigi.parameter._DateParameterBase
Parameter whose value is a date.
A DateParameter is a Date string formatted YYYY-MM-DD. For example, 2013-07-10 specifies July 10, 2013.
About 90% of the time, DateParameters are interpolated into file system paths or the like. Here is a gentle
reminder of how to interpolate date parameters into strings:
class MyTask(luigi.Task):
    date = luigi.DateParameter()

    def run(self):
        templated_path = "/my/path/to/my/dataset/{date:%Y/%m/%d}/"
        instantiated_path = templated_path.format(date=self.date)
        # print(instantiated_path) --> /my/path/to/my/dataset/2016/06/09/
        # ... use instantiated_path ...
To set this parameter to default to the current day, you can write code like this:
import datetime

class MyTask(luigi.Task):
    date = luigi.DateParameter(default=datetime.date.today())
date_format = '%Y-%m-%d'
next_in_enumeration(value)
If your Parameter type has an enumerable ordering of values, you can choose to override this method.
This method is used by the luigi.execution_summary module for pretty printing purposes, enabling it to
pretty print tasks like MyTask(num=1), MyTask(num=2), MyTask(num=3) as MyTask(num=1..3).
Parameters value – The value
Returns The next value, like “value + 1”. Or None if there’s no enumerable ordering.
normalize(value)
Given a parsed parameter value, normalizes it.
The value can either be the result of parse(), the default value or arguments passed into the task’s constructor
by instantiation.
This is very implementation defined, but can be used to validate/clamp valid values. For example, if you
wanted to only accept even integers, and “correct” odd values to the nearest even integer below, you can
implement normalize as x // 2 * 2.
class luigi.parameter.MonthParameter(interval=1, start=None, **kwargs)
Bases: luigi.parameter.DateParameter
Parameter whose value is a date, specified to the month (day of date is “rounded” to first of the month).
A MonthParameter is a Date string formatted YYYY-MM. For example, 2013-07 specifies July of 2013. Task
objects constructed from code accept date (ignoring the day value) or Month.
date_format = '%Y-%m'
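The effect of the '%Y-%m' format, including the rounding of the day to the first of the month, can be reproduced with the datetime module. This is a stand-alone sketch using only the format string shown above; it does not call MonthParameter itself.

```python
import datetime

DATE_FORMAT = "%Y-%m"  # same format string as the date_format attribute

# Parsing a month string yields the first day of that month.
parsed = datetime.datetime.strptime("2013-07", DATE_FORMAT).date()

# A full date passed from code can be rounded the same way.
rounded = datetime.date(2013, 7, 19).replace(day=1)
serialized = rounded.strftime(DATE_FORMAT)
print(parsed, rounded, serialized)
```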
next_in_enumeration(value)
If your Parameter type has an enumerable ordering of values, you can choose to override this method.
This method is used by the luigi.execution_summary module for pretty printing purposes, enabling it to
pretty print tasks like MyTask(num=1), MyTask(num=2), MyTask(num=3) as MyTask(num=1..3).
Parameters value – The value
Returns The next value, like “value + 1”. Or None if there’s no enumerable ordering.
normalize(value)
Given a parsed parameter value, normalizes it.
The value can either be the result of parse(), the default value or arguments passed into the task’s constructor
by instantiation.
This is very implementation defined, but can be used to validate/clamp valid values. For example, if you
wanted to only accept even integers, and “correct” odd values to the nearest even integer below, you can
implement normalize as x // 2 * 2.
class luigi.parameter.YearParameter(interval=1, start=None, **kwargs)
Bases: luigi.parameter.DateParameter
Parameter whose value is a date, specified to the year (day and month of date is “rounded” to first day of the
year).
A YearParameter is a Date string formatted YYYY. Task objects constructed from code accept date (ignoring
the month and day values) or Year.
date_format = '%Y'
next_in_enumeration(value)
If your Parameter type has an enumerable ordering of values, you can choose to override this method.
This method is used by the luigi.execution_summary module for pretty printing purposes,
enabling it to pretty print tasks like MyTask(num=1), MyTask(num=2), MyTask(num=3) as
MyTask(num=1..3).
Parameters value – The value
Returns The next value, like “value + 1”, or None if there’s no enumerable ordering.
normalize(value)
Given a parsed parameter value, normalizes it.
The value can either be the result of parse(), the default value or arguments passed into the task’s
constructor by instantiation.
This is very implementation defined, but can be used to validate/clamp valid values. For example, if you
wanted to only accept even integers, and “correct” odd values to the next lower even integer, you can
implement normalize as x // 2 * 2.
class luigi.parameter.DateHourParameter(interval=1, start=None, **kwargs)
Bases: luigi.parameter._DatetimeParameterBase
Parameter whose value is a datetime specified to the hour.
A DateHourParameter is an ISO 8601 formatted date and time specified to the hour. For example,
2013-07-10T19 specifies July 10, 2013 at 19:00.
date_format = '%Y-%m-%dT%H'
class luigi.parameter.DateMinuteParameter(interval=1, start=None, **kwargs)
Bases: luigi.parameter._DatetimeParameterBase
Parameter whose value is a datetime specified to the minute.
A DateMinuteParameter is an ISO 8601 formatted date and time specified to the minute. For example,
2013-07-10T1907 specifies July 10, 2013 at 19:07.
The interval parameter can be used to clamp this parameter to every N minutes, instead of every minute.
date_format = '%Y-%m-%dT%H%M'
deprecated_date_format = '%Y-%m-%dT%HH%M'
parse(s)
Parses a string to a datetime.
class luigi.parameter.DateSecondParameter(interval=1, start=None, **kwargs)
Bases: luigi.parameter._DatetimeParameterBase
Parameter whose value is a datetime specified to the second.
A DateSecondParameter is an ISO 8601 formatted date and time specified to the second. For example,
2013-07-10T190738 specifies July 10, 2013 at 19:07:38.
The interval parameter can be used to clamp this parameter to every N seconds, instead of every second.
date_format = '%Y-%m-%dT%H%M%S'
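The three date_format strings above can be checked against the documented examples with the standard library. This is only a sketch: clamp_minutes is a hypothetical helper that rounds down within the hour, which is one plausible reading of "clamp to every N minutes"; Luigi's own clamping may be computed differently (for instance relative to the start argument).

```python
from datetime import datetime, timedelta

# Format strings copied from the date_format attributes documented above.
FORMATS = {
    'hour': '%Y-%m-%dT%H',
    'minute': '%Y-%m-%dT%H%M',
    'second': '%Y-%m-%dT%H%M%S',
}


def clamp_minutes(dt, interval):
    """Round dt down to a multiple of `interval` minutes (assumed behaviour)."""
    return dt - timedelta(minutes=dt.minute % interval)


hour = datetime.strptime('2013-07-10T19', FORMATS['hour'])
minute = datetime.strptime('2013-07-10T1907', FORMATS['minute'])
second = datetime.strptime('2013-07-10T190738', FORMATS['second'])

# With interval=5, 19:07 is clamped down to 19:05.
clamped = clamp_minutes(minute, 5)
print(hour, minute, second, clamped)
```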
class MyTask(luigi.Task):
    implicit_bool = luigi.BoolParameter(parsing=luigi.BoolParameter.IMPLICIT_PARSING)
    explicit_bool = luigi.BoolParameter(parsing=luigi.BoolParameter.EXPLICIT_PARSING)
or globally by
luigi.BoolParameter.parsing = luigi.BoolParameter.EXPLICIT_PARSING
parse(input)
Parses a time delta from the input.
See TimeDeltaParameter for details on supported formats.
serialize(x)
Converts datetime.timedelta to a string
Parameters x – the value to serialize.
class luigi.parameter.TaskParameter(default=<object object>, is_global=False,
significant=True, description=None, config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
A parameter that takes another luigi task class.
When used programmatically, the parameter should be specified directly with the luigi.task.Task
(sub)class, like MyMetaTask(my_task_param=my_tasks.MyTask). On the command line, you specify
the luigi.task.Task.get_task_family(). Like
parse(input)
Parse a task_family using the Register.
serialize(cls)
Converts the luigi.task.Task (sub) class to its family name.
class luigi.parameter.EnumParameter(*args, **kwargs)
Bases: luigi.parameter.Parameter
A parameter whose value is an Enum.
In the task definition, use
class Model(enum.Enum):
    Honda = 1
    Volvo = 2
class MyTask(luigi.Task):
    my_param = luigi.EnumParameter(enum=Model)
parse(s)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
Returns the parsed value.
serialize(e)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
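A minimal sketch of the parse()/serialize() round trip for an enum-valued parameter, using only the standard enum module. It assumes the string form is the member name (for example Honda); parse_model and serialize_model are hypothetical helpers, not Luigi functions.

```python
import enum


class Model(enum.Enum):
    Honda = 1
    Volvo = 2


def parse_model(name):
    """Look up an enum member by its name (assumed string representation)."""
    return Model[name]


def serialize_model(member):
    """Opposite of parse_model: turn the member back into its name."""
    return member.name


round_trip = serialize_model(parse_model('Volvo'))
print(parse_model('Honda'), round_trip)
```

Looking up an unknown name raises KeyError, which is where a real implementation would report an invalid value.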
class luigi.parameter.EnumListParameter(*args, **kwargs)
Bases: luigi.parameter.Parameter
A parameter whose value is a comma-separated list of Enum. Values should come from the same enum.
Values are taken to be a list, i.e. order is preserved, duplicates may occur, and empty list is possible.
In the task definition, use
class Model(enum.Enum):
    Honda = 1
    Volvo = 2
class MyTask(luigi.Task):
    my_param = luigi.EnumListParameter(enum=Model)
parse(s)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
Returns the parsed value.
serialize(enum_values)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
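The list semantics described above (order preserved, duplicates allowed, empty list possible) can be sketched with the standard library. The helpers below are hypothetical and assume that each comma-separated item is an enum member name; the container type Luigi actually returns may differ.

```python
import enum


class Model(enum.Enum):
    Honda = 1
    Volvo = 2


def parse_model_list(s):
    """Split a comma-separated string into enum members, keeping order."""
    if not s:
        return []  # an empty string is taken to mean an empty list
    return [Model[name] for name in s.split(',')]


def serialize_model_list(values):
    """Join member names back into a comma-separated string."""
    return ','.join(member.name for member in values)


models = parse_model_list('Honda,Volvo,Honda')
print(models, serialize_model_list(models))
```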
class luigi.parameter.DictParameter(default=<object object>, is_global=False,
significant=True, description=None, config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
Parameter whose value is a dict.
In the task definition, use
class MyTask(luigi.Task):
    tags = luigi.DictParameter()
    def run(self):
        logging.info("Find server with role: %s", self.tags['role'])
        server = aws.ec2.find_my_resource(self.tags)
It can be used to define dynamic parameters, when you do not know the exact list of your parameters (e.g. list
of tags, that are dynamically constructed outside Luigi), or you have a complex parameter containing logically
related values (like a database connection config).
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC
normalize(value)
Ensure that dictionary parameter is converted to a FrozenOrderedDict so it can be hashed.
parse(source)
Parses an immutable and ordered dict from a JSON string using standard JSON library.
We need to use an immutable dictionary, to create a hashable parameter and also preserve the internal
structure of parsing. The traversal order of standard dict is undefined, which can result in various string
representations of this parameter, and therefore a different task id for the task containing this parameter.
This is because task id contains the hash of parameters’ JSON representation.
Parameters s – String to be parsed
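The ordering concern can be illustrated with the standard json module. This sketch is an assumption-laden stand-in, not Luigi's implementation: parse_ordered and param_hash are hypothetical helpers, and a plain OrderedDict is used in place of an immutable mapping.

```python
import hashlib
import json
from collections import OrderedDict


def parse_ordered(source):
    """Parse JSON while keeping keys in the order they appear in the text."""
    return json.loads(source, object_pairs_hook=OrderedDict)


def param_hash(value):
    """Hash the JSON representation, as a stand-in for a task id component."""
    return hashlib.md5(json.dumps(value).encode('utf-8')).hexdigest()


a = parse_ordered('{"role": "web", "env": "prod"}')
b = parse_ordered('{"env": "prod", "role": "web"}')

# The same keys in a different order serialize differently, so the hashes
# differ; a deterministic traversal order keeps one input tied to one hash.
print(list(a), list(b), param_hash(a) == param_hash(b))
```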
serialize(x)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
class luigi.parameter.ListParameter(default=<object object>, is_global=False,
significant=True, description=None, config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
Parameter whose value is a list.
In the task definition, use
class MyTask(luigi.Task):
    grades = luigi.ListParameter()
    def run(self):
        # "total" avoids shadowing the built-in sum()
        total = 0
        for element in self.grades:
            total += element
        avg = total / len(self.grades)
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC
normalize(x)
Ensure that struct is recursively converted to a tuple so it can be hashed.
Parameters x (str) – the value to parse.
Returns the normalized (hashable/immutable) value.
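The recursive conversion to tuples can be sketched in a few lines. to_hashable is a hypothetical helper written for illustration; it shows why a nested list becomes usable as a dictionary key or set member after normalization.

```python
def to_hashable(x):
    """Recursively turn lists (and tuples of lists) into nested tuples."""
    if isinstance(x, (list, tuple)):
        return tuple(to_hashable(item) for item in x)
    return x


nested = [1, [2, 3], [[4], 5]]
frozen = to_hashable(nested)

# Lists are unhashable, but the normalized value can be hashed.
print(frozen, hash(frozen) == hash(to_hashable(nested)))
```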
parse(x)
Parse an individual value from the input.
Parameters x (str) – the value to parse.
Returns the parsed value.
serialize(x)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
class luigi.parameter.TupleParameter(default=<object object>, is_global=False,
significant=True, description=None, config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.ListParameter
Parameter whose value is a tuple or tuple of tuples.
In the task definition, use
class MyTask(luigi.Task):
    book_locations = luigi.TupleParameter()
    def run(self):
        for location in self.book_locations:
            print("Go to page %d, line %d" % (location[0], location[1]))
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC
parse(x)
Parse an individual value from the input.
Parameters x (str) – the value to parse.
Returns the parsed value.
class luigi.parameter.NumericalParameter(left_op=<built-in function le>,
right_op=<built-in function lt>, *args, **kwargs)
Bases: luigi.parameter.Parameter
Parameter whose value is a number of the specified type, e.g. int or float and in the range specified.
class MyTask(luigi.Task):
    my_param_1 = luigi.NumericalParameter(
        var_type=int, min_value=-3, max_value=7)  # -3 <= my_param_1 < 7
    my_param_2 = luigi.NumericalParameter(
        var_type=int, min_value=-3, max_value=7, left_op=operator.lt,
        right_op=operator.le)  # -3 < my_param_2 <= 7
Parameters
• var_type (function) – The type of the input variable, e.g. int or float.
• min_value – The minimum value permissible in the accepted values range. May be
inclusive or exclusive based on left_op parameter. This should be the same type as var_type.
• max_value – The maximum value permissible in the accepted values range. May be
inclusive or exclusive based on right_op parameter. This should be the same type as var_type.
• left_op (function) – The comparison operator for the left-most comparison in the
expression min_value left_op value right_op max_value. This operator should
generally be either operator.lt or operator.le. Default: operator.le.
• right_op (function) – The comparison operator for the right-most comparison in the
expression min_value left_op value right_op max_value. This operator should
generally be either operator.lt or operator.le. Default: operator.lt.
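The range expression described by left_op and right_op can be written out with the operator module. in_range is a hypothetical helper used only to show how the two operators make each bound inclusive or exclusive; the defaults mirror the documented ones.

```python
import operator


def in_range(value, min_value, max_value,
             left_op=operator.le, right_op=operator.lt):
    """Evaluate: min_value left_op value right_op max_value."""
    return left_op(min_value, value) and right_op(value, max_value)


# Default operators: -3 <= value < 7
default_low = in_range(-3, -3, 7)
default_high = in_range(7, -3, 7)

# Swapped operators: -3 < value <= 7
swapped_low = in_range(-3, -3, 7, left_op=operator.lt, right_op=operator.le)
swapped_high = in_range(7, -3, 7, left_op=operator.lt, right_op=operator.le)
print(default_low, default_high, swapped_low, swapped_high)
```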
parse(s)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
Returns the parsed value.
class luigi.parameter.ChoiceParameter(var_type=<type ’str’>, *args, **kwargs)
Bases: luigi.parameter.Parameter
A parameter which takes two values:
1. an instance of Iterable and
2. the class of the variables to convert to.
In the task definition, use
class MyTask(luigi.Task):
    my_param = luigi.ChoiceParameter(choices=[0.1, 0.2, 0.3], var_type=float)
Consider using EnumParameter for a typed, structured alternative. This class can perform the same role
when all choices are the same type and transparency of parameter value on the command line is desired.
Parameters
• var_type (function) – The type of the input variable, e.g. str, int, float, etc. Default:
str
• choices – An iterable, all of whose elements are of var_type to restrict parameter choices
to.
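A minimal sketch of the convert-then-check idea behind var_type and choices. parse_choice is a hypothetical helper and the use of ValueError is an assumption made for illustration; it is not taken from Luigi.

```python
def parse_choice(s, choices, var_type=str):
    """Convert the raw string with var_type, then restrict it to choices."""
    value = var_type(s)
    if value not in choices:
        raise ValueError('{!r} is not one of {!r}'.format(value, choices))
    return value


accepted = parse_choice('0.2', [0.1, 0.2, 0.3], var_type=float)

try:
    parse_choice('0.5', [0.1, 0.2, 0.3], var_type=float)
    rejected = False
except ValueError:
    rejected = True
print(accepted, rejected)
```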
parse(s)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
Returns the parsed value.
normalize(var)
Given a parsed parameter value, normalizes it.
The value can either be the result of parse(), the default value or arguments passed into the task’s
constructor by instantiation.
This is very implementation defined, but can be used to validate/clamp valid values. For example, if you
wanted to only accept even integers, and “correct” odd values to the next lower even integer, you can
implement normalize as x // 2 * 2.
luigi.process module
luigi.retcodes module
Module containing the logic for exit codes for the luigi binary. It’s useful when you need to know
programmatically whether luigi actually finished the given task and, if not, why.
class luigi.retcodes.retcode(*args, **kwargs)
Bases: luigi.task.Config
See the return codes configuration section.
unhandled_exception = IntParameter (defaults to 4): For internal luigi errors.
missing_data = IntParameter (defaults to 0): For when there are incomplete ExternalTas
task_failed = IntParameter (defaults to 0): For when a task's run() method fails.
already_running = IntParameter (defaults to 0): For both local --lock and luigid "lock
scheduling_error = IntParameter (defaults to 0): For when a task's complete() or requi
not_run = IntParameter (defaults to 0): For when a task is not granted run permission
luigi.retcodes.run_with_retcodes(argv)
Run luigi with command line parsing, but raise SystemExit with the configured exit code.
Note: Usually you use the luigi binary directly and don’t call this function yourself.
Parameters argv – Should (conceptually) be sys.argv[1:]
luigi.rpc module
Implementation of the REST interface between the workers and the server. rpc.py implements the client side of it,
server.py implements the server side. See Using the Central Scheduler for more info.
exception luigi.rpc.RPCError(message, sub_exception=None)
Bases: exceptions.Exception
class luigi.rpc.URLLibFetcher
Bases: object
raises = (<class 'urllib2.URLError'>, <class 'socket.timeout'>)
fetch(full_url, body, timeout)
class luigi.rpc.RequestsFetcher(session)
Bases: object
check_pid()
fetch(full_url, body, timeout)
class luigi.rpc.RemoteScheduler(url='http://localhost:8082/', connect_timeout=None)
Bases: object
Scheduler proxy object. Talks to a RemoteSchedulerResponder.
add_scheduler_message_response(*args, **kwargs)
add_task(*args, **kwargs)
• add task identified by task_id if it doesn’t exist
• if deps is not None, update dependency list
• update status of task
• add additional workers/stakeholders
• update priority when needed
add_task_batcher(*args, **kwargs)
add_worker(*args, **kwargs)
announce_scheduling_failure(*args, **kwargs)
count_pending(*args, **kwargs)
decrease_running_task_resources(*args, **kwargs)
dep_graph(*args, **kwargs)
disable_worker(*args, **kwargs)
fetch_error(*args, **kwargs)
forgive_failures(*args, **kwargs)
get_running_task_resources(*args, **kwargs)
get_scheduler_message_response(*args, **kwargs)
get_task_progress_percentage(*args, **kwargs)
get_task_status_message(*args, **kwargs)
get_work(*args, **kwargs)
graph(*args, **kwargs)
has_task_history(*args, **kwargs)
inverse_dep_graph(*args, **kwargs)
is_pause_enabled(*args, **kwargs)
is_paused(*args, **kwargs)
mark_as_done(*args, **kwargs)
pause(*args, **kwargs)
ping(*args, **kwargs)
prune(*args, **kwargs)
re_enable_task(*args, **kwargs)
resource_list(*args, **kwargs)
Resources usage info and their consumers (tasks).
send_scheduler_message(*args, **kwargs)
set_task_progress_percentage(*args, **kwargs)
set_task_status_message(*args, **kwargs)
set_worker_processes(*args, **kwargs)
task_list(*args, **kwargs)
Query for a subset of tasks by status.
task_search(*args, **kwargs)
Query for a subset of tasks by task_id.
Parameters task_str –
Returns
unpause(*args, **kwargs)
update_metrics_task_started(*args, **kwargs)
update_resource(*args, **kwargs)
update_resources(*args, **kwargs)
worker_list(*args, **kwargs)
luigi.scheduler module
The system for scheduling tasks and executing them in order. Deals with dependencies, priorities, resources, etc. The
Worker pulls tasks from the scheduler (usually over the REST interface) and executes them. See Using the Central
Scheduler for more info.
luigi.scheduler.UPSTREAM_SEVERITY_KEY()
T.index(value, [start, [stop]]) -> integer – return first index of value. Raises ValueError if the value is not present.
add_failure()
Add a failure event with the current timestamp.
num_failures()
Return the number of failures in the window.
clear()
Clear the failure queue.
class luigi.scheduler.OrderedSet(iterable=None)
Bases: _abcoll.MutableSet
Standard Python OrderedSet recipe found at http://code.activestate.com/recipes/576694/
Modified to include a peek function to get the last element
add(key)
Add an element.
discard(key)
Remove an element. Do not raise an exception if absent.
peek(last=True)
pop(last=True)
Return the popped value. Raise KeyError if empty.
class luigi.scheduler.Task(task_id, status, deps, resources=None, priority=0, family='',
module=None, params=None, param_visibilities=None, accepts_messages=False,
tracking_url=None, status_message=None, progress_percentage=None,
retry_policy='notoptional')
Bases: object
set_params(params)
is_batchable()
add_failure()
has_excessive_failures()
pretty_id
class luigi.scheduler.Worker(worker_id, last_active=None)
Bases: object
Structure for tracking worker activity and keeping their references.
add_info(info)
update(worker_reference, get_work=False)
prune(config)
get_tasks(state, *statuses)
is_trivial_worker(state)
True if the worker is not an assistant and has only tasks that are without requirements.
We have to pass the state parameter for optimization reasons.
assistant
enabled
state
add_rpc_message(name, **kwargs)
fetch_rpc_messages()
class luigi.scheduler.SimpleTaskState(state_path)
Bases: object
Keep track of the current state and handle persistence.
The point of this class is to enable other ways to keep state, e.g. by using a database. These will be
implemented by creating an abstract base class that this and other classes inherit from.
get_state()
set_state(state)
dump()
load()
get_active_tasks()
get_active_tasks_by_status(*statuses)
get_active_task_count_for_status(status)
get_batch_running_tasks(batch_id)
set_batcher(worker_id, family, batcher_args, max_batch_size)
get_batcher(worker_id, family)
num_pending_tasks()
Return how many tasks are PENDING + RUNNING. O(1).
get_task(task_id, default=None, setdefault=None)
has_task(task_id)
re_enable(task, config=None)
set_batch_running(task, batch_id, worker_id)
set_status(task, new_status, config=None)
fail_dead_worker_task(task, config, assistants)
update_status(task, config)
may_prune(task)
inactivate_tasks(delete_tasks)
get_active_workers(last_active_lt=None, last_get_work_gt=None)
get_assistants(last_active_lt=None)
get_worker_ids()
get_worker(worker_id)
inactivate_workers(delete_workers)
disable_workers(worker_ids)
update_metrics(task, config)
class luigi.scheduler.Scheduler(config=None, resources=None, task_history_impl=None,
**kwargs)
Bases: object
Async scheduler that can handle multiple workers, etc.
luigi.server module
Simple REST server that takes commands in a JSON payload. Interface to the Scheduler class. See Using the
Central Scheduler for more info.
class luigi.server.cors(*args, **kwargs)
Bases: luigi.task.Config
enabled = BoolParameter (defaults to False): Enables CORS support.
allow_any_origin = BoolParameter (defaults to False): Accepts requests from any origin
allow_null_origin = BoolParameter (defaults to False): Allows the request to set `null
max_age = IntParameter (defaults to 86400): Content of `Access-Control-Max-Age`.
allowed_methods = Parameter (defaults to GET, OPTIONS): Content of `Access-Control-Allo
allowed_headers = Parameter (defaults to Accept, Content-Type, Origin): Content of `Ac
exposed_headers = Parameter (defaults to ): Content of `Access-Control-Expose-Headers`
allow_credentials = BoolParameter (defaults to False): Indicates that the actual reque
class ProfileHandler(RequestHandler):
    def initialize(self, database):
        self.database = database
app = Application([
    (r'/user/(.*)', ProfileHandler, dict(database=database)),
])
options(*args)
get(method)
post(method)
class luigi.server.BaseTaskHistoryHandler(application, request, **kwargs)
Bases: tornado.web.RequestHandler
initialize(scheduler)
Hook for subclass initialization. Called for each request.
A dictionary passed as the third argument of a url spec will be supplied as keyword arguments to initialize().
Example:
class ProfileHandler(RequestHandler):
    def initialize(self, database):
        self.database = database
app = Application([
    (r'/user/(.*)', ProfileHandler, dict(database=database)),
])
get_template_path()
Override to customize template path for each handler.
By default, we use the template_path application setting. Return None to load templates relative to
the calling file.
class luigi.server.AllRunHandler(application, request, **kwargs)
Bases: luigi.server.BaseTaskHistoryHandler
get()
get()
luigi.server.app(scheduler)
luigi.server.run(api_port=8082, address=None, unix_socket=None, scheduler=None)
Runs one instance of the API server.
luigi.server.stop()
luigi.setup_logging module
This module contains helper classes for configuring logging for luigid and workers via command line arguments and
options from config files.
class luigi.setup_logging.BaseLogging
Bases: object
config = <luigi.configuration.cfg_parser.LuigiConfigParser instance>
classmethod setup(opts=<class ’luigi.setup_logging.opts’>)
Setup logging via CLI params and config.
class luigi.setup_logging.DaemonLogging
Bases: luigi.setup_logging.BaseLogging
Configure logging for luigid
class luigi.setup_logging.InterfaceLogging
Bases: luigi.setup_logging.BaseLogging
Configure logging for worker
luigi.six module
luigi.target module
The abstract Target class. It is a central concept of Luigi and represents the state of the workflow.
class luigi.target.Target
Bases: object
A Target is a resource generated by a Task.
For example, a Target might correspond to a file in HDFS or data in a database. The Target interface defines one
method that must be overridden: exists(), which signifies if the Target has been created or not.
Typically, a Task will define one or more Targets as output, and the Task is considered complete if and only if
each of its output Targets exist.
exists()
Returns True if the Target exists and False otherwise.
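The "complete if and only if each output Target exists" rule can be sketched without Luigi. LocalFileTarget and is_complete below are hypothetical stand-ins written for this illustration; a real Target would subclass luigi.target.Target and override exists().

```python
import os
import tempfile


class LocalFileTarget:
    """Hypothetical stand-in for a Target backed by a local file."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        # The single method the Target interface requires.
        return os.path.exists(self.path)


def is_complete(outputs):
    """A task counts as complete only when every output target exists."""
    return all(target.exists() for target in outputs)


workdir = tempfile.mkdtemp()
outputs = [LocalFileTarget(os.path.join(workdir, name))
           for name in ('a.txt', 'b.txt')]

before = is_complete(outputs)  # nothing has been written yet
for target in outputs:
    with open(target.path, 'w') as handle:
        handle.write('done\n')
after = is_complete(outputs)   # both files now exist
print(before, after)
```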
exception luigi.target.FileSystemException
Bases: exceptions.Exception
Base class for generic file system exceptions.
exception luigi.target.FileAlreadyExists
Bases: luigi.target.FileSystemException
Raised when a file system operation can’t be performed because a directory exists but is required to not exist.
exception luigi.target.MissingParentDirectory
Bases: luigi.target.FileSystemException
Raised when a parent directory doesn’t exist. (Imagine mkdir without -p)
exception luigi.target.NotADirectory
Bases: luigi.target.FileSystemException
Raised when a file system operation can’t be performed because an expected directory is actually a file.
class luigi.target.FileSystem
Bases: object
FileSystem abstraction used in conjunction with FileSystemTarget.
Typically, a FileSystem is associated with instances of a FileSystemTarget. The instances of
the FileSystemTarget will delegate methods such as FileSystemTarget.exists() and
FileSystemTarget.remove() to the FileSystem.
Methods of FileSystem raise FileSystemException if there is a problem completing the operation.
exists(path)
Return True if file or directory at path exists, False otherwise.
Parameters path (str) – a path within the FileSystem to check for existence.
remove(path, recursive=True, skip_trash=True)
Remove file or directory at location path
Parameters
• path (str) – a path within the FileSystem to remove.
• recursive (bool) – if the path is a directory, recursively remove the directory and all
of its descendants. Defaults to True.
mkdir(path, parents=True, raise_if_exists=False)
Create directory at location path
Creates the directory at path and implicitly create parent directories if they do not already exist.
Parameters
• path (str) – a path within the FileSystem to create as a directory.
• parents (bool) – Create parent directories when necessary. When parents=False and
the parent directory doesn’t exist, raise luigi.target.MissingParentDirectory
• raise_if_exists (bool) – raise luigi.target.FileAlreadyExists if the folder already
exists.
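The interplay of parents and raise_if_exists can be sketched against the local file system. Everything here is an illustration under assumptions: the two exception classes are local stand-ins for luigi.target.FileAlreadyExists and luigi.target.MissingParentDirectory, and mkdir is a hypothetical free function rather than a FileSystem method.

```python
import os
import tempfile


class FileAlreadyExists(Exception):
    """Stand-in for luigi.target.FileAlreadyExists."""


class MissingParentDirectory(Exception):
    """Stand-in for luigi.target.MissingParentDirectory."""


def mkdir(path, parents=True, raise_if_exists=False):
    """Create a directory following the semantics documented above."""
    if os.path.exists(path):
        if raise_if_exists:
            raise FileAlreadyExists(path)
        return
    if parents:
        os.makedirs(path)  # like "mkdir -p"
        return
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        raise MissingParentDirectory(parent)
    os.mkdir(path)


root = tempfile.mkdtemp()
nested = os.path.join(root, 'a', 'b')

try:
    mkdir(nested, parents=False)
    missing_parent = False
except MissingParentDirectory:
    missing_parent = True

mkdir(nested)  # parents=True creates 'a' and 'a/b'
mkdir(nested)  # already exists, silently accepted by default
print(missing_parent, os.path.isdir(nested))
```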
isdir(path)
Return True if the location at path is a directory. If not, return False.
Parameters path (str) – a path within the FileSystem to check as a directory.
Note: This method is optional, not all FileSystem subclasses implement it.
listdir(path)
Return a list of files rooted in path.
This returns an iterable of the files rooted at path. This is intended to be a recursive listing.
Parameters path (str) – a path within the FileSystem to list.
Note: This method is optional, not all FileSystem subclasses implement it.
move(path, dest)
Move a file, as one would expect.
rename_dont_move(path, dest)
Potentially rename path to dest, but don’t move it into the dest folder (if it is a folder). This relates
to Atomic Writes Problem.
This method has a reasonable but not bulletproof default implementation. It will just do move() if the
file doesn’t already exist (as checked with exists()).
rename(*args, **kwargs)
Alias for move()
copy(path, dest)
Copy a file or a directory with contents. Currently, LocalFileSystem and MockFileSystem support only
single file copying but S3Client copies either a file or a directory as required.
class luigi.target.FileSystemTarget(path)
Bases: luigi.target.Target
Base class for FileSystem Targets like LocalTarget and HdfsTarget.
A FileSystemTarget has an associated FileSystem to which certain operations can be delegated. By default,
exists() and remove() are delegated to the FileSystem, which is determined by the fs property.
Methods of FileSystemTarget raise FileSystemException if there is a problem completing the operation.
Usage:
target = FileSystemTarget('~/some_file.txt')
target = FileSystemTarget(pathlib.Path('~') / 'some_file.txt')
target.exists() # False
class MyTask(luigi.Task):
    def output(self):
        return MyFileSystemTarget(...)
    def run(self):
        with self.output().temporary_path() as self.temp_output_path:
            run_some_external_command(output_path=self.temp_output_path)
class luigi.target.AtomicLocalFile(path)
Bases: _io.BufferedWriter
Abstract class to create a Target that creates a temporary file in the local filesystem before moving it to its final
destination.
This class is just for the writing part of the Target. See luigi.local_target.LocalTarget for an example.
close()
Flush and close the IO object.
This method has no effect if the file is already closed.
generate_tmp_path(path)
move_to_final_destination()
tmp_path
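The write-to-a-temporary-file-then-move pattern can be sketched with the standard library. atomic_write is a hypothetical helper, not the AtomicLocalFile API; it assumes the temporary file lives in the destination directory so that the final os.replace stays on one file system.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to a temporary file, then move it to its final path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as handle:
            handle.write(data)
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


workdir = tempfile.mkdtemp()
final_path = os.path.join(workdir, 'result.txt')
atomic_write(final_path, 'hello\n')

with open(final_path) as handle:
    content = handle.read()
print(content, os.listdir(workdir))
```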
luigi.task module
The abstract Task class. It is a central concept of Luigi and represents the state of the workflow. See Tasks for an
overview.
luigi.task.namespace(namespace=None, scope='')
Call to set namespace of tasks declared after the call.
It is often desired to call this function with the keyword argument scope=__name__.
The scope keyword makes it so that this call is only effective for task classes with a matching __module__.
When there are multiple levels of matching module scopes like a.b vs a.b.c, the more specific one (a.b.c) wins.
The default value for scope is the empty string, which means all classes. Multiple calls with the same scope
simply replace each other.
The namespace of a Task can also be changed by specifying the property task_namespace.
class Task2(luigi.Task):
    task_namespace = 'namespace2'
This explicit setting takes priority over whatever is set in the namespace() method, and it’s also inherited
through normal Python inheritance.
There’s no equivalent way to set the task_family.
New since Luigi 2.6.0: scope keyword argument.
See also:
The new and better scaling auto_namespace()
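The rule that the more specific module scope wins (a.b.c over a.b) can be sketched as a longest-match lookup. resolve_namespace and the scoped dictionary are hypothetical, introduced only to illustrate the rule; they are not how Luigi stores namespaces.

```python
def resolve_namespace(module_name, scoped):
    """Pick the namespace whose scope most specifically matches the module.

    `scoped` maps a scope (module path, '' meaning all classes) to a namespace.
    """
    matches = [
        scope for scope in scoped
        if scope == ''
        or module_name == scope
        or module_name.startswith(scope + '.')
    ]
    if not matches:
        return ''
    # The longest matching scope is the most specific one.
    return scoped[max(matches, key=len)]


scoped = {'': 'default', 'a.b': 'ns_ab', 'a.b.c': 'ns_abc'}
print(resolve_namespace('a.b.c.tasks', scoped),
      resolve_namespace('a.b.other', scoped),
      resolve_namespace('z', scoped))
```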
luigi.task.auto_namespace(scope='')
Same as namespace(), but instead of a constant namespace, it will be set to the __module__ of the task
class. This is desirable for these reasons:
• Two tasks with the same name will not have conflicting task families
• It’s more pythonic, as modules are Python’s recommended way to do namespacing.
• It’s traceable. When you see the full name of a task, you can immediately identify where it is defined.
We recommend calling this function from your package’s outermost __init__.py file. The file contents
could look like this:
import luigi
luigi.auto_namespace(scope=__name__)
To reset an auto_namespace() call, you can use namespace(scope='my_scope'). But this will not
be needed (and is also discouraged) if you use the scope kwarg.
New since Luigi 2.6.0.
luigi.task.task_id_str(task_family, params)
Returns a canonical string used to identify a particular task.
Parameters
• task_family – The task family (class name) of the task
• params – a dict mapping parameter names to their serialized values
Returns A unique, shortened identifier corresponding to the family and params
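To make the idea of a canonical, shortened identifier concrete, here is a rough sketch of one way such a string could be built from a family name and serialized parameters. The function name sketch_task_id, the truncation lengths, and the choice of SHA-256 are assumptions made for illustration and are not taken from Luigi.

```python
import hashlib
import json
import re


def sketch_task_id(task_family, params, max_params=3, max_chars=16, hash_chars=10):
    """Build a readable, deterministic id from a family name and str->str params."""
    # Serialize with sorted keys so equal dicts always produce the same hash.
    param_str = json.dumps(params, separators=(",", ":"), sort_keys=True)
    param_hash = hashlib.sha256(param_str.encode("utf-8")).hexdigest()
    # Summarize the first few parameter values, truncated for readability.
    summary = "_".join(params[name][:max_chars]
                       for name in sorted(params)[:max_params])
    # Replace anything that is not alphanumeric or underscore.
    summary = re.sub(r"[^A-Za-z0-9_]", "_", summary)
    return "{}_{}_{}".format(task_family, summary, param_hash[:hash_chars])


print(sketch_task_id("MyTask", {"count": "3", "name": "a b"}))
```

The readable prefix helps a human recognize the task, while the hash suffix keeps ids distinct when the truncated summaries collide.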
exception luigi.task.BulkCompleteNotImplementedError
Bases: exceptions.NotImplementedError
This is here to trick pylint.
pylint thinks anything raising NotImplementedError needs to be implemented in any subclass. bulk_complete
isn’t like that. This tricks pylint into thinking that the default implementation is a valid implementation and not
an abstract method.
class luigi.task.Task(*args, **kwargs)
Bases: object
This is the base class of all Luigi Tasks, the base unit of work in Luigi.
A Luigi Task describes a unit of work.
The key methods of a Task, which must be implemented in a subclass, are:
• run() - the computation done by this task.
• requires() - the list of Tasks that this Task depends on.
• output() - the output Target that this Task creates.
Each Parameter of the Task should be declared as a class member:
class MyTask(luigi.Task):
    count = luigi.IntParameter()
    second_param = luigi.Parameter()
In addition to any declared properties and methods, there are a few non-declared properties, which are created
by the Register metaclass:
priority = 0
Priority of the task: the scheduler should favor available tasks with higher priority values first. See Task
priority.
disabled = False
resources = {}
Resources used by the task. Should be formatted like {"scp": 1} to indicate that the task requires 1 unit of
the scp resource.
worker_timeout = None
Number of seconds after which to time out the run function. No timeout if set to 0. Defaults to 0 or the
worker-timeout value in the config.
max_batch_size = inf
Maximum number of tasks to run together as a batch. Infinite by default.
batchable
True if this instance can be run as part of a batch. By default, True if it has any batched parameters.
retry_count
Override this positive integer to have a different retry_count at task level. Check [scheduler].
disable_hard_timeout
Override this positive integer to have a different disable_hard_timeout at task level. Check [scheduler].
disable_window_seconds
Override this positive integer to have a different disable_window_seconds at task level. Check
[scheduler].
owner_email
Override this to send out additional error emails to the task owner, in addition to the one defined in the
global configuration. This should return a string (a single address) or a list of strings (several addresses).
use_cmdline_section
Property used by core config such as --workers etc. These will be exposed without the class as prefix.
classmethod event_handler(event)
Decorator for adding event handlers.
trigger_event(event, *args, **kwargs)
Trigger that calls all of the specified events associated with this class.
accepts_messages
For configuring which scheduler messages can be received. When falsy, this task does not accept any
message. When True, all messages are accepted.
task_module
Returns what Python module to import to get access to this class.
task_namespace = '__not_user_specified'
This value can be overridden to set the namespace that will be used. (See Namespaces, families and
ids) If it’s not specified and you try to read this value anyway, it will return garbage. Please use
get_task_namespace() to read the namespace.
Note that setting this value with @property will not work, because this is a class level value.
classmethod get_task_namespace()
The task namespace for the given class.
Note: You normally don’t want to override this.
task_family = 'Task'
classmethod get_task_family()
The task family for the given class.
If task_namespace is not set, then it’s simply the name of the class. Otherwise,
<task_namespace>. is prefixed to the class name.
Note: You normally don’t want to override this.
classmethod get_params()
Returns all of the Parameters for this Task.
classmethod batch_param_names()
classmethod get_param_names(include_significant=False)
classmethod get_param_values(params, args, kwargs)
Get the values of the parameters from the args and kwargs.
Parameters
• params – list of (param_name, Parameter).
• args – positional arguments
• kwargs – keyword arguments.
Returns list of (name, value) tuples, one for each parameter.
param_args
initialized()
Returns True if the Task is initialized and False otherwise.
classmethod from_str_params(params_str)
Creates an instance from a str->str hash.
Parameters params_str – dict of param name -> value as string.
to_str_params(only_significant=False, only_public=False)
Convert all parameters to a str->str hash.
clone(cls=None, **kwargs)
Creates a new instance from an existing instance where some of the args have changed.
There are at least two scenarios where this is useful (see test/clone_test.py):
• remove a lot of boilerplate when you have recursive dependencies and lots of args
• there’s task inheritance and some logic is on the base class
Parameters
• cls –
• kwargs –
Returns
complete()
If the task has any outputs, return True if all outputs exist. Otherwise, return False.
However, you may freely override this method with custom logic.
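The default completeness rule described above can be sketched in a few lines. FakeTarget and sketch_complete are hypothetical stand-ins that only mimic the exists() contract; they are not Luigi classes.

```python
class FakeTarget(object):
    """Hypothetical stand-in for a Target; only exists() matters here."""

    def __init__(self, present):
        self.present = present

    def exists(self):
        return self.present


def sketch_complete(outputs):
    """Return True only if there is at least one output and all of them exist."""
    if not outputs:
        # A task without outputs is never considered complete by this rule.
        return False
    return all(target.exists() for target in outputs)


print(sketch_complete([FakeTarget(True), FakeTarget(True)]))
print(sketch_complete([FakeTarget(True), FakeTarget(False)]))
print(sketch_complete([]))
```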
classmethod bulk_complete(parameter_tuples)
Returns those of parameter_tuples for which this Task is complete.
Override (with an efficient implementation) for efficient scheduling with range tools. Keep the logic
consistent with that of complete().
output()
The output that this Task produces.
The output of the Task determines if the Task needs to be run: the task is considered finished if and only if all of its
outputs exist. Subclasses should override this method to return a single Target or a list of Target instances.
Implementation note: If running multiple workers, the output must be a resource that is accessible by all
workers, such as a DFS or database. Otherwise, workers might compute the same output since they
don’t see the work done by other workers.
See Task.output
requires()
The Tasks that this Task depends on.
A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any
other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method
to return a single Task, a list of Task instances, or a dict whose values are Task instances.
See Task.requires
process_resources()
Override in “template” tasks which provide common resource functionality but allow subclasses to specify
additional resources while preserving the name for consistent end-user experience.
input()
Returns the outputs of the Tasks returned by requires()
See Task.input
Returns a list of Target objects which are specified as outputs of all required Tasks.
deps()
Internal method used by the scheduler.
Returns the flattened list of requires.
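Since requires() may return a single Task, a list, or a dict, flattening has to handle all three shapes. The sketch below shows one plausible approach; sketch_flatten is a hypothetical helper, plain strings stand in for Task instances, and the sorted-by-key ordering for dicts is an assumption.

```python
def sketch_flatten(struct):
    """Flatten a single item, a list/tuple, or a dict of items into a flat list."""
    if struct is None:
        return []
    if isinstance(struct, dict):
        flat = []
        # Visit dict values in key order so the result is deterministic.
        for _, value in sorted(struct.items()):
            flat += sketch_flatten(value)
        return flat
    if isinstance(struct, (list, tuple)):
        flat = []
        for item in struct:
            flat += sketch_flatten(item)
        return flat
    # Anything else is treated as a single requirement.
    return [struct]


print(sketch_flatten("TaskA"))
print(sketch_flatten(["TaskA", ["TaskB"]]))
print(sketch_flatten({"y": "TaskB", "x": "TaskA"}))
```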
run()
The task run method, to be overridden in a subclass.
See Task.run
on_failure(exception)
Override for custom error handling.
This method gets called if an exception is raised in run(). The returned value of this method is JSON
encoded and sent to the scheduler as the expl argument. Its string representation will be used as the body
of the error email, if one is sent.
Default behavior is to return a string representation of the stack trace.
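A handler that returns the stack trace as a string might look roughly like the following. This is a standalone sketch using only the standard library; sketch_on_failure is a hypothetical name and the exact text Luigi produces may differ.

```python
import traceback


def sketch_on_failure(exception):
    """Return the exception's stack trace as one string."""
    lines = traceback.format_exception(
        type(exception), exception, exception.__traceback__)
    return "".join(lines)


try:
    raise ValueError("boom")
except ValueError as exc:
    report = sketch_on_failure(exc)

print(report)
```

Because the return value is a plain string, it can be JSON encoded without further conversion.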
on_success()
Override for doing custom completion handling for a larger class of tasks.
This method gets called when run() completes without raising any exceptions.
The returned value is JSON encoded and sent to the scheduler as the expl argument.
Default behavior is to send a None value.
no_unpicklable_properties(**kwds)
Remove unpicklable properties before dumping the task and restore them afterwards.
This method can be called in a subtask’s dump method to ensure unpicklable properties won’t break the
dump.
This method is a context manager, intended to be used in a with statement.
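The pattern of temporarily swapping out unpicklable attributes can be sketched as follows. SketchTask, its unpicklable_names tuple, and the placeholder string are hypothetical; the example pickles the instance dictionary rather than a real Luigi task so that it runs with the standard library alone.

```python
import pickle
from contextlib import contextmanager


class SketchTask(object):
    # Names of attributes assumed to hold unpicklable callbacks.
    unpicklable_names = ("set_status_message",)

    def __init__(self):
        self.name = "demo"
        self.set_status_message = lambda msg: None  # lambdas cannot be pickled

    @contextmanager
    def no_unpicklable_properties(self):
        saved = {}
        for attr in self.unpicklable_names:
            if hasattr(self, attr):
                saved[attr] = getattr(self, attr)
                setattr(self, attr, "placeholder_during_pickling")
        try:
            yield
        finally:
            # Restore the original attributes even if pickling fails.
            for attr, value in saved.items():
                setattr(self, attr, value)

    def dump(self):
        with self.no_unpicklable_properties():
            return pickle.dumps(vars(self))


task = SketchTask()
restored = pickle.loads(task.dump())
print(restored["name"])
```

After dump() returns, the task still has its original callable, while the pickled copy carries only the placeholder.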
class luigi.task.MixinNaiveBulkComplete
Bases: object
Enables a Task to be efficiently scheduled with e.g. range tools, by providing a bulk_complete implementation
which checks completeness in a loop.
Applicable to tasks whose completeness checking is cheap.
This doesn’t exploit output-location-specific APIs for a speed advantage, but it nevertheless removes redundant
scheduler round trips.
classmethod bulk_complete(parameter_tuples)
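Checking completeness in a loop can be sketched with a plain mixin. NaiveBulkCompleteSketch and EvenTask are hypothetical classes written for this illustration; how each parameter tuple is turned into a task instance is an assumption.

```python
class NaiveBulkCompleteSketch(object):
    """Mixin sketch: instantiate each candidate and ask it whether it is complete."""

    @classmethod
    def bulk_complete(cls, parameter_tuples):
        done = []
        for params in parameter_tuples:
            if isinstance(params, (list, tuple)):
                task = cls(*params)
            elif isinstance(params, dict):
                task = cls(**params)
            else:
                task = cls(params)
            if task.complete():
                done.append(params)
        return done


class EvenTask(NaiveBulkCompleteSketch):
    """Toy task that counts as complete when n is even."""

    def __init__(self, n):
        self.n = n

    def complete(self):
        return self.n % 2 == 0


print(EvenTask.bulk_complete([1, 2, (4,), {"n": 5}]))
```

This costs one complete() call per candidate, which is why the approach suits tasks whose completeness check is cheap.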
class luigi.task.ExternalTask(*args, **kwargs)
Bases: luigi.task.Task
Subclass for references to external dependencies.
An ExternalTask does not have a run implementation, which signifies to the framework that this Task’s
output() is generated outside of Luigi.
run = None
luigi.task.externalize(taskclass_or_taskobject)
Returns an externalized version of a Task. You may pass either an instantiated task object or a task class. Some
examples:
class RequiringTask(luigi.Task):
    def requires(self):
        task_object = self.clone(MyTask)
        return externalize(task_object)
    ...
Here’s mostly equivalent code, but externalize is applied to a task class instead.
@luigi.util.requires(externalize(MyTask))
class RequiringTask(luigi.Task):
    pass
    ...
Of course, it may also be used directly on classes and objects (for example for reexporting or other usage).
MyTask = externalize(MyTask)
my_task_2 = externalize(MyTask2(param='foo'))
However, if you want a task class to be external from the beginning, you’re better off inheriting ExternalTask
rather than Task.
This function tries to be side-effect free by creating a copy of the class or the object passed in and then modifying
that copy. In particular, calling externalize() and discarding its return value shouldn’t do anything.
luigi.task.flatten_output(task)
Lists all output targets by recursively walking output-less (wrapper) tasks.
FIXME order consistently.
luigi.task_history module
Abstract class for task history. Currently the only subclass is DbTaskHistory.
class luigi.task_history.StoredTask(task, status, host=None)
Bases: object
Interface for methods on TaskHistory
task_family
parameters
class luigi.task_history.TaskHistory
Bases: object
Abstract Base Class for updating the run history of a task
task_scheduled(task)
task_finished(task, successful)
task_started(task, worker_host)
class luigi.task_history.NopHistory
Bases: luigi.task_history.TaskHistory
task_scheduled(task)
task_finished(task, successful)
task_started(task, worker_host)
luigi.task_register module
luigi.task_status module
luigi.util module
Most luigi plumbers will find themselves in an awkward task parameter situation at some point or another. Consider
the following “parameter explosion” problem:
class TaskA(luigi.ExternalTask):
    param_a = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_a}'.format(t=self))

class TaskB(luigi.Task):
    param_b = luigi.Parameter()
    param_a = luigi.Parameter()

    def requires(self):
        return TaskA(param_a=self.param_a)

class TaskC(luigi.Task):
    param_c = luigi.Parameter()
    param_b = luigi.Parameter()
    param_a = luigi.Parameter()

    def requires(self):
        return TaskB(param_b=self.param_b, param_a=self.param_a)
In workflows requiring many tasks to be chained together in this manner, parameter handling can spiral out of control.
Each downstream task becomes more burdensome than the last. Refactoring becomes more difficult. There are several
ways one might try to avoid the problem.
Approach 1: Parameters via command line or config instead of requires().
class TaskA(luigi.ExternalTask):
    param_a = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_a}'.format(t=self))

class TaskB(luigi.Task):
    param_b = luigi.Parameter()

    def requires(self):
        return TaskA()

class TaskC(luigi.Task):
    param_c = luigi.Parameter()

    def requires(self):
        return TaskB()

luigi --module my_tasks TaskC --param-c foo --TaskB-param-b bar --TaskA-param-a baz
Repetitive parameters have been eliminated, but at the cost of making the job’s command line interface slightly
clunkier. Often this is a reasonable trade-off.
But parameters can’t always be refactored out of every class. Downstream tasks might also need to use some of those
parameters. For example, if TaskC needs to use param_a too, then param_a would still need to be repeated.
Approach 2: Use a common parameter class
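class Params(luigi.Config):
    param_c = luigi.Parameter()
    param_b = luigi.Parameter()
    param_a = luigi.Parameter()

class TaskA(Params, luigi.ExternalTask):
    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_a}'.format(t=self))

class TaskB(Params):
    def requires(self):
        return TaskA()

class TaskC(Params):
    def requires(self):
        return TaskB()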
This looks great at first glance, but a couple of issues lurk. Now TaskA and TaskB have unnecessary significant
parameters. Significant parameters help define the identity of a task. Identical tasks are prevented from running at the
same time by the central planner. This helps preserve the idempotent and atomic nature of luigi tasks. Unnecessary
significant task parameters confuse a task’s identity. Under the right circumstances, task identity confusion could lead
to that task running when it shouldn’t, or failing to run when it should.
This approach should only be used when all of the parameters of the config class are significant (or all insignificant)
for all of its subclasses.
And wait a second... there’s a bug in the above code. See it?
TaskA won’t behave as an ExternalTask because the parent classes are specified in the wrong order. This contrived
example is easy to fix (by swapping the ordering of the parents of TaskA), but real-world cases can be more
difficult to both spot and fix. Inheriting from multiple classes derived from Task should be undertaken with caution
and avoided where possible.
Approach 3: Use inherits and requires
The inherits class decorator in this module copies parameters (and nothing else) from one task class to another,
and avoids direct pythonic inheritance.
import luigi
from luigi.util import inherits

class TaskA(luigi.ExternalTask):
    param_a = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_a}'.format(t=self))

@inherits(TaskA)
class TaskB(luigi.Task):
    def requires(self):
        t = self.clone(TaskA)  # or t = self.clone_parent()
        return t

@inherits(TaskB)
class TaskC(luigi.Task):
    param_c = luigi.Parameter()

    def requires(self):
        return self.clone(TaskB)
This totally eliminates the need to repeat parameters, avoids inheritance issues, and keeps the task command line
interface as simple (as it can be, anyway). Refactoring task parameters is also much easier.
The requires helper function can reduce this pattern even further. It does everything inherits does, and also
attaches a requires method to your task (still all without pythonic inheritance).
But how does it know how to invoke the upstream task? It uses clone() behind the scenes!
import luigi
from luigi.util import inherits, requires

class TaskA(luigi.ExternalTask):
    param_a = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_a}'.format(t=self))

@requires(TaskA)
class TaskB(luigi.Task):
    param_b = luigi.Parameter()
Use these helper functions effectively to avoid unnecessary repetition and dodge a few potentially nasty workflow
pitfalls at the same time. Brilliant!
luigi.util.common_params(task_instance, task_cls)
Grab all the values in task_instance that are found in task_cls.
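The idea of keeping only the values that both tasks understand can be shown with plain dictionaries. In this hypothetical sketch, instance_values stands in for the parameter values of a task instance and target_param_names for the parameter names declared on the other task class; neither name comes from Luigi.

```python
def sketch_common_params(instance_values, target_param_names):
    """Keep only the values whose names are also declared by the target class."""
    wanted = set(target_param_names)
    return {name: value
            for name, value in instance_values.items()
            if name in wanted}


# TaskB carries param_a and param_b, while TaskA only declares param_a.
task_b_values = {"param_a": "baz", "param_b": "bar"}
print(sketch_common_params(task_b_values, ["param_a"]))  # {'param_a': 'baz'}
```

Passing only the shared values is what allows an upstream task to be instantiated without receiving parameters it does not declare.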
class luigi.util.inherits(*tasks_to_inherit)
Bases: object
Task inheritance.
New after Luigi 2.7.6: multiple arguments support.
Usage:
class AnotherTask(luigi.Task):
    m = luigi.IntParameter()

class YetAnotherTask(luigi.Task):
    n = luigi.IntParameter()

@inherits(AnotherTask)
class MyFirstTask(luigi.Task):
    def requires(self):
        return self.clone_parent()

    def run(self):
        print(self.m)  # this will be defined
        # ...

@inherits(AnotherTask, YetAnotherTask)
class MySecondTask(luigi.Task):
    def requires(self):
        return self.clone_parents()

    def run(self):
        print(self.n)  # this will be defined
        # ...
class luigi.util.requires(*tasks_to_require)
Bases: object
Same as inherits, but also auto-defines the requires method.
New after Luigi 2.7.6: multiple arguments support.
class luigi.util.copies(task_to_copy)
Bases: object
Auto-copies a task.
Usage:
@copies(MyTask)
class CopyOfMyTask(luigi.Task):
    def output(self):
        return LocalTarget(self.date.strftime('/var/xyz/report-%Y-%m-%d'))
luigi.util.delegates(task_that_delegates)
Lets a task call methods on subtask(s).
The way this works is that the subtask is run as a part of the task, but the task itself doesn’t have to care about the
requirements of the subtasks. The subtask doesn’t exist from the scheduler’s point of view, and its dependencies
are instead required by the main task.
Example:
class PowersOfN(luigi.Task):
    n = luigi.IntParameter()

    def f(self, x):
        return x ** self.n

@delegates
class T(luigi.Task):
    def subtasks(self):
        return PowersOfN(5)

    def run(self):
        print(self.subtasks().f(42))
luigi.util.previous(task)
Return a previous Task of the same family.
By default checks if this task family only has one non-global parameter and if it is a DateParameter,
DateHourParameter or DateIntervalParameter, in which case it returns with the time decremented by 1 (hour, day or
interval).
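The decrement-by-one-step behavior can be sketched with the standard datetime module. The function sketch_previous is hypothetical, and dispatching on the Python value type (rather than on the Luigi parameter type) is a simplification made for this example.

```python
import datetime


def sketch_previous(value):
    """Step a date back by one day, or a datetime back by one hour."""
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime.datetime):
        return value - datetime.timedelta(hours=1)
    if isinstance(value, datetime.date):
        return value - datetime.timedelta(days=1)
    raise NotImplementedError("no notion of 'previous' for %r" % (value,))


print(sketch_previous(datetime.date(2013, 7, 1)))
print(sketch_previous(datetime.datetime(2013, 7, 10, 0)))
```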
luigi.util.get_previous_completed(task, max_steps=10)
luigi.worker module
The worker communicates with the scheduler and does two things:
1. Sends all tasks that have to be run
2. Gets tasks from the scheduler that should be run
When running in local mode, the worker talks directly to a Scheduler instance. When you run a central server, the
worker will talk to the scheduler using a RemoteScheduler instance.
Everything in this module is private to luigi and may change in incompatible ways between versions. The only
exceptions are the exception types and the worker config class.
exception luigi.worker.TaskException
Bases: exceptions.Exception
class luigi.worker.GetWorkResponse(task_id, running_tasks, n_pending_tasks,
n_unique_pending, n_pending_last_scheduled,
worker_state)
Bases: tuple
Create new instance of GetWorkResponse(task_id, running_tasks, n_pending_tasks, n_unique_pending,
n_pending_last_scheduled, worker_state)
n_pending_last_scheduled
Alias for field number 4
n_pending_tasks
Alias for field number 2
n_unique_pending
Alias for field number 3
running_tasks
Alias for field number 1
task_id
Alias for field number 0
worker_state
Alias for field number 5
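The "Alias for field number N" entries are the kind of documentation that collections.namedtuple generates for its fields. The sketch below rebuilds an equivalent tuple type under the hypothetical name GetWorkResponseSketch; the field values are made up for illustration.

```python
from collections import namedtuple

GetWorkResponseSketch = namedtuple("GetWorkResponseSketch", [
    "task_id",
    "running_tasks",
    "n_pending_tasks",
    "n_unique_pending",
    "n_pending_last_scheduled",
    "worker_state",
])

# Hypothetical values; only the field order matters for this example.
resp = GetWorkResponseSketch("MyTask_1", [], 2, 1, 0, "active")

print(resp.task_id == resp[0])
print(resp.n_pending_tasks == resp[2])
print(resp.worker_state == resp[5])
```

Each attribute is simply a named view of a fixed tuple position, so the response can be unpacked or indexed like any other tuple.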
exception luigi.worker.AsyncCompletionException(trace)
Bases: exceptions.Exception
Exception indicating that something went wrong with checking complete.
class luigi.worker.TracebackWrapper(trace)
Bases: object
Class to wrap tracebacks so we can know they’re not just strings.
luigi.worker.check_complete(task, out_queue)
Checks if task is complete, puts the result to out_queue.
class luigi.worker.worker(*args, **kwargs)
Bases: luigi.task.Config
id = Parameter (defaults to ): Override the auto-generated worker_id
ping_interval = FloatParameter (defaults to 1.0)
keep_alive = BoolParameter (defaults to False)
count_uniques = BoolParameter (defaults to False): worker-count-uniques means that we
count_last_scheduled = BoolParameter (defaults to False): Keep a worker alive only if
wait_interval = FloatParameter (defaults to 1.0)
wait_jitter = FloatParameter (defaults to 5.0)
max_keep_alive_idle_duration = TimeDeltaParameter (defaults to 0:00:00)
max_reschedules = IntParameter (defaults to 1)
timeout = IntParameter (defaults to 0)
task_limit = IntParameter (defaults to None)
retry_external_tasks = BoolParameter (defaults to False): If true, incomplete external
send_failure_email = BoolParameter (defaults to True): If true, send e-mails directly
no_install_shutdown_handler = BoolParameter (defaults to False): If true, the SIGUSR1
check_unfulfilled_deps = BoolParameter (defaults to True): If true, check for complete
check_complete_on_run = BoolParameter (defaults to False): If true, only mark tasks as
force_multiprocessing = BoolParameter (defaults to False): If true, use multiprocessin
task_process_context = OptionalParameter (defaults to None): If set to a fully qualifi
class luigi.worker.KeepAliveThread(scheduler, worker_id, ping_interval,
rpc_message_callback)
Bases: threading.Thread
Periodically tell the scheduler that the worker still lives.
stop()
run()
Method representing the thread’s activity.
You may override this method in a subclass. The standard run() method invokes the callable object passed
to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken
from the args and kwargs arguments, respectively.
luigi.worker.rpc_message_callback(fn)
class luigi.Config(*args, **kwargs)
Bases: luigi.task.Task
Class for configuration. See Configuration classes.
luigi.auto_namespace(scope='')
Same as namespace(), but instead of a constant namespace, it will be set to the __module__ of the task
class. This is desirable for these reasons:
• Two tasks with the same name will not have conflicting task families
• It’s more pythonic, as modules are Python’s recommended way to do namespacing.
• It’s traceable. When you see the full name of a task, you can immediately identify where it is defined.
We recommend calling this function from your package’s outermost __init__.py file. The file contents
could look like this:
import luigi
luigi.auto_namespace(scope=__name__)
To reset an auto_namespace() call, you can use namespace(scope='my_scope'). But this will not
be needed (and is also discouraged) if you use the scope kwarg.
New since Luigi 2.6.0.
class luigi.Target
Bases: object
A Target is a resource generated by a Task.
For example, a Target might correspond to a file in HDFS or data in a database. The Target interface defines one
method that must be overridden: exists(), which signifies if the Target has been created or not.
Typically, a Task will define one or more Targets as output, and the Task is considered complete if and only if
each of its output Targets exists.
exists()
Returns True if the Target exists and False otherwise.
class luigi.LocalTarget(path=None, format=None, is_tmp=False)
Bases: luigi.target.FileSystemTarget
fs = <luigi.local_target.LocalFileSystem object>
makedirs()
Create all parent folders if they do not exist.
open(mode='r')
Open the FileSystem target.
This method returns a file-like object which can either be read from or written to depending on the specified
mode.
Parameters mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas
w will open the FileSystemTarget in write mode. Subclasses can implement additional options.
Using b is not supported; initialize with format=Nop instead.
move(new_path, raise_if_exists=False)
move_dir(new_path)
remove()
Remove the resource at the path specified by this FileSystemTarget.
This method is implemented by using fs.
copy(new_path, raise_if_exists=False)
fn
class luigi.RemoteScheduler(url='http://localhost:8082/', connect_timeout=None)
Bases: object
Scheduler proxy object. Talks to a RemoteSchedulerResponder.
add_scheduler_message_response(*args, **kwargs)
add_task(*args, **kwargs)
• add task identified by task_id if it doesn’t exist
• if deps is not None, update dependency list
• update status of task
• add additional workers/stakeholders
unpause(*args, **kwargs)
update_metrics_task_started(*args, **kwargs)
update_resource(*args, **kwargs)
update_resources(*args, **kwargs)
worker_list(*args, **kwargs)
exception luigi.RPCError(message, sub_exception=None)
Bases: exceptions.Exception
class luigi.Parameter(default=<object object>, is_global=False, significant=True, description=None,
config_path=None, positional=True, always_in_help=False,
batch_method=None, visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: object
Parameter whose value is a str, and a base class for other parameter types.
Parameters are objects set on the Task class level to make it possible to parameterize tasks. For instance:
class MyTask(luigi.Task):
    foo = luigi.Parameter()

class RequiringTask(luigi.Task):
    def requires(self):
        return MyTask(foo="hello")

    def run(self):
        print(self.requires().foo)  # prints "hello"
class MyTask(luigi.Task):
    date = luigi.DateParameter()

    def run(self):
        templated_path = "/my/path/to/my/dataset/{date:%Y/%m/%d}/"
        instantiated_path = templated_path.format(date=self.date)
        # print(instantiated_path) --> /my/path/to/my/dataset/2016/06/09/
        # ... use instantiated_path ...
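The path templating in this example relies only on standard Python behavior: a datetime.date accepts strftime codes in a format specification. The standalone snippet below reproduces it with a fixed date in place of self.date.

```python
import datetime

templated_path = "/my/path/to/my/dataset/{date:%Y/%m/%d}/"

# A fixed date stands in for the task's date parameter value.
instantiated_path = templated_path.format(date=datetime.date(2016, 6, 9))

print(instantiated_path)  # /my/path/to/my/dataset/2016/06/09/
```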
To make this parameter default to the current day, you can write code like this:
import datetime

class MyTask(luigi.Task):
    date = luigi.DateParameter(default=datetime.date.today())
date_format = '%Y-%m-%d'
next_in_enumeration(value)
If your Parameter type has an enumerable ordering of values, you can choose to override this method.
This method is used by the luigi.execution_summary module for pretty printing purposes,
enabling it to pretty print tasks like MyTask(num=1), MyTask(num=2), MyTask(num=3) as
MyTask(num=1..3).
Parameters value – The value
Returns The next value, like “value + 1”. Or None if there’s no enumerable ordering.
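As a rough illustration of how a "next value" hook makes this kind of range compression possible, the sketch below collapses a sorted list of values into first..last runs. The helper names next_int, collapse_runs and format_runs are hypothetical and not part of Luigi; the actual logic lives in luigi.execution_summary and may differ.

```python
def next_int(value):
    """Hypothetical stand-in for next_in_enumeration on an int-valued parameter."""
    return value + 1


def collapse_runs(values, next_value=next_int):
    """Group sorted values into (first, last) runs of consecutive values."""
    runs = []
    for v in values:
        if runs and next_value(runs[-1][1]) == v:
            runs[-1] = (runs[-1][0], v)  # v continues the current run
        else:
            runs.append((v, v))  # v starts a new run
    return runs


def format_runs(runs):
    """Render single values as-is and longer runs as first..last."""
    return ", ".join(str(a) if a == b else "%s..%s" % (a, b) for a, b in runs)


summary = format_runs(collapse_runs([1, 2, 3, 7]))
```

Returning None from the hook (no enumerable ordering) would simply mean that no two values are ever merged into a run.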
normalize(value)
Given a parsed parameter value, normalizes it.
The value can either be the result of parse(), the default value or arguments passed into the task’s
constructor by instantiation.
The behaviour is implementation-defined, but normalize can be used to validate or clamp values. For
example, if you wanted to only accept even integers, and “correct” odd values down to the nearest even
integer, you can implement normalize as x // 2 * 2.
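The expression above can be checked in isolation. This is a minimal sketch; the function name normalize_even is hypothetical, and in a real task the same expression would sit inside a normalize override on a Parameter subclass.

```python
def normalize_even(x):
    """Round an integer down to the nearest even integer."""
    return x // 2 * 2


# Even values pass through unchanged; odd values move down by one.
results = [normalize_even(n) for n in (4, 5, 7, -3)]
```

Because // is floor division, negative odd values also move down (towards minus infinity), not towards zero.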
class luigi.MonthParameter(interval=1, start=None, **kwargs)
Bases: luigi.parameter.DateParameter
Parameter whose value is a date, specified to the month (day of date is “rounded” to first of the month).
A MonthParameter is a Date string formatted YYYY-MM. For example, 2013-07 specifies July of 2013. Task
objects constructed from code accept date (ignoring the day value) or Month.
date_format = '%Y-%m'
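The effect of this format string can be sketched with the standard library alone. The helpers parse_month and round_to_month are hypothetical names used for illustration; they mimic the documented behaviour and are not Luigi's implementation.

```python
import datetime

DATE_FORMAT = "%Y-%m"  # same format string as MonthParameter.date_format


def parse_month(s):
    """Parse a YYYY-MM string; strptime fills in day 1 when no day is given."""
    return datetime.datetime.strptime(s, DATE_FORMAT).date()


def round_to_month(d):
    """Ignore the day of a date by moving it to the first of the month."""
    return d.replace(day=1)


parsed = parse_month("2013-07")
rounded = round_to_month(datetime.date(2013, 7, 19))
serialized = rounded.strftime(DATE_FORMAT)
```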
next_in_enumeration(value)
If your Parameter type has an enumerable ordering of values, you can choose to override this method.
This method is used by the luigi.execution_summary module for pretty printing purposes.
The interval parameter can be used to clamp this parameter to every N minutes, instead of every minute.
date_format = '%Y-%m-%dT%H%M'
deprecated_date_format = '%Y-%m-%dT%HH%M'
parse(s)
Parses a string to a datetime.
class luigi.DateSecondParameter(interval=1, start=None, **kwargs)
Bases: luigi.parameter._DatetimeParameterBase
Parameter whose value is a datetime specified to the second.
A DateSecondParameter is an ISO 8601 formatted date and time specified to the second. For example,
2013-07-10T190738 specifies July 10, 2013 at 19:07:38.
The interval parameter can be used to clamp this parameter to every N seconds, instead of every second.
date_format = '%Y-%m-%dT%H%M%S'
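The following standalone sketch shows how a string in this format parses, and one plausible way to clamp a datetime to every N seconds. It assumes the clamping grid is aligned to the Unix epoch; that choice, and the names parse_second and clamp, are assumptions for illustration and not a statement of how Luigi implements interval and start.

```python
import datetime

DATE_FORMAT = "%Y-%m-%dT%H%M%S"  # same format string as DateSecondParameter.date_format
EPOCH = datetime.datetime(1970, 1, 1)  # assumed reference point for the clamping grid


def parse_second(s):
    """Parse a string such as 2013-07-10T190738 into a datetime."""
    return datetime.datetime.strptime(s, DATE_FORMAT)


def clamp(dt, interval_seconds, start=EPOCH):
    """Round dt down to the previous multiple of interval_seconds counted from start."""
    offset = (dt - start).total_seconds() % interval_seconds
    return dt - datetime.timedelta(seconds=offset)


dt = parse_second("2013-07-10T190738")
clamped = clamp(dt, 15)  # every 15 seconds instead of every second
```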
class luigi.DateIntervalParameter(default=<object object>, is_global=False,
significant=True, description=None, config_path=None,
positional=True, always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
A Parameter whose value is a DateInterval.
Date Intervals are specified using the ISO 8601 date notation for dates (e.g. “2015-11-04”), months (e.g.
“2015-05”), years (e.g. “2015”), or weeks (e.g. “2015-W35”). In addition, it also supports arbitrary date
intervals provided as two dates separated with a dash (e.g. “2015-11-04-2015-12-04”).
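To make the accepted notations concrete, the sketch below classifies a string by its shape using regular expressions. It is only an illustration of the notations listed above: the classify helper and the kind labels are hypothetical, and Luigi's own parsing (in luigi.date_interval) is not reproduced here.

```python
import re

# Each pattern matches the whole string; the order of the list does not matter
# because the shapes are mutually exclusive.
PATTERNS = [
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("week", re.compile(r"^\d{4}-W\d{2}$")),
    ("month", re.compile(r"^\d{4}-\d{2}$")),
    ("year", re.compile(r"^\d{4}$")),
    ("custom", re.compile(r"^\d{4}-\d{2}-\d{2}-\d{4}-\d{2}-\d{2}$")),
]


def classify(s):
    """Return the kind of interval notation s uses, or None if unrecognised."""
    for kind, pattern in PATTERNS:
        if pattern.match(s):
            return kind
    return None


examples = ("2015-11-04", "2015-05", "2015", "2015-W35", "2015-11-04-2015-12-04")
kinds = [classify(s) for s in examples]
```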
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run.
serialize(x)
Converts datetime.timedelta to a string
Parameters x – the value to serialize.
class luigi.IntParameter(default=<object object>, is_global=False, significant=True,
description=None, config_path=None, positional=True, always_in_help=False,
batch_method=None, visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
Parameter whose value is an int.
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run.
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC.
parse(s)
Parses an int from the string using int().
next_in_enumeration(value)
If your Parameter type has an enumerable ordering of values, you can choose to override this method.
This method is used by the luigi.execution_summary module for pretty printing purposes,
enabling it to pretty print tasks like MyTask(num=1), MyTask(num=2), MyTask(num=3) as
MyTask(num=1..3).
Parameters value – The value
Returns The next value, like “value + 1”, or None if there’s no enumerable ordering.
class luigi.FloatParameter(default=<object object>, is_global=False, significant=True,
description=None, config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
class MyTask(luigi.Task):
    implicit_bool = luigi.BoolParameter(parsing=luigi.BoolParameter.IMPLICIT_PARSING)
    explicit_bool = luigi.BoolParameter(parsing=luigi.BoolParameter.EXPLICIT_PARSING)
or globally by
luigi.BoolParameter.parsing = luigi.BoolParameter.EXPLICIT_PARSING
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC
parse(input)
Parse a task_family using the Register
serialize(cls)
Converts the luigi.task.Task (sub) class to its family name.
class luigi.ListParameter(default=<object object>, is_global=False, significant=True,
description=None, config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
Parameter whose value is a list.
In the task definition, use
class MyTask(luigi.Task):
    grades = luigi.ListParameter()

    def run(self):
        sum = 0
        for element in self.grades:
            sum += element
        avg = sum / len(self.grades)
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run.
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC.
normalize(x)
Ensure that struct is recursively converted to a tuple so it can be hashed.
Parameters x (str) – the value to parse.
Returns the normalized (hashable/immutable) value.
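The idea behind this normalization can be sketched without Luigi: a nested list is not hashable, but the equivalent nested tuple is. The freeze helper below is a hypothetical, simplified stand-in that only handles lists and tuples; it is not Luigi's implementation.

```python
def freeze(value):
    """Recursively convert lists (and tuples) into tuples so the result is hashable."""
    if isinstance(value, (list, tuple)):
        return tuple(freeze(item) for item in value)
    return value


grades = [[90, 85], [70], 100]
frozen = freeze(grades)

# A list cannot be hashed, but the frozen structure can, so it can take part
# in anything that relies on hashing, such as dict keys or set membership.
lookup = {frozen: "hashable"}
```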
parse(x)
Parse an individual value from the input.
Parameters x (str) – the value to parse.
Returns the parsed value.
serialize(x)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
class luigi.TupleParameter(default=<object object>, is_global=False, significant=True,
description=None, config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.ListParameter
Parameter whose value is a tuple or tuple of tuples.
In the task definition, use
class MyTask(luigi.Task):
    book_locations = luigi.TupleParameter()

    def run(self):
        for location in self.book_locations:
            print("Go to page %d, line %d" % (location[0], location[1]))
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run.
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC.
parse(x)
Parse an individual value from the input.
Parameters x (str) – the value to parse.
Returns the parsed value.
class luigi.EnumParameter(*args, **kwargs)
Bases: luigi.parameter.Parameter
A parameter whose value is an Enum.
In the task definition, use
class Model(enum.Enum):
    Honda = 1
    Volvo = 2

class MyTask(luigi.Task):
    my_param = luigi.EnumParameter(enum=Model)
parse(s)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
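For an Enum-valued parameter, one plausible form of specialized parsing is a lookup by member name. The sketch below assumes the string supplied on the command line is the member name (for example Honda); the helpers parse_enum and serialize_enum are hypothetical and only illustrate the round trip, not Luigi's code.

```python
import enum


class Model(enum.Enum):
    Honda = 1
    Volvo = 2


def parse_enum(enum_cls, s):
    """Look up an enum member by name, reporting unknown names as ValueError."""
    try:
        return enum_cls[s]
    except KeyError:
        raise ValueError("%r is not a member of %s" % (s, enum_cls.__name__))


def serialize_enum(member):
    """Convert a member back to the string that parse_enum accepts."""
    return member.name


parsed = parse_enum(Model, "Honda")
round_trip = serialize_enum(parsed)
```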
class MyTask(luigi.Task):
    tags = luigi.DictParameter()

    def run(self):
        logging.info("Find server with role: %s", self.tags['role'])
        server = aws.ec2.find_my_resource(self.tags)
It can be used to define dynamic parameters, when you do not know the exact list of your parameters (e.g. list
of tags, that are dynamically constructed outside Luigi), or you have a complex parameter containing logically
related values (like a database connection config).
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run.
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC.
normalize(value)
Ensure that dictionary parameter is converted to a FrozenOrderedDict so it can be hashed.
parse(source)
Parses an immutable and ordered dict from a JSON string using the standard JSON library.
We need to use an immutable dictionary, to create a hashable parameter and also preserve the internal
structure of parsing. The traversal order of a standard dict is undefined, which can result in various string
representations of this parameter, and therefore a different task id for the task containing this parameter.
This is because the task id contains the hash of the parameters’ JSON representation.
Parameters s – String to be parsed
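The ordering concern can be demonstrated with the standard library. This sketch parses two JSON strings with the same content but different key order, and shows that their serialized forms differ unless the keys are put into a canonical order. Using sort_keys here is an assumption made for the demonstration; it is not a claim about how Luigi builds task ids.

```python
import json
from collections import OrderedDict


def parse_ordered(source):
    """Parse a JSON object while preserving the key order found in the string."""
    return json.loads(source, object_pairs_hook=OrderedDict)


a = parse_ordered('{"role": "web", "env": "staging"}')
b = parse_ordered('{"env": "staging", "role": "web"}')

same_content = dict(a) == dict(b)  # same keys and values
same_text = json.dumps(a) == json.dumps(b)  # serialized forms follow key order
canonical_equal = json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```

Anything hashed from the serialized text inherits this sensitivity, which is why a well-defined traversal order matters.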
serialize(x)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
luigi.run(*args, **kwargs)
Please don’t use. Instead use the luigi binary.
Run from cmdline using argparse.
Parameters use_dynamic_argparse – Deprecated and ignored
luigi.build(tasks, worker_scheduler_factory=None, detailed_summary=False, **env_params)
Run internally, bypassing the cmdline parsing.
Useful if you have some luigi code that you want to run internally. Example:
luigi.build([MyTask1(), MyTask2()], local_scheduler=True)
One notable difference is that build defaults to not using the identical process lock. Otherwise, build would only
be callable once from each process.
Parameters
• tasks –
• worker_scheduler_factory –
• env_params –
Returns True if there were no scheduling errors, even if tasks may fail.
class luigi.Event
Bases: object
DEPENDENCY_DISCOVERED = 'event.core.dependency.discovered'
DEPENDENCY_MISSING = 'event.core.dependency.missing'
DEPENDENCY_PRESENT = 'event.core.dependency.present'
BROKEN_TASK = 'event.core.task.broken'
START = 'event.core.start'
PROGRESS = 'event.core.progress'
This event can be fired by the task itself while running. The purpose is for the task to report progress,
metadata or any generic info so that an event handler listening for it can keep track of the progress of
the running task.
FAILURE = 'event.core.failure'
SUCCESS = 'event.core.success'
PROCESSING_TIME = 'event.core.processing_time'
TIMEOUT = 'event.core.timeout'
PROCESS_FAILURE = 'event.core.process_failure'
class luigi.NumericalParameter(left_op=<built-in function le>, right_op=<built-in function lt>,
*args, **kwargs)
Bases: luigi.parameter.Parameter
Parameter whose value is a number of the specified type, e.g. int or float and in the range specified.
In the task definition, use
class MyTask(luigi.Task):
    my_param_1 = luigi.NumericalParameter(
        var_type=int, min_value=-3, max_value=7)  # -3 <= my_param_1 < 7
    my_param_2 = luigi.NumericalParameter(
        var_type=int, min_value=-3, max_value=7,
        left_op=operator.lt, right_op=operator.le)  # -3 < my_param_2 <= 7
Parameters
• var_type (function) – The type of the input variable, e.g. int or float.
• min_value – The minimum value permissible in the accepted values range. May be
inclusive or exclusive based on left_op parameter. This should be the same type as var_type.
• max_value – The maximum value permissible in the accepted values range. May be
inclusive or exclusive based on right_op parameter. This should be the same type as var_type.
• left_op (function) – The comparison operator for the left-most comparison in the
expression min_value left_op value right_op max_value. This operator should
generally be either operator.lt or operator.le. Default: operator.le.
• right_op (function) – The comparison operator for the right-most comparison in the
expression min_value left_op value right_op max_value. This operator should
generally be either operator.lt or operator.le. Default: operator.lt.
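The range expression described by left_op and right_op can be evaluated directly with the operator module. This is a standalone sketch of the comparison only; the in_range helper is hypothetical and not part of Luigi.

```python
import operator


def in_range(value, min_value, max_value, left_op=operator.le, right_op=operator.lt):
    """Evaluate min_value left_op value right_op max_value."""
    return left_op(min_value, value) and right_op(value, max_value)


# Defaults: -3 <= value < 7, so the lower bound is included and the upper is not.
default_bounds = [in_range(v, -3, 7) for v in (-3, 0, 7)]

# Swapped operators: -3 < value <= 7, so the inclusion flips at both ends.
flipped_bounds = [
    in_range(v, -3, 7, left_op=operator.lt, right_op=operator.le) for v in (-3, 0, 7)
]
```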
parse(s)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
class MyTask(luigi.Task):
    my_param = luigi.ChoiceParameter(choices=[0.1, 0.2, 0.3], var_type=float)
Consider using EnumParameter for a typed, structured alternative. This class can perform the same role
when all choices are the same type and transparency of parameter value on the command line is desired.
Parameters
• var_type (function) – The type of the input variable, e.g. str, int, float, etc. Default:
str
• choices – An iterable, all of whose elements are of var_type to restrict parameter choices
to.
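The combination of var_type and choices amounts to "convert, then check membership". The sketch below shows that idea in plain Python; parse_choice is a hypothetical helper, and the error message is illustrative and not Luigi's.

```python
def parse_choice(s, choices, var_type=str):
    """Convert the raw string with var_type, then require it to be an allowed choice."""
    value = var_type(s)
    if value not in choices:
        raise ValueError("%r is not one of %r" % (value, sorted(choices)))
    return value


allowed = {0.1, 0.2, 0.3}
accepted = parse_choice("0.2", allowed, var_type=float)

try:
    parse_choice("0.5", allowed, var_type=float)
    rejected = False
except ValueError:
    rejected = True
```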
parse(s)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
Returns the parsed value.
normalize(var)
Given a parsed parameter value, normalizes it.
The value can either be the result of parse(), the default value or arguments passed into the task’s
constructor by instantiation.
The behaviour is implementation-defined, but normalize can be used to validate or clamp values. For
example, if you wanted to only accept even integers, and “correct” odd values down to the nearest even
integer, you can implement normalize as x // 2 * 2.
class luigi.OptionalParameter(default=<object object>, is_global=False, significant=True,
description=None, config_path=None, positional=True,
always_in_help=False, batch_method=None,
visibility=<ParameterVisibility.PUBLIC: 0>)
Bases: luigi.parameter.Parameter
A Parameter that treats empty string as None
Parameters
• default – the default value for this parameter. This should match the type of the
Parameter, e.g. datetime.date for DateParameter or int for IntParameter. By
default, no default is stored and the value must be specified at runtime.
• significant (bool) – specify False if the parameter should not be treated as part of
the unique identifier for a Task. An insignificant Parameter might also be used to specify a
password or other sensitive information that should not be made public via the scheduler.
Default: True.
• description (str) – A human-readable string describing the purpose of this Parameter.
For command-line invocations, this will be used as the help string shown to users. Default:
None.
• config_path (dict) – a dictionary with entries section and name specifying a
config file entry from which to read the default value for this parameter. DEPRECATED.
Default: None.
• positional (bool) – If true, you can set the argument as a positional argument. It’s true
by default but we recommend positional=False for abstract base classes and similar
cases.
• always_in_help (bool) – For the --help option in the command line parsing. Set true
to always show in --help.
• batch_method (function(iterable[A])->A) – Method to combine an iterable
of parsed parameter values into a single value. Used when receiving batched parameter lists
from the scheduler. See Batching multiple parameter values into a single run.
• visibility – A Parameter whose value is a ParameterVisibility. Default value
is ParameterVisibility.PUBLIC.
serialize(x)
Opposite of parse().
Converts the value x to a string.
Parameters x – the value to serialize.
parse(x)
Parse an individual value from the input.
The default implementation is the identity function, but subclasses should override this method for
specialized parsing.
Parameters x (str) – the value to parse.
Returns the parsed value.
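The "empty string as None" behaviour can be sketched as a parse/serialize pair. The two helpers below are hypothetical and written only to illustrate the round trip described for OptionalParameter; they are not taken from Luigi.

```python
def parse_optional(x):
    """Treat an empty string as None; pass any other string through."""
    return x or None


def serialize_optional(x):
    """Opposite of parse_optional: None becomes the empty string."""
    return "" if x is None else str(x)


missing = parse_optional("")
present = parse_optional("abc")
round_trip = parse_optional(serialize_optional(None))
```

One consequence of this design is that an empty string and an absent value cannot be told apart after parsing.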
class luigi.LuigiStatusCode
Bases: enum.Enum
All possible status codes for the attribute status in LuigiRunResult when the argument
detailed_summary=True in luigi.run() / luigi.build. Here are the codes and what they mean:
SUCCESS_WITH_RETRY = (':)', 'there were failed tasks but they all succeeded in a retry
FAILED = (':(', 'there were failed tasks')
FAILED_AND_SCHEDULING_FAILED = (':(', 'there were failed tasks and tasks whose scheduli
SCHEDULING_FAILED = (':(', 'there were tasks whose scheduling failed')
NOT_RUN = (':|', 'there were tasks that were not granted run permission by the schedule
MISSING_EXT = (':|', 'there were missing external dependencies')
run() (luigi.contrib.sparkey.SparkeyExportTask method), 150
run() (luigi.contrib.sqla.CopyToTable method), 154
run() (luigi.notifications.TestNotificationsTask method), 177
run() (luigi.Task method), 227
run() (luigi.task.Task method), 213
run() (luigi.worker.ContextManagedTaskProcess method), 222
run() (luigi.worker.KeepAliveThread method), 223
run() (luigi.worker.TaskProcess method), 222
run() (luigi.worker.Worker method), 224
run_and_track_hadoop_job() (in module luigi.contrib.hadoop), 105
run_combiner() (luigi.contrib.hadoop.JobTask method), 108
run_hive() (in module luigi.contrib.hive), 109
run_hive_cmd() (in module luigi.contrib.hive), 109
run_hive_script() (in module luigi.contrib.hive), 109
run_job (luigi.contrib.hadoop.JobRunner attribute), 105
run_job() (luigi.contrib.bigquery.BigQueryClient method), 84
run_job() (luigi.contrib.hadoop.HadoopJobRunner method), 105
run_job() (luigi.contrib.hadoop.LocalJobRunner method), 106
run_job() (luigi.contrib.hadoop_jar.HadoopJarJobRunner method), 108
run_job() (luigi.contrib.hive.HiveQueryRunner method), 111
run_job() (luigi.contrib.scalding.ScaldingJobRunner method), 144
run_locally (luigi.contrib.sge.SGEJobTask attribute), 147
run_mapper() (luigi.contrib.hadoop.JobTask method), 108
run_reducer() (luigi.contrib.hadoop.JobTask method), 108
run_with_retcodes() (in module luigi.retcodes), 197
RunAnywayTarget (class in luigi.contrib.simulate), 148
Runner (class in luigi.contrib.mrrunner), 117
runner (luigi.contrib.beam_dataflow.BeamDataflowJobTask attribute), 80
runner (luigi.contrib.beam_dataflow.DataflowParamKeys attribute), 78
running_tasks (luigi.worker.GetWorkResponse attribute), 221
runtime_flag (luigi.contrib.lsf.LSFJobTask attribute), 114
S
s3 (luigi.contrib.s3.S3Client attribute), 137
s3_load_path() (luigi.contrib.redshift.S3CopyToTable method), 133
s3_unload_path (luigi.contrib.redshift.RedshiftUnloadTask attribute), 136
S3Client (class in luigi.contrib.s3), 137
S3CopyJSONToTable (class in luigi.contrib.redshift), 134
S3CopyToTable (class in luigi.contrib.redshift), 132
S3EmrTarget (class in luigi.contrib.s3), 140
S3EmrTask (class in luigi.contrib.s3), 140
S3FlagTarget (class in luigi.contrib.s3), 139
S3FlagTask (class in luigi.contrib.s3), 140
S3Opener (class in luigi.contrib.opener), 120
S3PathTask (class in luigi.contrib.s3), 140
S3Target (class in luigi.contrib.s3), 139
salesforce (class in luigi.contrib.salesforce), 141
SalesforceAPI (class in luigi.contrib.salesforce), 142
sample() (luigi.contrib.hadoop.LocalJobRunner method), 106
sandbox_name (luigi.contrib.salesforce.QuerySalesforce attribute), 141
save_job_info (luigi.contrib.lsf.LSFJobTask attribute), 114
sb_security_token (luigi.contrib.salesforce.salesforce attribute), 141
sc (luigi.contrib.pyspark_runner.SparkContextEntryPoint attribute), 129
ScaldingJobRunner (class in luigi.contrib.scalding), 144
ScaldingJobTask (class in luigi.contrib.scalding), 144
Scheduler (class in luigi.scheduler), 201
scheduler (class in luigi.scheduler), 199
scheduler_host (luigi.interface.core attribute), 172
scheduler_port (luigi.interface.core attribute), 172
scheduler_url (luigi.interface.core attribute), 172
SchedulerMessage (class in luigi.worker), 222
scheduling_error (luigi.retcodes.retcode attribute), 196
SCHEDULING_FAILED (luigi.execution_summary.LuigiStatusCode attribute), 169
SCHEDULING_FAILED (luigi.LuigiStatusCode attribute), 248
schema (luigi.contrib.bigquery.BigQueryLoadTask attribute), 85
schema (luigi.contrib.presto.PrestoTask attribute), 128
schema (luigi.contrib.sqla.CopyToTable attribute), 153
security_token (luigi.contrib.salesforce.salesforce attribute), 141
Y
Year (class in luigi.date_interval), 166
YearParameter (class in luigi), 234
YearParameter (class in luigi.parameter), 183
Z
zone (luigi.contrib.beam_dataflow.BeamDataflowJobTask attribute), 80