vsupalov


I'm New to Luigi, What's Good to Know About?

Important concepts and best practices which will help you get started and become comfortable with the tool faster.

February 8, 2016 [ luigi ]

You’re new to the Luigi framework, and already pretty close to falling in love with it. At the moment, you are using it akin to an ETL tool but would like to know what else it can be used for. After all, it’s supposed to be general purpose, right?

Right now there’s little time to dive into the topic in depth, but it would still be good to know the ins and outs of slightly hidden but useful concepts you might have overlooked. The following list is a mix of common misconceptions and not-widely-advertised features which you should know about. It is neither exhaustive nor particularly well structured.

In addition, this information is no substitute for a thorough study of the topic, but knowing these points can save you time and help you understand the tool better until you get around to it.

Basics

  • Atomic file operations are darn important to keep in mind, especially if you write your own tasks.
  • As is idempotency.
  • Tasks will not run again if their output target already exists. Nor will they trigger the tasks they depend on.
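The points above are closely related: when you write through `target.open('w')`, Luigi's LocalTarget goes through a temporary file and renames it into place on close, so a crashed task never leaves a half-written output behind, and the exists-check on the output decides whether a task runs at all. The underlying idea in plain Python (the function names here are my own, not Luigi's API):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write to a temp file first, then rename into place.

    The rename is atomic on POSIX, so readers never observe a
    half-written file -- the same trick Luigi's LocalTarget uses
    when you write via target.open('w').
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomic rename
    except BaseException:
        os.unlink(tmp_path)
        raise

def run_if_missing(path, produce):
    """Idempotency: skip the work entirely if the output already exists,
    analogous to Luigi skipping a task whose output target exists."""
    if os.path.exists(path):
        return False
    atomic_write(path, produce())
    return True
```

Because the output only ever appears fully written, re-running the same flow after a crash picks up exactly where it left off.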

The Central Scheduler

  • It does no cron-like scheduling, nor any actual data crunching.
  • It’s okay to run the same data flows multiple times on different machines, as long as there is one central scheduler. It is single-threaded and will prevent duplicate tasks from being started.
  • A valid approach in production is to simply start the same tasks every minute with cron. This comes in handy if the timing of incoming data is not under your control.
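A crontab entry along these lines is one way to do that (the module and task names are placeholders, not from this post). Thanks to the completed-output check, work that is already done is simply skipped on each invocation:

```
# Retry the flow every minute; tasks whose outputs exist are skipped.
# "mytasks" and "IngestLatest" are hypothetical names.
* * * * *  luigi --module mytasks IngestLatest --scheduler-host localhost >> /var/log/luigi-cron.log 2>&1
```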

Often Overlooked

  • Try to reuse as many existing tasks as possible. This will save you from reinventing the wheel and making nasty mistakes. For example, there are ready-made BigQuery load task and target classes in luigi.contrib.
  • You can run multiple tasks concurrently on a single machine if the graph permits it, by providing a --workers argument with a number larger than 1.
  • You can use dummy files as checkpoints. If they are located in a distributed file system, you can coordinate work on multiple worker machines.
  • It is possible to build dynamic pipelines by yielding tasks from the run method, in addition to the return values of requires().
