vsupalov

I'm New to Luigi, What's Good to Know About?

February 08, 2016

You’re new to the Luigi framework, and already pretty close to falling in love with it. At the moment, you are using it akin to an ETL tool but would like to know what else it can be used for. After all, it’s supposed to be general purpose, right?

Right now there’s little time to dive into the topic in depth, but it would still be good to know the ins and outs of some slightly hidden but useful concepts you might have overlooked. The following list is a mix of common misconceptions and features that are not widely advertised but worth knowing about. It is neither exhaustive nor particularly well structured.

This list is no substitute for studying the topic thoroughly, but knowing these points can save you time and help you understand the tool better until you get around to it.

Basics

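  • Atomic file operations are important to keep in mind, especially if you write your own tasks. Luigi decides whether a task is done by checking that its output exists, so a half-written output file would make a failed task look complete.
  • So is idempotency: running a task twice should leave you with the same result as running it once.
  • A task will not run again if its output target already exists. In that case, the tasks it depends on will not be run either.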
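
To make the first and last points concrete, here is a minimal sketch in plain Python, without Luigi. It writes to a temporary file and renames it into place, so the output either exists completely or not at all, and it skips the work when the output is already there. The helper names write_atomically and run_once are made up for this illustration and are not part of Luigi.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to path so that readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target, so the final rename
    # stays on a single file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # Swap the finished file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_once(path, produce):
    """Hypothetical helper: only do the work if the output is missing."""
    if os.path.exists(path):
        return False  # output exists, so the "task" counts as done
    write_atomically(path, produce())
    return True
```

Calling run_once twice with the same path does the work only the first time. If I remember correctly, Luigi's own local file target already writes through a temporary file in a similar way when you use self.output().open('w'), so this mostly matters when your task writes files through other means.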

The Central Scheduler

  • It does not do any cron-like scheduling, nor does it do any actual data crunching.
  • It’s okay to run the same data flows multiple times on different machines, as long as they all talk to one central scheduler. It is single-threaded and will prevent duplicate tasks from being started.
  • A valid approach in production is to just start the same tasks every minute with cron. This comes in handy if you do not control when the incoming data arrives.

Often Overlooked

  • Try to reuse as many existing tasks as possible. This will save you from reinventing the wheel and making nasty mistakes. For example, there are BigQuery LoadTask and Target classes.
  • You can run multiple tasks concurrently on a single machine if the graph permits it, by providing a --workers argument with a number larger than 1.
  • You can use dummy files as checkpoints. If they are located in a distributed file system, you can use them to coordinate work across multiple worker machines.
  • It is possible to build dynamic pipelines by yielding tasks from the run method, in addition to the ones returned by requires().
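
The last point is easier to picture with a toy model. The sketch below is plain Python and deliberately not Luigi's implementation: ToyTask, Chunk, Report and build are invented names, and a set stands in for "the output exists". It only mirrors the control flow, where run() is a generator and every task it yields gets built before the task itself counts as finished. The real worker also deals with scheduling, retries and workers, none of which is modelled here.

```python
class ToyTask:
    """Stand-in for a task. The `done` set plays the role of the output check."""

    done = set()  # names of tasks whose output "exists"

    def __init__(self, name):
        self.name = name

    def complete(self):
        return self.name in ToyTask.done

    def requires(self):
        return []  # static dependencies, known before run() starts

    def run(self):
        return None  # nothing to yield by default


class Chunk(ToyTask):
    pass


class Report(ToyTask):
    def run(self):
        # The number of chunks is only known at run time, so the
        # dependencies are yielded here instead of listed in requires().
        for i in range(3):
            yield Chunk(f"chunk-{i}")


def build(task, log):
    """Toy runner: build dependencies first and skip finished tasks."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    steps = task.run()
    if steps is not None:
        # run() was a generator: each yielded task is a dynamic dependency.
        for dep in steps:
            build(dep, log)
    ToyTask.done.add(task.name)
    log.append(task.name)
```

Building Report("report") with an empty log runs the three chunks first and the report last. A second build does nothing, because everything is already marked as done.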
