Forcing Luigi to Rerun Single Tasks or the Complete Pipeline

How do you make sure that your Luigi task runs again, even though it finished successfully before?

January 11, 2016 [ luigi ]

You are developing a new Luigi task for your data pipeline. It ran successfully and the output looked right. Whoops, not quite: the code needed a fix, or some of the input data has changed since. The current output is wrong and will stay wrong until you run the task again. But when you try, Luigi refuses to do any work and says that all is well:

Did not run any tasks
This progress looks :) because there were no failed tasks or missing external dependencies

Is the scheduler at fault here, refusing to run something which has recently succeeded? Or is it the fault of the parameters you provided? Maybe a small change to those will help? Nope. Skipping work when an output already exists is intended, and it is one of Luigi’s core features for making huge data pipelines simple to work with.

Despite the frequent comparisons to the build tool make, Luigi does not care whether the source data has changed, nor does it pay any attention to changed code. This is because Luigi tasks are assumed to be idempotent. It is a handy assumption that saves a lot of work in the long run, but it also means that as soon as an output is produced, it is considered final and correct ever after.
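The rule can be illustrated with a minimal sketch in plain Python. This is not Luigi’s code, only a model of the idea that a task counts as done as soon as its output exists; the function name run_if_incomplete and the file name report.txt are made up for the example.

```python
import os
import tempfile


def run_if_incomplete(output_path, source_text):
    """Write the output only if it is missing; return True if work was done."""
    if os.path.exists(output_path):
        # The output exists, so the task is considered complete.
        # Neither the input data nor the code is compared against it.
        return False
    with open(output_path, "w") as handle:
        handle.write(source_text.upper())
    return True


workdir = tempfile.mkdtemp()
output_path = os.path.join(workdir, "report.txt")

first = run_if_incomplete(output_path, "old data")   # does the work
second = run_if_incomplete(output_path, "new data")  # skipped, input ignored

with open(output_path) as handle:
    stale = handle.read()  # still derived from "old data"
```

The second call returns without touching the file even though the input changed, which is the behaviour described above.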

To force a single task to run again, remove its output file and call it directly. In the same vein, the best way to make sure your complete pipeline runs again is to remove all intermediate and final outputs (files, maybe even complete database tables) which it produces. Watch out: a single existing output in the middle of your pipeline will prevent the tasks upstream of it (the ones it depends on) from running, even if their own output files don’t exist, because they are not, strictly speaking, needed to produce the final result. Happy plumbing!
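For the curious, here is a rough sketch of that last warning using a toy three-step pipeline in plain Python. ToyTask, build and remove_outputs are hypothetical stand-ins written for this example, not part of Luigi; they only imitate the "skip whatever already has an output" behaviour.

```python
import os
import tempfile


class ToyTask:
    """Hypothetical stand-in for a Luigi task with a single output file."""

    def __init__(self, name, workdir, requires=()):
        self.name = name
        self.path = os.path.join(workdir, name + ".txt")
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.path)

    def run(self):
        with open(self.path, "w") as handle:
            handle.write("output of " + self.name + "\n")


def build(task, ran):
    """Run `task` and its dependencies, skipping whatever is complete."""
    if task.complete():
        return  # the dependencies of a complete task are not even looked at
    for dependency in task.requires:
        build(dependency, ran)
    task.run()
    ran.append(task.name)


def remove_outputs(task):
    """Delete the output of `task` and of everything it depends on."""
    for dependency in task.requires:
        remove_outputs(dependency)
    if os.path.exists(task.path):
        os.remove(task.path)


workdir = tempfile.mkdtemp()
extract = ToyTask("extract", workdir)
transform = ToyTask("transform", workdir, requires=[extract])
report = ToyTask("report", workdir, requires=[transform])

first_run = []
build(report, first_run)  # everything runs, in dependency order

# Remove only the first and last outputs; the middle one stays in place.
os.remove(extract.path)
os.remove(report.path)
partial_run = []
build(report, partial_run)  # extract is skipped: transform is already complete

# Remove every output, then build again to get a full rerun.
remove_outputs(report)
full_rerun = []
build(report, full_rerun)
```

In the partial run only report is rebuilt, from the stale transform output, and extract never runs. Removing every output first is what brings all three steps back.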
