science

Lessons learned while adopting Snakemake

Published: 2020-01-12 4 minute read

Recently, I decided to adopt the Snakemake workflow system, as I had started to hit some pain points in my project that it reportedly solves. While I believe that this was ultimately a great decision in the long term, it did take me a week to fully move things over when I had expected it to just take a day. Much of this work was adapting my code to the framework and was really only necessary because I was not adhering to the Unix philosophy in the first place, which is entirely my fault, but Snakemake fortunately incentivizes this. That said, it was very frustrating at points, which I think is a combination of the inherent complexity of the problem it's solving, some documentation issues, and unhelpful error messages. I decided to describe these problems and how I solved them.

Input referenced a variable that wasn't referenced by the output

This was the most conceptually surprising part of the framework for me. I had assumed that in a chain of rules, one rule's input could reference the output of the previous rule using a pattern. However, this is only true if every variable used in the input section is also used in the output! Even when directly using rules.some_rule.output as the input, Snakemake couldn't figure out what I meant, since the input had a variable {run} that I wasn't using in the output filename. The solution was to refactor how I was building up the file paths, so that the unique thing that identifies each dataset is the only string represented by the variable.
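To make the constraint concrete, here is a toy model in plain Python of why the input can only use variables that appear in the output. It is a sketch of the idea, not Snakemake's actual implementation: the helper match_wildcards and the paths merged/, raw/ and {sample} are made up for illustration, and only {run} comes from my real pipeline.

```python
import re

def match_wildcards(pattern, path):
    """Return the values that make `pattern` equal `path`, or None."""
    # Split "merged/{sample}.fastq" into literal text and variable names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Even indices are literal text, odd indices are variable names.
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None

# The requested file is matched against the OUTPUT pattern first.
output_pattern = "merged/{sample}.fastq"  # hypothetical
wildcards = match_wildcards(output_pattern, "merged/A1.fastq")

# An input that only uses variables known from the output can be filled in.
good_input = "raw/{sample}_R1.fastq".format(**wildcards)

# An input that uses {run} cannot: nothing in the output tells us its value.
try:
    "raw/{run}/{sample}_R1.fastq".format(**wildcards)
    unresolved = None
except KeyError as err:
    unresolved = err.args[0]
```

The values flow from the requested output file back to the input, never the other way around, which is why {run} had nowhere to come from.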

A pseudo-rule that lists every file you want to make is required

One of the more confusing aspects of Snakemake to me (which effectively explains the previous issue) is that while it can use globbing to infer all the inputs to a rule, it can only do that because the outputs are explicitly defined by subsequent rules, with an all-encompassing pseudo-rule (usually called rule all) at the very end. My mental framework going in worked in the opposite direction: I assumed Snakemake would start from the inputs and infer what work was necessary, but this isn't the case. I feel like this should be a solvable problem, but I have a gut feeling it runs into some sort of halting-problem issue.
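In practice this means you have to produce the complete list of final files yourself. A minimal sketch in plain Python of building that list, assuming hypothetical dataset names and a hypothetical output pattern (Snakemake's own expand() helper does a similar job inside a Snakefile):

```python
# Hypothetical dataset identifiers; in a real pipeline these might come
# from a sample sheet or from globbing the raw data directory.
samples = ["A1", "A2", "B1"]

# Hypothetical pattern for the final file of the pipeline.
final_pattern = "aligned/{sample}.bam"

# Every file the pipeline should end up producing, listed explicitly.
# This is the list that the all-encompassing pseudo-rule takes as its input.
targets = [final_pattern.format(sample=s) for s in samples]
```

Given this list, each target can be matched against the output patterns of the rules, and the work needed is derived backwards from there.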

An input is not necessary

Virtually every example rule in the documentation has an input, but this isn't strictly necessary. At the beginning of my pipeline I take sequencing data and merge the paired end reads, and while the script that's run does take actual inputs, there was no reason to reference them.

Referencing a directory that contains outputs as an input will cause Snakemake to rerun the pipeline unnecessarily

I had mistakenly referenced an output directory as an input to a rule when it was completely unnecessary (it was actually the input mentioned in the previous section that wasn't necessary at all). Because files were being written to that directory, the last-modified timestamp of the directory was being updated, so it always appeared to Snakemake that the earlier rule had a fresher input than the rest of the pipeline. Unfortunately, this was also my very first rule, so the entire pipeline was running from scratch even after it had completed, which took several hours. Although this is not as likely to happen to other people, it's definitely what motivated this post.
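The underlying filesystem behavior is easy to reproduce outside of Snakemake. The sketch below uses only the Python standard library and a temporary directory; the directory and file names are made up. It backdates a directory, writes a file into it, and checks that the directory's last-modified time has moved forward.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as workdir:
    outdir = os.path.join(workdir, "merged")  # hypothetical output directory
    os.mkdir(outdir)

    # Backdate the directory by an hour so the change below is unambiguous.
    an_hour_ago = time.time() - 3600
    os.utime(outdir, (an_hour_ago, an_hour_ago))
    mtime_before = os.path.getmtime(outdir)

    # Creating a file inside the directory updates the directory's mtime.
    with open(os.path.join(outdir, "A1.fastq"), "w") as handle:
        handle.write("placeholder\n")

    mtime_after = os.path.getmtime(outdir)

# True: the directory now looks freshly modified to any timestamp-based tool.
directory_looks_newer = mtime_after > mtime_before
```

Any rule that lists such a directory as an input will see it as newer every time something is written into it.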

Script sections referenced variables that are surprisingly out of scope

I had rules with a script: section, and I tried to reference a variable called {run} that was used in the input and output sections. The solution is simply to use {wildcards.run}, but the error message doesn't make this clear. This is in the documentation, so I was fundamentally the problem there.

Not escaping Python formatting properly

When using .format() to build up strings in my Snakefile, I didn't realize that you had to escape the Snakemake variables with double curly braces. This has nothing to do with Snakemake; I was just not too familiar with how Python's str.format() method worked. However, the error message didn't make it clear that this was a Python string formatting issue and not a Snakemake issue.
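For reference, this is what the escaping looks like in plain Python. The path and the names experiment and sample are hypothetical; the point is that the doubled braces survive .format() as single braces, leaving a pattern that Snakemake can fill in later.

```python
experiment = "exp1"  # hypothetical value known when the Snakefile is parsed

# {experiment} is filled in now; {{sample}} is escaped and comes out as {sample}.
pattern = "results/{experiment}/{{sample}}.fastq".format(experiment=experiment)

# Without the escaping, Python tries to fill in {sample} too and fails.
try:
    "results/{experiment}/{sample}.fastq".format(experiment=experiment)
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]
```

Here pattern ends up as results/exp1/{sample}.fastq, and the unescaped version raises a KeyError naming sample.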

Conclusions

I feel that there's probably a better way to present the logic of Snakemake to new users such that they won't encounter these errors, and in some cases the error messages could be improved. Overall, however, this is so much better than GNU Make for a host of reasons, and is orders of magnitude better than not having any build system at all. Once I get enough experience with it, I might write a tutorial or propose changes to the official documentation.

Fission yeast don't age

Published: 2017-01-31 1 minute read

My colleagues and I just published our work on fission yeast aging in eLife. Although we had hoped to develop a more human-like single-celled model organism to study eukaryotic aging, we came to the surprising conclusion that fission yeast simply don't age. That is, the chance of a fission yeast dying is constant throughout its lifespan. Still, it's a fascinating result.