programming

Lessons learned while adopting Snakemake

Published: 2020-01-12 4 minute read

Recently, I decided to adopt the Snakemake workflow system, as I had started to hit some pain points in my project that it reportedly solves. While I believe that this was ultimately a great decision in the long term, it did take me a week to fully move things over when I had expected it to just take a day. Much of this work was adapting my code to the framework and was really only necessary because I was not adhering to the Unix philosophy in the first place, which is entirely my fault, but Snakemake fortunately incentivizes this. That said, it was very frustrating at points, which I think is a combination of the inherent complexity of the problem it's solving, some documentation issues, and unhelpful error messages. I decided to describe these problems and how I solved them.

Input referenced a variable that wasn't referenced by the output

This was the most conceptually surprising part of the framework for me. I had assumed that if you had a chain of rules, that one rule's input could reference the output of the previous rule using a pattern. However, this is only true if any variables that the input section has are also used in the output! Even when directly using rules.some_rule.output as the input, Snakemake couldn't figure out what I meant, since it had a variable {run} that I wasn't using in the output filename. The solution was to refactor how I was building up the file paths, so that the unique thing that identifies each dataset is the only string represented by the variable.

A pseudo-rule that lists every file you want to make is required

One of the more confusing aspects of Snakemake to me (which effectively explains the previous issue) is that while it can use globbing to infer all the inputs to a rule, it can only do that because the outputs are explicitly defined my subsequent rules, with an all-encompassing pseudo-rule (usually called rule all) at the very end. My mental framework going in was working in the opposite direction and assumed it was just inferring what work was necessary, but this isn't the case. I feel like this should be a solvable problem but I have a gut feeling it runs into some sort of halting problem issues.

An input is not necessary

Virtually every example rule in the documentation has an input, but this isn't strictly necessary. At the beginning of my pipeline I take sequencing data and merge the paired end reads, and while the script that's run does take actual inputs, there was no reason to reference them.

Referencing a directory that contains outputs as an input will cause Snakemake to rerun the pipeline unnecessarily

I had mistakenly referenced an output directory as an input to a rule when it was completely unnecessary (it was actually the input mentioned in the previous section that wasn't necessary at all). Because files were being written to that directory, the last-modified timestamp of the directory was being updated, so it always appeared to Snakemake that the earlier rule had a fresher input than the rest of the pipeline. Unfortunately, this was also my very first rule, so the entire pipeline was running from scratch even after it had completed, which took several hours. Although this is not as likely to happen to other people, it's definitely what motivated this post.

Script sections referenced variables that are surprisingly out of scope

I had rules with a script: section, and I tried to reference a variable called {run} that was used in the input and output sections. The solutions is simply to use {wildcards.run}, but the error message doesn't make this clear. This is in the documentation so I was fundamentally the problem there.

Not escaping Python formatting properly

When using .format() to build up strings in my Snakefile, I didn't realize that you had to escape the Snakemake variables with double curly braces. This has nothing to do with Snakemake, I was just not too familiar with how the built-in format() function worked. However, the error message didn't make it clear that this was a Python syntax issue and not a Snakemake issue.

Conclusions

I feel that there's probably a better way to present the logic of Snakemake to new users such that they won't encounter these errors, and in some cases the error messaging might be able to be improved. Overall, however, this is so much better than GNU Make for a host of reasons, and is orders of magnitude better than not having any build system at all. Once I get enough experience with it I might write a tutorial or propose changes to the official documentation.

Custom scatterplot colors with colorbar in Matplotlib

Published: 2018-01-24 1 minute read

I needed to plot a standard scatterplot, but color the points according to some third dimension of data, and add a colorbar. This was surprisingly not documented anywhere I could find.

import matplotlib
import matplotlib.pyplot as plt
import random

# We have three dimensions of data. x and y will be plotted on the x and y axis, while z will 
# be represented with color.
# If z is a numpy array, matplotlib refuses to plot this.
x = list(range(300))
y = sorted([random.random()*10 for i in range(300)])
z = list(reversed([i**3 for i in y]))

# cmap will generate a tuple of RGBA values for a given number in the range 0.0 to 1.0 
# (also 0 to 255 - not used in this example).
# To map our z values cleanly to this range, we create a Normalize object.
cmap = matplotlib.cm.get_cmap('viridis')
normalize = matplotlib.colors.Normalize(vmin=min(z), vmax=max(z))
colors = [cmap(normalize(value)) for value in z]

fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(x, y, color=colors)

# Optionally add a colorbar
cax, _ = matplotlib.colorbar.make_axes(ax)
cbar = matplotlib.colorbar.ColorbarBase(cax, cmap=cmap, norm=normalize)

colorbar-scatterplot

Using compound data types in h5py

Published: 2017-12-31 2 minute read

Compound data types allow you to create Numpy arrays of heterogeneous data types and store them in HDF5. For the example in this blog post, we want to store X and Y coordinates (as unsigned 32-bit integers), an intensity value (as a float), and a DNA sequence (as a variable-length string).

import numpy as np
import h5py

my_datatype = np.dtype([('x', np.uint32),
                        ('y', np.uint32),
                        ('intensity', np.float),
                        ('sequence', h5py.special_dtype(vlen=str))])
                 
h5 = h5py.File("my_data.h5", "w")
data = [(1003, 4321, 43.2, "ACGTACTG"), (55, 4098, 12.1, "GGT"), (3209, 909, 59.7, "ACTAC")]
dataset = h5.create_dataset('/coordinates', (len(data),), dtype=my_datatype)
data_array = np.array(data, dtype=my_datatype)

We can look at data_array and see how this datatype looks:

In [1]: data_array
Out[1]: 
array([(1003, 4321,  43.2, 'ACGTACTG'), (  55, 4098,  12.1, 'GGT'),
       (3209,  909,  59.7, 'ACTAC')],
      dtype=[('x', '<u4'), ('y', '<u4'), ('intensity', '<f8'), ('sequence', 'O')])

To save the data efficiently, just assign it to the dataset like so:

dataset[...] = data_array

Regular indexing works:

In [2]: h5['/coordinates'][0]
Out[2]: (1003, 4321,  43.2, 'ACGTACTG')

Selection syntax works basically as you'd expect. Optionally, you can tack on an extra index with fields in it to limit the fields that are returned:

selector = (h5['coordinates']['x'] > 1000) & (h5['coordinates']['intensity'] > 40.0)
fields = ['x', 'y']
coords = h5['/coordinates'][selector][fields]

coords is now a compound datatype consisting of just two 32-bit unsigned integers:

In [3]: coords
Out[3]:
array([(1003, 4321), (3209,  909)],
      dtype=[('x', '<u4'), ('y', '<u4')])

Even though coords is homogenous, many numpy operations can't be performed on it. This might not be the best way to do this, but you can get a regular array with:

In [4]: raw_coordinates = np.stack((coords['x'], coords['y']), axis=-1)
In [5]: raw_coordinates
Out[5]: 
array([[1003, 4321],
       [3209,  909]], dtype=uint32)

Coherent Diagnostics

Published: 2017-02-27 2 minute read

I had to get some diagnostic information from a Coherent laser. It's super easy if you connect via USB on an Ubuntu machine.

First, install minicom:

sudo apt install minicom

Now you need to find the name of the device. Run

dmesg | grep tty

The output will look something like this:

$ dmesg | grep tty
[    0.000000] console [tty0] enabled
[26890.843225] usb 1-2: FTDI USB Serial Device converter now attached to ttyUSB0

Unplug the USB and plug it back in again, and rerun

dmesg | grep tty

You'll see additional messages about something getting disconnected and attached, similar to this:

$ dmesg | grep tty
[    0.000000] console [tty0] enabled
[26890.843225] usb 1-2: FTDI USB Serial Device converter now attached to ttyUSB0
[27005.835517] ftdi_sio ttyUSB0: FTDI USB Serial Device converter now disconnected from ttyUSB0
[27057.953214] usb 1-2: FTDI USB Serial Device converter now attached to ttyUSB0

In this example, the device name is ttyUSB0. Now run

sudo minicom -s

and select “Serial port setup”. Type A (no enter key, it will automatically move your cursor) and change the serial device to

/dev/ttyUSB0

or whatever your device name is. It will have /dev/ in front of it no matter what.

Now type E to set up the other parameters. Type B to lower the speed (or A to raise it) to 19200. It says the current speed at the top. Type Q to set the parity and stop bits. When you're done, it should say at the top:

Current: 19200 8N1

If it does, press enter. Now choose Exit (not Exit from Minicom). You'll be taken to a terminal where you're now connected. You can start issuing commands, which all start with a question mark. Try:

?STA

It should return some sort of number or something. The rest of the commands should be saved here, or you can ask Coherent for them.

nd2reader

Published: 2015-05-26 1 minute read

I recently published nd2reader, a Python library that reads .nd2 image files produced by NIS Elements. Most of the credit goes to the authors of SLOTH who figured out how nd2s are actually structured.

nd2reader improves on things in a few ways. It can associate metadata with individual images, so you can know the channel, timestamp, field of view and z-level of any particular image, which I believe is novel among all the nd2 reading libraries. It also provides a simple, easy-to-use interface and (most importantly to me) does not require Java alongside Python, as Bio-Formats does.

Since it's now in PyPI, you can install it with:

pip install nd2reader

Update: I handed over control of this project and no longer maintain it.