Compound data types allow you to create NumPy arrays of heterogeneous data types and store them in HDF5. For the example in this blog post, we want to store X and Y coordinates (as unsigned 32-bit integers), an intensity value (as a 64-bit float), and a DNA sequence (as a variable-length string).

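import numpy as np
import h5py

# Compound dtype: two unsigned 32-bit ints, a 64-bit float, and a variable-length string
my_datatype = np.dtype([('x', np.uint32),
                        ('y', np.uint32),
                        ('intensity', np.float64),  # np.float was removed in NumPy 1.24
                        ('sequence', h5py.special_dtype(vlen=str))])
h5 = h5py.File("my_data.h5", "w")
data = [(1003, 4321, 43.2, "ACGTACTG"), (55, 4098, 12.1, "GGT"), (3209, 909, 59.7, "ACTAC")]
# Create an empty dataset with one record per tuple
dataset = h5.create_dataset('/coordinates', (len(data),), dtype=my_datatype)
# Build the in-memory structured array that will be written to the dataset
data_array = np.array(data, dtype=my_datatype)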

We can look at `data_array` and see how this datatype looks:

In [1]: data_array
Out[1]:
array([(1003, 4321, 43.2, 'ACGTACTG'), ( 55, 4098, 12.1, 'GGT'),
(3209, 909, 59.7, 'ACTAC')],
dtype=[('x', '<u4'), ('y', '<u4'), ('intensity', '<f8'), ('sequence', 'O')])
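The dtype can also be inspected programmatically rather than read off the repr. This is a minimal sketch, assuming plain `object` as a stand-in for the variable-length string field so that only NumPy is needed; the name `record_dtype` is made up for illustration.

```python
import numpy as np

# Same layout as the post's datatype, with `object` standing in for the
# variable-length string field (an assumption to keep this NumPy-only).
record_dtype = np.dtype([("x", np.uint32),
                         ("y", np.uint32),
                         ("intensity", np.float64),
                         ("sequence", object)])

field_names = record_dtype.names            # field names in declaration order
intensity_type = record_dtype["intensity"]  # dtype of a single field
y_offset = record_dtype.fields["y"][1]      # byte offset of 'y' within a record

print(field_names)
print(intensity_type, y_offset)
```

Because the dtype is packed (no alignment requested), `y` starts right after the four bytes of `x`.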

To save the data efficiently, just assign it to the dataset like so:

dataset[...] = data_array

Regular indexing works:

In [2]: h5['/coordinates'][0]
Out[2]: (1003, 4321, 43.2, 'ACGTACTG')
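Writing and reading back can be sketched end to end with a context manager, which closes the file even if something fails. This is only a sketch under a few assumptions: the file name `sketch.h5` is hypothetical, the `core` driver with `backing_store=False` keeps everything in memory so nothing is written to disk, and `h5py.string_dtype()` is used as the newer spelling of a variable-length string type. Depending on the h5py version, strings may come back as `bytes`, so the sketch decodes defensively.

```python
import numpy as np
import h5py

record_dtype = np.dtype([("x", np.uint32),
                         ("y", np.uint32),
                         ("intensity", np.float64),
                         ("sequence", h5py.string_dtype())])
rows = [(1003, 4321, 43.2, "ACGTACTG"), (55, 4098, 12.1, "GGT")]
records = np.array(rows, dtype=record_dtype)

# In-memory file: the name is a placeholder and no file is created on disk.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("coordinates", data=records)
    stored_x = ds["x"]   # read a single field from the dataset
    first = ds[0]        # read a single record

first_x = int(first["x"])
seq = first["sequence"]
if isinstance(seq, bytes):  # some h5py versions return bytes for strings
    seq = seq.decode()
print(stored_x, first_x, seq)
```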

Selection syntax works basically as you'd expect. Optionally, you can tack on an extra index with `fields` in it to limit the fields that are returned:

selector = (h5['coordinates']['x'] > 1000) & (h5['coordinates']['intensity'] > 40.0)
fields = ['x', 'y']
coords = h5['/coordinates'][selector][fields]

`coords` is now a structured array whose compound datatype consists of just two 32-bit unsigned integers:

In [3]: coords
Out[3]:
array([(1003, 4321), (3209, 909)],
dtype=[('x', '<u4'), ('y', '<u4')])
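The mask-then-fields pattern can be tried without a file, since reading from the dataset yields ordinary NumPy structured arrays. Here is a small sketch on an in-memory array; the names `records`, `mask`, and `selected` are illustrative, and the string field is left out to keep the example NumPy-only.

```python
import numpy as np

records = np.array([(1003, 4321, 43.2), (55, 4098, 12.1), (3209, 909, 59.7)],
                   dtype=[("x", np.uint32), ("y", np.uint32), ("intensity", np.float64)])

# Boolean mask built from two fields, one entry per record
mask = (records["x"] > 1000) & (records["intensity"] > 40.0)

# First filter the rows, then keep only the requested fields
selected = records[mask][["x", "y"]]
print(selected["x"], selected["y"])
```

The second record fails both conditions, so only the first and third rows remain.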

Even though `coords` is homogeneous, many NumPy operations can't be performed on it. This might not be the best way to do this, but you can get a regular array with:

In [4]: raw_coordinates = np.stack((coords['x'], coords['y']), axis=-1)
In [5]: raw_coordinates
Out[5]:
array([[1003, 4321],
[3209, 909]], dtype=uint32)
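One possible alternative to stacking the fields by hand is `structured_to_unstructured` from `numpy.lib.recfunctions` (available in NumPy 1.16 and later). The sketch below rebuilds a small `coords`-like array in memory, so the input data here is an assumption rather than something read from the file.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# Stand-in for the `coords` array from the post
coords = np.array([(1003, 4321), (3209, 909)],
                  dtype=[("x", np.uint32), ("y", np.uint32)])

# Each record becomes a row; each field becomes a column
raw_coordinates = structured_to_unstructured(coords)
print(raw_coordinates.shape, raw_coordinates.dtype)
```

Since both fields are `uint32`, the resulting regular array keeps that dtype.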