...to clean up the data from the Multi-Worm Tracker. The MWT emits tens of thousands of individual tracked objects over the course of an hour, even though there are only a few dozen worms for it to observe.
An average experiment runs for about 50,000 frames and contains roughly 2 million "blob-frame" tracking entries in total.
Blob: a computer-vision term for a region that differs from its surroundings.
Every time we test a new method or generate a plot, we need to read the data.
- experience with the application
- refactoring something you'll maybe use once and then discard is not the greatest use of your time
- which parts of your code are used frequently? focus your effort there
- thoughtfully designed
- pleasant to use
- Examples:
  - requests (see the snippet after this list)
- PEP-8 does not beautiful code make; it is almost orthogonal
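requests illustrates pleasant external design: common tasks read almost like their description. (These are real requests calls; the URL is just an example.)

import requests

# Fetch a URL and decode the JSON body in two obvious steps.
response = requests.get("https://api.github.com")
response.raise_for_status()  # error handling is one explicit call
data = response.json()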
Numpy can make things more beautiful:
a = list(range(10))
b = list(range(10))
c = [aa + bb for aa, bb in zip(a, b)]  # element-wise sum, by hand

import numpy as np

a = np.arange(10)
b = np.arange(10)
c = a + b  # element-wise sum, vectorized
Or ugly:
x = [[1] * 3, [1] * 2]
y = [[0] * 2, [0] * 5]
z = [sum(zz, []) for zz in zip(x, y)]  # concatenate each pair of lists
z
# [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0, 0]]

# dtype=object is required for jagged (ragged) arrays
x = np.array([np.ones(3), np.ones(2)], dtype=object)
y = np.array([np.zeros(2), np.zeros(5)], dtype=object)
z = np.empty(len(x), dtype=object)
for i in range(len(x)):
    z[i] = np.concatenate([x[i], y[i]])
z
# array([array([1., 1., 1., 0., 0.]),
#        array([1., 1., 0., 0., 0., 0., 0.])], dtype=object)
- beautiful (good external design)
- idiomatic (good internal design)
  - use of generators (see the sketch after this list)
- clean (well-formatted, easy to read)
  - PEP-8
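A minimal sketch of idiomatic, generator-based code (the file layout and field order here are assumptions, not the actual MWT format):

def read_blob_lines(path):
    """Yield one (blob_id, rest_of_line) pair at a time instead of
    reading the whole file into memory."""
    with open(path) as f:
        for line in f:
            blob_id, _, rest = line.partition(" ")
            yield int(blob_id), rest

# Only one line is ever held in memory:
# for blob_id, data in read_blob_lines("experiment.blobs"):
#     ...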
Multiworm - An adapter to load data. Does one thing: load data.
MongoDB
- Why it's appealing
  - Just connect to the database and store or retrieve data as arbitrary documents (schemaless, document-based NoSQL)
  - Fast random lookups if indexed
- Why it didn't work
  - Loading a single experiment (ETL) took hours
  - A change in schema required discarding all imported data
  - Impossible to store all the data efficiently
    - ~10x expansion of data size turns "lots" of data into "too much"
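The appeal is easy to see in code (these are real pymongo calls; the database, collection, and field names are made up for illustration):

from pymongo import MongoClient, ASCENDING

client = MongoClient()  # just connect; no schema to declare up front
blobs = client.worms.blob_frames  # hypothetical database and collection

# Store a blob-frame entry as a document with whatever fields we have.
blobs.insert_one({"blob_id": 12345, "frame": 678, "centroid": [101.2, 45.7]})

# An index makes random lookups by blob fast.
blobs.create_index([("blob_id", ASCENDING)])
record = blobs.find_one({"blob_id": 12345})

# The catch: inserting ~2 million such documents per experiment took
# hours, and the per-document overhead expanded the data ~10x on disk.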
HDF5
- Why it's appealing
  - Compact on-disk storage
  - Constant-time (O(1)) indexing into (some) data
- Why it didn't work
  - Still requires a slow ETL step to import the raw data
  - Jagged arrays in Numpy are clunky
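Jaggedness is awkward on the HDF5 side too: variable-length rows need a special dtype (a minimal sketch using h5py's real vlen support; file and dataset names are assumptions):

import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    # HDF5 has no native jagged arrays; use a variable-length dtype.
    vlen = h5py.special_dtype(vlen=np.dtype("float64"))
    ds = f.create_dataset("blob_x", (2,), dtype=vlen)
    ds[0] = np.arange(5.0)  # one blob's track, 5 frames long
    ds[1] = np.arange(3.0)  # another blob's track, 3 frames long

with h5py.File("experiment.h5", "r") as f:
    track = f["blob_x"][1]  # O(1) lookup by dataset index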
- We keep accessing the raw data files and reading them into some other storage method that is convenient for analysis.
- We keep changing that storage method.
- Create a layer between storage and analysis that exposes a consistent interface, so the storage layer can change later without requiring a rewrite of the analysis code. (Design pattern!) A sketch follows.
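A minimal sketch of such a layer (class and method names are hypothetical, not the actual multiworm API):

from abc import ABC, abstractmethod

class ExperimentStore(ABC):
    """The consistent interface the analysis code depends on."""

    @abstractmethod
    def blob(self, blob_id):
        """Return the data for a single blob."""

class RawFileStore(ExperimentStore):
    """Backend that parses the raw MWT output files on demand."""
    def __init__(self, path):
        self.path = path

    def blob(self, blob_id):
        ...  # parse the raw files

class HDF5Store(ExperimentStore):
    """Backend that reads a pre-converted HDF5 file."""
    def __init__(self, path):
        self.path = path

    def blob(self, blob_id):
        ...  # look up the blob in the HDF5 file

# Analysis code only ever sees ExperimentStore, so swapping the
# storage backend never forces a rewrite of the analysis.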
The interface was designed to make common tasks easy and convenient (see the sketch after this list):
- Get a single blob by ID
- Get all the blobs
- Get where the blob started/ended
- Get all blobs that existed at a specific time
- Get the closest image in time
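A sketch of what such an interface might look like (a skeleton with hypothetical method names, not the actual multiworm API):

class Experiment:
    """Hypothetical interface covering the common tasks above."""

    def __getitem__(self, blob_id):
        """A single blob by ID: experiment[12345]."""

    def __iter__(self):
        """All the blobs: for blob in experiment: ..."""
        return iter(())

    def blobs_at(self, frame):
        """All blobs that existed at a specific frame."""

    def image_near(self, time):
        """The captured image closest to the given time."""

Where a blob started and ended would live on the blob object itself (e.g. hypothetical blob.born and blob.died attributes).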
Loading all the data at once is too much to handle, and usually not needed. Lazy proxies (another design pattern) can help: a cheap stand-in object defers reading the underlying data until it is actually accessed. A sketch follows.
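A minimal sketch of a lazy proxy (the names and the parse_blob helper are hypothetical):

class LazyBlob:
    """Stands in for a blob; reads the raw data only on first access."""

    def __init__(self, experiment, blob_id):
        self._experiment = experiment
        self._blob_id = blob_id
        self._data = None

    @property
    def data(self):
        if self._data is None:  # load on first access, then cache
            self._data = self._experiment.parse_blob(self._blob_id)
        return self._data

# Creating tens of thousands of LazyBlob objects is cheap; only the
# blobs the analysis actually touches are ever read from disk.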