
I’ve been warned that I sometimes veer too far in the direction of toolmaker, away from the standard path followed by most scientists. Try as I might, I cannot seem to avoid finding the process of doing science nearly as interesting as the goal of getting that science done. And so, my mind has been orbiting around a problem I suspect is endemic amongst all physicists, if not all scientists. That problem, captured so nicely by this PhD comic, is that of filesystem cruft.

Science, being at its core an experimental art, produces for every successful idea a whole panoply of failed experiments, mistakes, and generally messed-up crap. Being paranoid creatures consumed by our own fears, and aware that serendipity has been a cornerstone of great work, we are loath to sweep these ill-fated children of the mind into the trash where they (mostly) belong. And so those of us who rely on computers for most of our day-to-day work end up with home directories filled to the brim with old scripts, corrupted data files, a dozen different versions of the same list of values, and other digital detritus. This situation makes for errors, confusion, the thousand-yard stare, anal leakage, and other evils too foul to discuss in polite company. Just looking at my /home directory on my workstation at the University, I have more than 100,000 files sitting around, waiting for me to stare at them for a quarter hour trying to remember what they were for.

I’m pretty certain that this isn’t merely my personal problem. Astronomers have tons of image files, and scripts to work with those images. Theorists have simulations and numerical codes. Experimentalists have data files and reduction tools. And everyone has far too many old versions of all these files, either obsolete or incorrect. This is a big problem, because science is already hard without having to guess which of ten files was correct six months ago. I waste far too much time on trivial mistakes that could be avoided if I had used the “current best” version of a file. Simply deleting old data is no solution, since it seems I go back to earlier versions of a file intentionally at least as often as I do it by accident.

I’ve come to the conclusion that this is a (semi) solved problem. It is almost the same issue that software engineers faced before the advent of modern version control systems. There are a few key differences, though. First, scientists often deal with binary files that standard ASCII-oriented tools like diff don’t play nicely with. Second, we typically produce multiple terabytes of these binary files. I currently have over 200 GB in my research directory, and that is after compressing a lot of old data. So standard tools like svn and vanilla git aren’t going to do the trick on their own. But in principle they have all the features we would need: git can already do everything I would need it to do for source files. I simply need to sit down and think about how to solve the unique challenges physicists would face. In the next post, I will explain my thoughts on how to do this.
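
As a rough way to size up the two obstacles named above (binary files and sheer volume), here is a minimal Python sketch that walks a directory and tallies how many files look like text versus binary, and how many bytes each group takes up. The function names `looks_binary` and `survey` are made up for illustration, and the "contains a NUL byte in its first few kilobytes" test is only a heuristic assumption, not a definitive way to classify files.

```python
import os


def looks_binary(path, chunk_size=8192):
    """Guess whether a file is binary.

    Heuristic (an assumption): a NUL byte in the first chunk_size
    bytes means the file is probably not plain text.
    """
    with open(path, "rb") as fh:
        return b"\0" in fh.read(chunk_size)


def survey(root):
    """Return {"text": [count, bytes], "binary": [count, bytes]} for root."""
    totals = {"text": [0, 0], "binary": [0, 0]}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                size = os.path.getsize(path)
                kind = "binary" if looks_binary(path) else "text"
            except OSError:
                continue  # unreadable file: leave it out of the tally
            totals[kind][0] += 1
            totals[kind][1] += size
    return totals
```

Calling `survey(os.path.expanduser("~"))` on a home directory would give a first estimate of how much of the cruft is text that diff-based tools handle well, and how much is bulky binary data that needs a different approach.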