I’m doing a periodic maintenance task for work, where I have to rebuild the container environment our materials science computations run in, updating the versions of some of the software we use, and I’m struck by just how much scientific knowledge is packed into this Singularity image file that represents the container.
It takes like an hour to compile on modern hardware, starting from an Ubuntu (ha! the built-in spell check in Ubuntu insists on capitalizing itself) base image and adding dozens of computational materials science simulation packages and supporting dependencies. These all have to be downloaded, compiled, linked, and installed in such a way that they can mutually interface, taking data and returning results in the formats we use.
Once the whole thing is built, the final image is something like a gigabyte in size (5x that if we also install CUDA). I know 1 GB doesn’t sound like much data these days, when some internet connections can download that in under a second, but think about it:
This isn’t a movie or an image or a song. This archive contains almost no actual data. That gigabyte of space is pretty much entirely code instead; much of it compiled, executable code at that.
A gigabyte of materials science simulation code, a billion characters’ worth of instructions, specifying the very best established practices for modeling the behavior of materials at the atomic level.
I know the digital age is old news, but I can’t get over that level of cumulative effort. It represents tens of thousands of person-months of work, collectively, but I’m just sitting here watching my computer assemble it from a recipe that’s short enough for me to read and sensibly edit, distilling all of that down into a finely-tuned piece of precision apparatus that fits on a flash drive.
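For the curious, the recipe is a Singularity definition file. A heavily stripped-down sketch of what one looks like (the specific packages below are illustrative stand-ins, not our actual stack):

```
Bootstrap: docker
From: ubuntu:22.04

%post
    # Toolchain and libraries the simulation codes get built against
    apt-get update && apt-get install -y \
        build-essential gfortran cmake git \
        libopenmpi-dev libfftw3-dev liblapack-dev

    # One package built from source, as an example; repeat for dozens more
    git clone https://github.com/lammps/lammps.git /opt/lammps
    cd /opt/lammps && mkdir build && cd build
    cmake ../cmake -D BUILD_MPI=yes -D CMAKE_INSTALL_PREFIX=/usr/local
    make -j4 && make install

%environment
    # Make the installed binaries visible to whatever runs in the container
    export PATH=/usr/local/bin:$PATH
```

A single `singularity build materials_env.sif recipe.def` then grinds through every one of those steps to produce the image.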
And that isn’t even the important part of our research! That’s just all the shit we need installed on the system as background dependencies to be able to run our custom simulation code on top! It’s so complex that at this point, it’s easier to run an entire (limited) virtual machine with our stuff installed in it than to try to convince every supercomputer cluster we work with to install every package separately and keep up with updates.
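Concretely, a job script on any cluster that has Singularity (or its successor, Apptainer) installed just points at the image. The names below are hypothetical, but the shape is right:

```
# Inside a batch job on a cluster with Singularity available
module load singularity          # common convention on HPC systems
mpirun -np 64 \
    singularity exec materials_env.sif \
    python3 run_simulation.py    # our code runs against the packaged stack
```

The only thing the cluster admins have to maintain is Singularity itself; everything else ships inside the .sif file.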
In case you’re wondering what a day in the life of a computational materials physicist looks like: trying to do upgrades on incredibly complex machinery that you cannot touch or see. At least this time it doesn’t also have to stay running while I do it…