Platforms for reproducible research

This post was motivated by Ian Gent’s recomputation manifesto and his blog post about it. While I agree with pretty much everything said there, there is one point that I strongly disagree with, and here I’d like to explain the reasons in some detail. The point in question is “The only way to ensure recomputability is to provide virtual machines”. To be fair, the manifesto specifies that it’s the only way “at least for now”, so perhaps our disagreement is not as pronounced as it may seem.

I’ll start with a quote from the manifesto that shows that we have similar ideas of the time scales over which computational research should be reproducible:
“It may be true that code you make available today can be built with only minor pain by many people on current computers. That is unlikely to be true in 5 years, and hardly credible in 20.”

So the question is: how can we best ensure that the software used in our computational studies can still be run, with reasonable effort, 20 years from now? To answer that question, we have to look at the possible platforms for computational research.

By a “platform”, I mean the combination of hardware and software that is required to use a given piece of digital information. For example, Flash video requires a Flash player and a computer plus operating system that the Flash player can run on. That’s what defines the “Flash platform”. Likewise, today’s “Web platform” (a description that requires a date stamp to be precise, because Web standards evolve so quickly) consists of HTML5, JavaScript, and a couple of related standards. If you want to watch a Flash video in 20 years, you will need a working Flash platform, and if you want to use an archived copy of a 2013 Web site, you need the 2013 Web platform.

If you plan to distribute some piece of digital information with the hope that it will make sense 20 years from now, you must either have confidence in the longevity of the platform, or be willing and able to ensure its long-term maintenance yourself. For the Flash platform, that means confidence in Adobe and its willingness to keep Flash alive (I wouldn’t bet on that). For the 2013 Web platform, you may hope that its sheer popularity will motivate someone to keep it alive, but I wouldn’t bet on it either. The Web platform is too complex and too ill-defined to be kept alive reliably when no one uses it in daily life any more.

Back to computational science. 20 years ago, most scientific software was written in Fortran 77, often with extensions specific to a machine or compiler. Much software from that era relied on libraries as well, but they were usually written in the same language, so as long as their source code remains available, the platform for all that is a Fortran compiler compatible with the one from back then. For standard Fortran 77, that’s not much of a problem, whereas most of the vendor-specific extensions have disappeared since. Much of that 20-year-old software can in fact still be used today. However, reproducing a computational study based on that software is a very different problem: it also requires all the input data and an executable description of the computational protocol. Even in the rare case that all that information is available, it is likely to depend on lots of other software pieces that may not be easy to get hold of any more. The total computational platform for a given research project is in fact as ill-defined as the 2013 Web platform.
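One small step toward making that platform less ill-defined is to write it down next to the results. The following is only a minimal sketch, assuming the study happens to be driven from Python; the function name `describe_platform` is hypothetical, and the record covers just the hardware, OS, and interpreter layer, not the compilers, libraries, or input data that a full description would also need.

```python
import json
import platform
import sys


def describe_platform() -> dict:
    """Collect a few facts about the machine and interpreter running a study.

    Hypothetical helper for illustration: it records only the lowest
    layers of the platform, not libraries, compilers, or input data.
    """
    return {
        "machine": platform.machine(),        # CPU architecture as reported by the OS
        "system": platform.system(),          # operating system name
        "release": platform.release(),        # operating system release
        "implementation": platform.python_implementation(),
        "language_version": platform.python_version(),
        "byte_order": sys.byteorder,          # "little" or "big"
    }


if __name__ == "__main__":
    # Store this output alongside the results of the computation.
    print(json.dumps(describe_platform(), indent=2, sort_keys=True))
```

Such a record does not make anything recomputable by itself, but it at least states which platform a later reader would have to reconstruct.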

Today’s situation is worse, because we use more diverse software written in a wider variety of languages, and also more interactive software whose use is notoriously non-reproducible. The only aspect where we have gained in standardization is the underlying hardware and OS layer: pretty much all computational science is done today on x86 processors running Linux. Hence the idea of conserving the full operating environment in the form of a virtual machine. Just fire up VirtualBox (or one of the other virtual machine managers) and run an exact copy of the original study’s work environment.

But what is the platform required to run today’s virtual machines? It’s VirtualBox, or one of its peers. Note however that it’s not “any of today’s virtual machine managers” because compatibility between their virtual machine formats is not perfect. It may work, or it may not. For simplicity I will use VirtualBox in the following, but you can substitute another name and the basic arguments still hold.

VirtualBox is a highly non-trivial piece of software, and it has very stringent hardware requirements. Those hardware requirements are met by the vast majority of today’s computing equipment used in computational science, but the x86 platform is losing market share rapidly on the wider computing device market. VirtualBox doesn’t run on an iPad, for example, and it probably never will. Is VirtualBox likely to be around in 20 years? I won’t venture a prediction. If x86 survives for another 20 years AND if Oracle sees a continuing interest in this product, then it will. I won’t bet on it though.

What we really need for long-term recomputability is a simple platform. A platform that is simple enough that the scientific community alone can afford to keep it alive for its own needs, even if no one else in the world cares about it.

Unfortunately there is no suitable platform today, to the best of my knowledge. Which is why virtual machines are perhaps the best option right now, for lack of a satisfactory one. But if we care about recomputability, we should design and develop a good supporting platform, starting as soon as possible.

For a more detailed discussion of this issue, see this paper written by yours truly. It comes to the conclusion that the closest existing approximation to a good platform is the Java virtual machine. What we’d want ideally is something similar to the JVM, but designed and optimized for scientific applications. A basic JVM implementation is quite simple (the complex JIT stuff is not a requirement), a few orders of magnitude simpler than VirtualBox, and it has no specific hardware dependencies. It’s even simpler than many of today’s scientific software packages, so the scientific community can definitely afford to keep it alive. The tough part is… no, it’s not designing or writing the required software, it’s agreeing on a specification. Perhaps it will never happen. Perhaps virtual machines will remain the best choice for lack of a satisfactory one. Or perhaps we will end up compiling our software to asm.js and running it in the browser, just because someone else will keep that platform alive for us, no matter how ill-adapted it is to our needs. But don’t say you haven’t been warned.
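To give a feeling for how little code the core of a software-defined platform needs, here is a toy stack-machine interpreter. It is a sketch for illustration only, not a JVM and not a proposed specification: the instruction names (`push`, `add`, `sub`, `mul`, `dup`) and the `run` function are invented for this example. What it shows is the fetch-decode-execute loop that a simple interpreter without a JIT is built around.

```python
from typing import List, Sequence, Tuple, Union

Number = Union[int, float]
Instruction = Tuple  # (opcode, *operands); a made-up format for this sketch


def run(program: Sequence[Instruction]) -> List[Number]:
    """Execute a program on a tiny stack machine and return the final stack."""
    stack: List[Number] = []
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        opcode, *operands = program[pc]
        pc += 1
        if opcode == "push":
            stack.append(operands[0])
        elif opcode == "dup":
            stack.append(stack[-1])
        elif opcode in ("add", "sub", "mul"):
            right = stack.pop()
            left = stack.pop()
            if opcode == "add":
                stack.append(left + right)
            elif opcode == "sub":
                stack.append(left - right)
            else:
                stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")
    return stack


if __name__ == "__main__":
    # Computes (2 + 3) * 4.
    program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
    print(run(program))
```

A real platform would of course need control flow, memory, and I/O, but each of those is another handful of cases in the same loop, with nothing that depends on particular hardware.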


6 Comments on “Platforms for reproducible research”


  1. Something like http://Docker.io is an interesting model too. It doesn’t solve all the computational architecture problems, but it allows one to containerize a lump of code and all its dependencies for a specific architecture.

  2. Ian Gent Says:

    Thanks very much for this post and the kind comments on my manifesto. Sorry I only just found your blog post today.

    As you said we are thinking on similar time frames and you are quite right that there are no guarantees that the VBox platform (for example) will be around in 10 years.

    My own thinking – but it could be wrong – is that with sufficient resources, one could either port experiments to another platform if the older one became unavailable, or maintain a given platform specifically for the point of recomputation. But absolutely the longer term issues like this are ones I haven’t thought too much about.

    I would make one point about the benefits of the VM approach. This is that (under wide though not universal assumptions) we can store any experiment somebody does. Because they can (in the worst case) give us a copy of their physical machine in virtual form. So there are minimal assumptions or requirements about the platform they use.

    So to summarise, we might disagree but it’s on means, not ends. And if you are right and I’m wrong, and that results in more experiments being recomputable, that’s a great result.

    Ian Gent

    • khinsen Says:

      Thanks for your comments! It’s certainly envisageable to turn the “VirtualBox” platform into something more portable that could be maintained just for the sake of recomputability, but I don’t see it happening. The fundamental problem is that VirtualBox emulates arcane details of PC hardware in order to work with existing operating systems, and then the installed operating system recognizes that arcane hardware and installs drivers etc. that rely on it. That means we have to emulate the same arcane hardware 20 years from now just to be able to boot the old virtual system.

      One possible approach would be to adopt a better-defined virtual machine level. Docker has been cited by someone, and would be a good candidate, but it’s a very recent tool that may or may not survive. JPC (http://jpc.sourceforge.net/home_home.html) could be used as a virtual machine platform defined 100% in software, but I already hear the cries about insufficient performance.

      The main point of my post is that we need to think about what the required platform is for each variety of canned computation. It should be a conscious choice and not whatever represents the least immediate effort.

      • Ian Gent Says:

        Maybe counterintuitively, 20 year old hardware may be less of a problem, because you can throw cpu and memory at it. This is clearly seen in gaming, where pretty much any old game system can be emulated, and is, because people want to play games. They do indeed emulate old arcane hardware purely to run these old programmes. On the other hand the only one of XBox 360/PS3/Wii which could play the previous generation was Wii, and only because the hardware was so similar.

        That’s a side note really though. I completely agree with your last paragraph, even though I might not be making the choice consciously enough.


  3. […] I have pointed out in an earlier post, long-term reproducibility in computational science will become possible only if the community […]

