Python as a platform for reproducible research

The other day I was looking at the release notes for the recently published release 1.8 of NumPy, the library that is the basis for most of the Scientific Python ecosystem. As usual, it contains a list of new features and improvements, but also sections such as “dropped support” (for Python 2.4 and 2.5) and “future changes”, to be understood as “incompatible changes that you should start to prepare for”. Dropping support for old Python releases is understandable: maintaining compatibility and testing it is work that needs to be done by someone, and manpower is notoriously scarce for projects such as NumPy. Many of the announced changes are in the same category: they permit removing old code and thus reduce maintenance effort. Other announced changes have the goal of improving the API, and I suppose they were more controversial than the others, as it is rarely obvious that one API is better than another one.

From the point of view of reproducible research, all these changes are bad news. They mean that libraries and scripts that work today will fail to work with future NumPy releases, in ways that their users, who are usually not the authors, cannot easily understand or fix. Actively maintained libraries will of course be adapted to changes in NumPy, but much, perhaps most, scientific software is not actively maintained. A PhD student doing computational research might well publish their software along with the thesis, but then switch subjects, or leave research altogether, and never look at the old code again. There are also specialized libraries developed by small teams who don’t have the resources to do as much maintenance as they would like.

Of course NumPy is not the only source of instability in the Python platform. The most visible change in the Python ecosystem is the evolution of Python itself, whose 3.x series is not compatible with the initial Python language. It is difficult to say at this time how long Python 2.x will be maintained, but it is quite possible that much of today’s scientific software written in Python will become difficult to run ten years from now.

The problem of scientific publications becoming more and more difficult to use is not specific to computational science. A theoretical physicist trying to read Isaac Newton’s works would have a hard time, because the mathematical language of physics has changed considerably over time. Similarly, an experimentalist trying to reproduce Galileo Galilei’s experiments would find it hard to follow his descriptions. Neither is a problem in practice, because the insights obtained by Newton and Galilei have been reformulated many times since then and are available in today’s language in the form of textbooks. Reading the original works is required only for studying the history of science. However, it typically takes a few decades before specific results are universally recognized as important and enter the perpetually maintained canon of science.

The crucial difference with computations is that computing platforms evolve much faster than scientific research. Researchers in fields such as physics and chemistry routinely consult original research works that are up to thirty years old. But scientific software from thirty years ago is almost certainly unusable today without changes. The state of today’s software thirty years from now is likely to be worse, since software complexity has increased significantly. Thirty years ago, the only dependencies a scientific program would have were a compiler and perhaps one of a few widely known numerical libraries. Today, even a simple ten-line Python script has lots of dependencies, most of them indirectly through the Python interpreter.

One popular attitude is to say: Just run old Python packages with old versions of Python, NumPy, etc. This is an option as long as the versions you need are recent enough that they can still be built and installed on a modern computer system. And even then, the practical difficulties of working with parallel installations of multiple versions of several packages are considerable, in spite of tools designed to help with this task (have a look at EasyBuild, hashdist, conda, and Nix or its offshoot Guix).

An additional difficulty is that the installation instructions for a library or script at best mention a minimum version number for dependencies, but not the last version with which they were tested. There is a tacit assumption in the computing world that later versions of a package are compatible with earlier ones, although this is not true in practice, as the example of NumPy shows. The Python platform would be a nicer place if any backwards-incompatible change were accompanied by a change in package name. Dependencies would then be evident, and the different incompatible versions could easily be installed in parallel. Unfortunately this approach is rarely taken, a laudable exception being Pyro, whose latest incarnation is called Pyro4 to distinguish it from its not fully compatible predecessors.
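To make the difference between a minimum version and a last tested version concrete, here is a minimal sketch of a check that a script could run on its own dependencies. The function names, the three status strings, and the simplified version parsing are all assumptions made for this illustration; they are not part of NumPy or of any packaging tool.

```python
import re


def parse_version(text):
    """Turn '1.8' or '1.8.0rc1' into a tuple of three ints.

    Simplified on purpose: suffixes such as 'rc1' are ignored and
    missing components count as zero.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple((parts + [0, 0, 0])[:3])


def check_tested_range(installed, minimum, last_tested):
    """Classify an installed version against the range the author tested.

    Returns 'too old', 'untested' or 'ok' (hypothetical labels).
    """
    version = parse_version(installed)
    if version < parse_version(minimum):
        return "too old"
    if version > parse_version(last_tested):
        # Newer than anything the author ran: it may work, but nobody checked.
        return "untested"
    return "ok"
```

A script tested with NumPy 1.7 through 1.8.0 could call check_tested_range(numpy.__version__, "1.7", "1.8.0") at startup and print a warning for "untested" rather than failing later in some obscure way.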

I have been thinking a lot about this issue recently, because it directly impacts my ActivePapers project. ActivePapers solves the dependency versioning problem for all code that lives within the ActivePaper universe, by abandoning the notion of a single collection of “installed packages” and replacing it with explicit references to a specific published version. However, the problem persists for packages that cannot be moved inside the ActivePaper universe, typically because of extension modules written in a compiled language. The most fundamental dependencies of this kind are NumPy and h5py, which are guaranteed to be available in an ActivePapers installation. ActivePapers does record the version numbers of NumPy and h5py (and also HDF5) that were used for each individual computation, but it currently has no way to reproduce that exact environment at a later time. If anyone has a good idea for solving this problem, in a way that the average scientist can handle without becoming a professional systems administrator, please leave a comment!
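The recording half of the problem, at least, is easy to sketch. The following is not the ActivePapers implementation, only a minimal illustration using the standard library (importlib.metadata, so Python 3.8 or later); the function name record_environment and the layout of the resulting dictionary are assumptions.

```python
import json
import platform
from importlib import metadata


def record_environment(packages):
    """Collect the interpreter version, platform and package versions.

    Packages that are not installed are recorded as None rather than
    raising, so that the record itself can always be written.
    """
    record = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            record["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            record["packages"][name] = None
    return record


if __name__ == "__main__":
    # Store this next to the results of a computation.
    print(json.dumps(record_environment(["numpy", "h5py"]), indent=2))
```

The version of the HDF5 C library is not visible this way; h5py exposes it separately as h5py.version.hdf5_version. Such a record says what was used, but, as noted above, it does nothing to rebuild that environment later.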

As I have pointed out in an earlier post, long-term reproducibility in computational science will become possible only if the community adopts a stable code representation, which needs to be situated somewhere in between processor instruction sets and programming languages, since both ends of this spectrum are moving targets. In the meantime, we will have to live with workarounds.

4 Comments on “Python as a platform for reproducible research”

  1. Daniel Says:

    I don’t think a stable IL is enough for reproducible research.
    In the far future someone can run the IL, but he can neither modify it, nor can he take parts and use them in a new algorithm.
    It will be like having a DOS executable from the 80s. Good if it does exactly what you want, and otherwise completely useless.

    I think we will not only need an intermediate language representation (IL) that abstracts the machine code, but also something that abstracts the source code, like this:

    Machine Code — IL — Source — more abstract representation

    The source code is only slightly better than the IL; it is tied far too closely to the underlying architecture. Any advanced features of current programming languages (generators, monads etc.) will probably be close to incomprehensible to someone in the future, and impossible to express in a new language.

    This is a shame, because many problems have a deep structure that is lost forever when you represent them in a programming language.

    While IL is a stable code representation, the source abstraction should be a stable knowledge representation.

    Take for example linear algebra. This is a highly regular and simple field, but there is no way to express ideas in a language-independent fashion. For every new language that comes along, linear algebra has to be re-invented.

    The same goes for symbolic algebra. A tremendous amount of knowledge is locked into the Axiom/Scratchpad system, probably forever.

    What we need are approaches similar to libflame, that make it possible to move the structure of a problem to a new programming language.

    This will not work for every problem, but for many areas it has to be possible to find a representation that is isomorphic to the underlying mathematical structure, and well enough defined to be translated into the language du jour.

    What I would envision is a system that defines the semantics of the problem as far as possible (‘this is a matrix-matrix multiplication’, ‘this is a differential equation’, ‘this is an integral’) and in the worst case defaults to the status quo (‘this is just some procedural code you have to execute, don’t know what it means’)

    This representation (with the exception of the procedural code) can be much more stable, because it is not tied to a single programming language. If it is sufficiently well defined, one can have a minimal compiler that can translate to a stable IL or machine code. Hopefully the user in the future will not need to look at the procedural part.

    • khinsen Says:

      I think it’s useful to distinguish reproducibility (re-running the exact same computation) from reusability (running a modified computation), although the first is often done as a first step before moving on to the second. My proposition about a low-level IL covers only reproducibility, as you explained very well. I have no good idea to offer for ensuring reusability over long time periods. Programming languages are a rapidly moving target, with one cause being that today’s languages are still so far from what we can wish for. I like your ideas about a more universal notation for certain concepts, but nothing like that is within reach at this time.

  2. Daniel Says:

    Have you looked at nuitka?
    It compiles python to c++, including all the libraries.
    http://nuitka.net/
    If you compile the c++ without any machine specific optimizations, maybe to LLVM IR, it’s going to be runnable in the future.
    At least this takes care of bundling in all of the libraries.

    HLVM is at an early stage (and seems to be stalled without community involvement), but it would make for a good IL representation, a bit higher level than LLVM
    http://www.ffconsultancy.com/ocaml/hlvm/

    • khinsen Says:

      I doubt that anything involving C++ is a good choice for reproducibility, given that much C++ software comes with a (short) list of compatible compiler versions.

      LLVM IR could be a useful choice for long-time preservation of computations, except that the LLVM team very explicitly recommends against this (see the thread at http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/043720.html for example). The PNaCl team has adopted a variant of LLVM IR nevertheless, and plans to support it for an indefinite time. Time will tell if they will be able to keep their promise.

      LLVM IR is quite low-level, so it needs to be complemented by a non-trivial run-time library that needs to be maintained indefinitely as well.

      HLVM looks like a good idea but it is definitely not ready for use in real life.

