Reproducibility, replicability, and the two layers of computational science

The importance of reproducibility in computational science is being more and more recognized, which I think is a good sign. However, I also notice a lot of confusion about what reproducibility means exactly, and also confusion about the difference (if any) between reproducibility and replicability. I don’t see a consensus yet about the exact meaning of these terms, but I would like to give my own definitions and justify them by putting them into the general context of computational science.

I’ll start with the concept of reproducibility as it was used in science long before computers even existed. It refers to the reproducibility of the conclusions of a scientific study. These conclusions can take very different forms depending on the question that was being explored. It can be a simple “yes” or “no”, e.g. in answering questions such as “Is the gravitational force acting on this stone the same everywhere on the Earth’s surface?” or “Does ligand A bind more strongly to protein X than ligand B?” It can also be a number, as in “What is the lattice energy of NaCl?”, or a mathematical function, as in “How does a spring’s restoring force vary with elongation?” Any such result should come with an estimate of its precision, such as an error bar on numbers, or a reliability estimate for a yes/no answer. Reproducing a scientific conclusion means finding a “close enough” answer by performing “similar” experiments and analyses. As the terms “close enough” and “similar” show, reproducibility involves human judgement, which may well evolve over time. Reproducibility is thus not an absolute feature of a specific result, but the evaluation of a result in the context of the current state of knowledge and technology in a scientific domain. Every attempt to reproduce a given result independently (different people, tools, methods, …) augments scientific knowledge: if the reproduction leads to a “close enough” result, it provides information about the precision with which the result can be obtained, and if it doesn’t, it points to some previously unrecognized crucial difference between the two experiments, which can then be explored.
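To make “close enough” concrete, here is a minimal sketch of one possible convention: two results agree if they differ by no more than k combined error bars. The function name `agree_within`, the threshold `k`, and the numbers are all assumptions made for illustration. Choosing `k`, and deciding whether the error bars themselves can be trusted, is exactly the human judgement described above.

```python
import math

def agree_within(value_a, sigma_a, value_b, sigma_b, k=2.0):
    """Return True if two results differ by at most k combined error bars.

    The combined error bar is the root sum of squares of the two
    individual ones, which assumes independent errors.
    """
    combined = math.hypot(sigma_a, sigma_b)
    return abs(value_a - value_b) <= k * combined

# Hypothetical results for the same quantity from two independent studies
print(agree_within(9.81, 0.02, 9.79, 0.03))  # difference well inside the error bars
print(agree_within(9.81, 0.02, 9.60, 0.03))  # difference far outside the error bars
```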

Replication refers to something much more specific: repeating the exact steps in an experiment using the same (or equivalent) equipment, and comparing the outcomes. Replication is part of testing an experimental setup, or a form of quality assurance. If I measure the same quantity ten times using the same equipment and experimental samples, and get ten slightly different values, then I can use these numbers to estimate the precision of my equipment. If that precision is not sufficient for the purposes of my planned scientific study, then the equipment is not suitable.
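The ten-measurement example can be sketched in a few lines of Python. The readings and the required precision below are made-up numbers, and using the sample standard deviation as the measure of precision is one common choice rather than the only one.

```python
import statistics

# Ten hypothetical readings of the same quantity (made-up numbers)
readings = [10.02, 9.98, 10.01, 9.97, 10.03, 10.00, 9.99, 10.02, 9.98, 10.00]

mean = statistics.mean(readings)
spread = statistics.stdev(readings)         # sample standard deviation of one reading
std_error = spread / len(readings) ** 0.5   # standard error of the mean

# Hypothetical precision required by the planned study
required_precision = 0.05

print(f"mean = {mean:.3f}, spread = {spread:.3f}, standard error = {std_error:.4f}")
print("equipment suitable:", spread <= required_precision)
```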

It is useful to describe the process of doing research by a two-layer model. The fundamental layer is the technology layer: equipment and procedures that are well understood and whose precision is known from many replication attempts. On top of this, there is the research layer: the well-understood equipment is used in order to obtain new scientific information and draw conclusions from it. Any scientific project aims at improving one or the other layer, but not both at the same time. When you want to get new scientific knowledge, you use trusted equipment and procedures. When you want to improve the equipment or the procedures, you do so by doing test measurements on well-known systems. Reproducibility is a concept of the research layer, replicability belongs to the technology layer.

All this carries over identically to computational science, in principle. There is the technology layer, consisting of computers and the software that runs on them, and the research layer, which uses this technology to explore theoretical models or to interpret experimental data. Replicability belongs to the technology layer. It increases trust in a computation and thus its components (hardware, software, overall workflow, provenance tracking, …). If a computation cannot be replicated, then this points to some kind of problem:

  1. different input data that was not recorded in the workflow (interactive user input, a random number stream initialized from the current time, …)
  2. a bug in the software (uninitialized variables, compiler bugs, …)
  3. a fault in the hardware (an unreliable memory chip, a design flaw in the processor, …)
  4. an ambiguous specification of the result of the computation
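Case 1 is the easiest to illustrate. The sketch below uses a hypothetical `simulate` function standing in for a real computation: with a seed taken from the clock and never written down, the run cannot be repeated, whereas a recorded seed makes it replicable.

```python
import random
import time

def simulate(seed, n=5):
    """Stand-in for a stochastic computation (hypothetical example)."""
    rng = random.Random(seed)   # private generator, independent of global state
    return [rng.random() for _ in range(n)]

# Not replicable unless the seed is recorded: it depends on when the run happened.
clock_seed = time.time_ns()
print("clock-seeded run:", simulate(clock_seed))

# Replicable: the seed is part of the recorded input.
recorded_seed = 12345
run_1 = simulate(recorded_seed)
run_2 = simulate(recorded_seed)
print("recorded-seed runs identical:", run_1 == run_2)
```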

Ideally, the non-replicability should be eliminated, but at the very least its cause should be understood. This turns out to be very difficult in practice, in today’s computing environments, essentially because case 4 is frequent and hard to avoid (today’s popular programming languages are ambiguous), and because case 4 makes it impossible to identify cases 2 and 3 with certainty. I see this as a symptom of the immaturity of today’s computing environments, which the computational science community should aim to improve on. The technology for removing case 4 exists. The keyword is “formal methods”, and there are first attempts to apply them to scientific computing, but this remains an exotic approach for now.
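One commonly cited illustration of case 4 is floating-point summation. Whether this is the kind of ambiguity meant here is an assumption on my part. If a language or compiler is free to choose the order in which terms are added, the same source code can legitimately produce different results. Python itself fixes the evaluation order, so the sketch below makes the two orders explicit.

```python
import math

values = [1e16, 1.0, -1e16]

# Same three numbers, two different evaluation orders
left_to_right = (values[0] + values[1]) + values[2]   # the 1.0 is lost to rounding
reordered = (values[0] + values[2]) + values[1]       # the large terms cancel first

print(left_to_right, reordered)

# math.fsum computes the correctly rounded sum, independent of the order
print(math.fsum(values))
```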

As in experimental science, reproducibility belongs to the research layer and cannot be guaranteed or verified by any technology. In fact, the “reproducible research” movement is really about replicability – which is perhaps one reason for the above-mentioned confusion.

There is at the moment significant disagreement about the importance of replicability. At one end of the spectrum, there is for example Ian Gent’s recomputation manifesto, which stresses the importance of replicability (which in the context of computational science he calls recomputability) because building on past work is possible only if it can be replicated as a first step. At the other end, Chris Drummond argues that replicability is “not worth having” because it doesn’t contribute much to the real goal, which is reproducibility. It is worth reading both of these papers, because they both do a very good job of explaining their arguments. There is actually no contradiction between the two lines of argument; the different conclusions are due to different criteria being applied: Chris Drummond sees replicability as valuable only if it improves reproducibility (which indeed it doesn’t), whereas Ian Gent sees value in it for a completely different reason: it makes future research more efficient. Neither one mentions the main point in favor of replicability that I have made above: that replicability is a form of quality assurance and thus increases trust in published results.

It is probably a coincidence that both of the papers cited above use the term “computational experiment”, which I think should best be avoided in this context. In the natural sciences, the term “experiment” traditionally refers to constructing a setup to observe nature, which makes experiments the ultimate source of truth in science. Computations do not have this status at all: they are applications of theoretical models, which are always imperfect. In fact, there is an interesting duality between the two: experiments are imperfect observations of the ultimate truth, whereas computations are, in the absence of buggy or ambiguous software, perfect observations of the consequences of imperfect models. Using the same term for these two concepts is a source of confusion, as I have pointed out earlier.

This fundamental difference between experiments and computations also means that replicability has a different status in experimental and computational science. When doing imperfect observations of nature, evaluating replicability is one aspect of evaluating the imperfection of the observation. Perfect observation is impossible, both due to technological limitations and for fundamental reasons (any observation modifies what is being observed). On the other hand, when computing the consequences of imperfect models, replicability does not measure the imperfections of the model, but the imperfections of the computation, which can theoretically be eliminated.

The main source of imperfections in computations is the complexity of computer software (considering the whole software stack, from the operating system to the scientific software). At this time, it is not clear if we will ever succeed in taming this complexity. Our current digital computers are chaotic systems, in which even the tiniest change (flipping a bit in memory, or replacing a single character in a program source code file) can change the result of a computation beyond any bounds. Chaotic behavior is clearly an undesirable feature in any scientific equipment (I can’t think of any experimental apparatus suffering from it), but for computation we currently have no other choice. This makes quality assurance techniques, including replicability but also more standard software engineering practices such as unit testing, all the more important if we want computational results to be trustworthy.
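The effect of a single flipped bit is easy to demonstrate on one floating-point number. This is a minimal sketch, assuming the IEEE 754 double-precision format that Python uses for `float` on common platforms. The helper `flip_bit` is written for this illustration only.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit (0 = least significant) in the 64-bit representation of x."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

x = 1.0
print(flip_bit(x, 0))    # lowest mantissa bit: the smallest possible change
print(flip_bit(x, 62))   # highest exponent bit: the value becomes infinite
print(flip_bit(x, 63))   # sign bit: the value changes sign
```

Depending on which bit is hit, the same kind of fault either changes the result by one part in 2^52 or destroys it entirely.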


8 Comments on “Reproducibility, replicability, and the two layers of computational science”

  1. khinsen Says:

    Shauna Gordon-McKeon mentioned a related blog post of hers on Twitter. It’s about replication in experimental sciences, and most of all it nicely illustrates the confusion about the terminology used in discussing the issue of replication in science.

    My definition of replication in this post most closely corresponds to Shauna’s “literal replication”. She also defines “direct replication” as “duplication of those methods deemed to be relevant.” (please read the whole post for the precise context of these terms). From my point of view, which places replication in science’s technology layer as a means of validation, this “direct replication” is not replication at all. If you knowingly and intentionally change something in a published method, you are testing your hypothesis that these changes are indeed not relevant. That’s clearly in the research layer, so you shouldn’t mix it with a replication attempt.

    There is of course the very relevant problem, which Shauna also describes nicely, that literal replication is an ideal that can rarely be realized completely. In computational science, you might not have access to the exact same computer, or to the exact same version of some piece of software. In that case, pragmatism suggests that you do a best effort to match the original specification. If that best-effort replication leads to substantially different results, then the only way to figure out what happened is to collaborate with the original authors on sorting out the differences.

    There is a considerable grey zone between Shauna’s “literal” and “direct” replication, considering that any attempt at literal replication is imperfect. The main difference is intention: the “direct” replicator doesn’t even want to use the exact same methodology, the “literal” one wants to but cannot. An important in-between situation is an imperfection in literal replication that is accepted as an economic compromise: re-doing everything exactly is possible but difficult or costly, whereas introducing small changes considered irrelevant is easier and/or cheaper. I am not even sure that this distinction is ultimately that important: what matters is that any replication attempt clearly states the known differences, with a discussion of why they are considered irrelevant.

    • So is the ideal version of your “replication” (my “literal replication”) done with “access to the exact same computer, or to the exact same version of some piece of software”?

      The needs and motivation may vary in computational science, but I question the value of a perfect literal replication. There are a number of different things that can cause inaccuracy in results, including:

      A) fraud
      B) statistical chance
      C) one off/unrepeatable technical errors
      D) repeatable technical or methodological errors (such as a bug in your code, or a subtle temperature difference in the lab, etc)
      E) conceptual errors (such as hidden confounds)

      A perfect replication (or “literal replication”) can reveal inaccuracies with A, B and C, but misses D and E, which I think cause far more problems than the first three. A literal replication where the methods are re-done by the replicators should capture D as well. A direct replication can detect all five sources. (Figuring out which is the cause of a discrepancy is a different story!)

      I don’t think we should be aiming for perfectly exact/literal replications, though that’s driven by my intuition about rates of inaccuracies due to D vs A, B and C, and my intuition may be wrong.

      • khinsen Says:

        To answer the first question: yes, ideally the whole infrastructure is the same. Today’s computer hardware is uniform enough that “the same machine” is hardly ever a requirement, but the exact same version of a piece of software is sometimes required. That’s why many computational scientists advocate the conservation of computations as virtual machines, which package everything but the hardware.

        Note also that I do not advocate literal replication, but literal replicability. It’s the scientists doing an original study who should verify that their work can be replicated, as part of quality assurance. It’s their responsibility to keep a sufficiently detailed log of what they did that someone else with the same equipment can re-do all the steps. The easiest way to check this is to ask someone else in the team to re-run everything.

        In your list of problems, the main point that replicability protects against is C. In computational science, point C takes the form of mistyping a command, or clicking the wrong button. In a computational study that involves running tens to hundreds of computations, that’s the #1 source of trouble. It’s actually still quite common for computational scientists not to keep any activity log that deserves the name. They don’t even know themselves what exactly they did to get the results that they find in their files. It was this situation that prompted Jon Claerbout to start the reproducible research movement.

        As for your points D and E, they are obviously important as well, and must be checked for. But you cannot even envisage this unless you have some confidence that your results aren’t complete bogus because of a point C problem. You can’t check a program for bugs unless you are sure which program you actually ran. Perhaps this sounds trivial, but I am sure I am not the only computational scientist who regularly types “which python” into a terminal to make sure that I am actually using the right Python installation out of the five that I have on my computer.
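The “which python” check can be made part of the computation itself. The following is a minimal sketch of a hypothetical `environment_record` helper that uses only the standard library. A real activity log would also record input files, package versions, and the commands that were run.

```python
import json
import platform
import sys
from datetime import datetime, timezone

def environment_record():
    """Collect basic facts about the interpreter that is actually running."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "executable": sys.executable,               # the answer to "which python"
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "argv": sys.argv,                           # how the script was invoked
    }

# Store this next to the results so they can later be traced to their origin.
print(json.dumps(environment_record(), indent=2))
```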

        Literal replication of somebody else’s published work is indeed of little practical interest, except when a reproduction attempt (using different tools or methods) leads to different results. There is little chance of ever discovering the reason unless you can first replicate the original work exactly.

  2. It seems like this is a difference between computational science and the experimental sciences I’m more familiar with, although I’m probably underestimating the impact of C. This just points to the need for different approaches for different subfields – and, as we’ve both said, the importance of clarity and precision in terms.

    “The easiest way to check this is to ask someone else in the team to re-run everything.”

    I’d recommend getting undergraduates who are new to the lab (I’m assuming you’re in academia) to participate in re-running the analysis. Their lack of familiarity can help pinpoint all the implicit/cultural/simple stuff that needs to be spelled out for others to reproduce. And of course it’s a great learning experience for them.

    “There is little chance of ever discovering the reason unless you can first replicate the original work exactly.”

    I wish we lived in a world where multiple kinds of replications were often done! Unfortunately in psychology, my field of origin, something like 1% of studies get replicated even once.

    • khinsen Says:

      I agree that the specific situation of each field needs to be taken into account. I have followed the debate about replication in psychology in the press, and I understand that it is difficult to get anything replicated in a field where replication of a study is an immense effort. Replication of most computational studies, if they are designed to be replicable, involves very little investment in human time and in most cases an amount of computer time that is not much of a cost factor. The cost of replication in computational science is paid up-front, by investing an additional effort in the initial design.

      • Yes, I can see how exact replication in these circumstances is low-investment and *should* be very commonplace. Do you know what percentage of work ends up being replicated in this way? I hope more than 1%.

        If you are interested in writing about this/reposting this at the Open Science Collaboration blog, please let me know – we’d be happy to signal boost.

      • khinsen Says:

        Unfortunately I have no idea how many published computations are replicated by others. Since that’s a low-effort step, I don’t expect ever to hear about replications unless they indicate that something is wrong. Note that at this time, the main limitation to replicating published computations is that very few computations actually get published in a replicable way.

        I am, however, rather sure that all practitioners of reproducible computational research replicate their own computations regularly before publishing them. They usually report that this is a big help for them in doing their work, and tend to conclude that adopting this practice is worth the additional initial effort for the personal benefit alone. I certainly arrive at this conclusion from my own experience. I find it rather unpleasant to go back to the old way of doing things, which I find necessary when the software I need or my mode of collaboration doesn’t allow me to use my replicability toolbox.

  3. […] let me start by pointing out that within the systematic terminology that I am trying to adopt (see this post for an explanation), I will write “bitwise replicability” from now on, as the problem […]
