The state of NumPy

The release of NumPy 1.9 a few days ago was a bit of a revelation for me. For the first time in the combined history of NumPy and its predecessor Numeric, a new release broke my own code so severely that I don’t see any obvious way to fix it, given the limited means I can dedicate to software maintenance. And that makes me wonder for which scientific uses today’s Python ecosystem can still be recommended, since the lack of means for code maintenance is a chronic and endemic problem in science.

I’ll start with a historical review, for which I am particularly well placed as one of the old-timers in the community: I was a founding member of the Matrix-SIG, a small group of scientists who in 1995 set out to use the still young Python language for computational science, starting with the design and implementation of a module called Numeric. Back then, Python was a minority language in a field dominated by Fortran. The number of users started to grow seriously from 2000 on, to the point that there is now a well-recognized and respected community that spans all domains of scientific research and holds several conferences per year across the globe. The combination of technological change and the needs of new users has caused regular changes in the code base, which has grown as significantly as the user base: the first releases were small packages written and maintained by a single person (Jim Hugunin, who later became famous for Jython and IronPython), whereas today’s NumPy is a complex beast maintained by a team.

My oldest published Python packages, ScientificPython and MMTK, go back to 1997 and are still widely used. They underwent a single major code reorganization, from module collections to packages, when Python 1.5 introduced the package system. Other than that, most of the changes to the code base were implementations of new features and the inevitable bug fixes. The two main dependencies of my code, NumPy and Python itself, did sometimes introduce incompatible changes (by design or as consequences of bug fixes) that required changes to my own code base, but they were surprisingly minor and never required more than about a day of work.

However, I now realize that I have simply been lucky. While Python and its standard library have indeed been very stable (not counting the transition to Python 3), NumPy has introduced incompatible changes with almost every new version over the last years. None of them ever touched functionality that I was using, so I barely noticed them when looking at each new version’s release notes. That changed with release 1.9, which removes the compatibility layer with the old Numeric package, on which all of my code relies because of its early origins.

Backwards-incompatible changes are of course nothing exceptional in the computing world. User needs change, new ideas permit improvements, but existing APIs often prevent a clean or efficient implementation of new features or fundamental code redesigns. This is particularly true for APIs that are not the result of careful design but of organic growth, which is the case for almost all scientific software. As a result, there is always a tension between improving a piece of software and keeping it compatible with code that depends on it. Several strategies have emerged to deal with this tension, depending on the priorities of each community. The point I want to make in this post is that NumPy has made a bad choice, for several reasons.

The NumPy attitude can be summarized as “introduce incompatible changes slowly but continuously”. Every change goes through several stages. First, the intention of an upcoming change is announced. Next, deprecation warnings are added in the code, which are printed when code relying on the soon-to-disappear feature is executed. Finally, the change becomes effective. Sometimes changes are made in several steps to ease the transition. A good example from the 1.9 release notes is this:

    In NumPy 1.8, the diagonal and diag functions returned readonly copies, in NumPy 1.9 they return readonly views, and in 1.10 they will return writeable views.

The idea behind this approach to change is that client code that depends on NumPy is expected to be adapted continuously. The early warnings and the slow but regular rhythm of change help developers of client code to keep up with NumPy.

The main problem with this attitude is that it works only under the assumption that client code is actively maintained. In scientific computing, that’s not a reasonable assumption to make. Anyone who has followed the discussions about the scientific software crisis and the lack of reproducibility in computational science should be well aware of this frequently made point. Much if not most scientific code is written by individuals or small teams for a specific study and then modified only as much as strictly required. One step up on the maintenance ladder, there is scientific code that is published and maintained by computational scientists as a side activity, without any significant means attributed to software development, usually because the work is not sufficiently valued by funding agencies. This is the category that my own libraries belong to. Of course the most visible software packages are those that are actively maintained by a sufficiently strong community, but I doubt they are representative of computational science as a whole.

A secondary problem with the “slow continuous change” philosophy is that client code becomes hard to read and understand. If you get a Python script, say as a reviewer for a submitted article, and see “import numpy”, you don’t know which version of NumPy the authors had in mind. If that script calls array.diagonal() and modifies the return value, does it expect to modify a copy or a view? The result is very different, but there is no way to tell. It is possible, even quite probable, that the code would execute fine with both NumPy 1.8 and the upcoming NumPy 1.10, but yield different results.
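The ambiguity is easy to make concrete. Here is a minimal sketch; which branch it takes, and what a write to the result would mean, depends entirely on the NumPy version it happens to run under:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)
d = a.diagonal()  # read-only copy in 1.8, read-only view in 1.9,
                  # writeable view planned for 1.10

try:
    d[0] = 99                    # what this line "means" depends on the version:
    print("modified:", a[0, 0])  # writeable view: the original array changes
except ValueError:
    print("read-only result")    # 1.8 / 1.9: the assignment is rejected
```

Nothing in the script itself records which of these behaviours the author intended.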

Given the importance of NumPy in the scientific Python ecosystem (the majority of scientific libraries and applications depend on it), I consider its lack of stability alarming. I would much prefer the NumPy developers to adopt the attitude to change taken by the Python language itself: accumulate ideas for incompatible changes, and apply them in a new version that is clearly labelled and announced as incompatible. Everyone in the Python community knows that there are important differences between Python 2 and Python 3. There’s a good chance that a scientist publishing a Python script will clearly say if it’s for Python 2 or Python 3, but even if not, the answer is often evident from looking at the code, because at least some of the many differences will be visible.
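As an illustration of how such differences are visible in the source itself, consider two of the most common markers, the print syntax and the division semantics:

```python
# Python 2 code typically contains lines like:
#     print "half:", 1 / 2      # a print *statement*; 1 / 2 is 0 (integer division)
# The Python 3 equivalent is recognizable at a glance:
print("half:", 1 / 2)           # a print *function*; 1 / 2 is 0.5 (true division)
```

A reviewer who sees either form knows immediately which language version the author targeted, without any external metadata.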

As for my initial question for which scientific uses today’s Python ecosystem can still be recommended, I hesitate to provide an answer. Today’s scientific Python ecosystem is not stable enough for use in small-scale science, in my opinion, although it remains an excellent choice for big communities that can somehow find the resources to maintain their code. What makes me hesitate to recommend not using Python is that there is no better alternative. Fortran is the only widely used scientific programming language that can be considered stable, but anyone who has used Python is unlikely to be willing to switch to an environment with tedious edit-compile-run cycles.

One possible solution would be a long-term-support version of the core libraries of the Python ecosystem, maintained without any functional change by a separate development team. But that development team has to be created and funded. Any volunteers?


19 Comments on “The state of NumPy”

  1. It’s not perfect since it doesn’t account for system-level dependencies, but you can certainly publish a list of library version requirements along with your code, in the format used by pip (e.g. numpy>=1.6 in a requirements.txt file). That makes it relatively easy to recreate the environment with the versions of software the author intended to use, without any guesswork.
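     For example, a requirements.txt with exact pins (the version numbers here are hypothetical) might look like this:

     ```
     # requirements.txt - versions the analysis was actually run with
     numpy==1.8.2
     scipy==0.14.0
     ```

     Recreating the environment is then a single command: `pip install -r requirements.txt`.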

    • That’s a short-term workaround, but not a good long-term solution. Many users (as opposed to developers) don’t have the competence to juggle different Python installations. And what do you do if you use two libraries that each insist on a different version of NumPy?

      • “And what do you do if you use two libraries which each insist on a different version of numpy”

        This is exactly what virtualenvs solve and why they’re such a good idea. They allow you to easily isolate your Python environments for different programs. So two programs, ProgA & ProgB, living on the same system can easily specify `numpy>=1.6, libfoo==1.2.3, libbar==4.5.6` and `numpy==1.9.2, libfoo==2.4.6, libbar==5.6.7` respectively.
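        A sketch of that setup (libfoo and libbar are the hypothetical libraries from the specs above; the commands assume virtualenv and pip are installed):

        ```
        # One isolated environment per program
        virtualenv env-a
        env-a/bin/pip install 'numpy>=1.6' 'libfoo==1.2.3' 'libbar==4.5.6'

        virtualenv env-b
        env-b/bin/pip install 'numpy==1.9.2' 'libfoo==2.4.6' 'libbar==5.6.7'

        # Each program then runs against its own interpreter and libraries
        env-a/bin/python prog_a.py
        env-b/bin/python prog_b.py
        ```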

  2. cournape Says:

    Virtualenvs do not solve this issue at all, as you need to use both libraries in the same program. This problem becomes more common as incompatibilities spread.

  3. Randle, the original comment said “libraries”, not “programs”. Does virtualenv really solve this problem?
    (While I generally advocate virtualenv, I don’t think it does in this case.)

    This is why I advocate working on distributions (I’m heavily involved in Debian, on climate and meteorology codes). Distros basically work on the task of systems integration, and the bulk of the work is exactly this: porting (often otherwise unmaintained) codes and libraries to new versions of their dependencies and integrating them, e.g. the current work to make sure all Python codes work on Python 3.

    I’d advocate a policy of “we used Debian 7 + library foo == 1.2 in a virtualenv”: that is, state the base, then state the variations where you moved beyond it, which will typically be only a handful of libraries.

    I work in a supercomputing centre, and we have “custom” NumPy and SciPy versions: that is, builds with Intel C/Fortran compilers optimised for our specific hardware (including Intel MKL for BLAS, LAPACK, etc.), which are typically 10-20x faster. So we tell users to (1) use our NumPy + SciPy builds, then (2) use a virtualenv for pure-Python libraries. Similar considerations apply to distros, e.g. Debian using ATLAS vs. plain BLAS/LAPACK, etc.

    • I agree with all of that. Also, let’s not forget that virtualenv is a tool for developers and power users. Most “standard users” (in computational biology, that means people with zero programming experience, not even shell scripts) just use whatever their Linux distro provides them with. Unfortunately Linux distros vary widely in their approach to the stable vs. recent version compromise, which is why packages aimed at non-programmers need to work with a wide variety of versions of Python and the other fundamental stuff, including NumPy.

    • I was coming from the perspective of “We currently have a working program which relies on this set of libraries; how do we ensure we can distribute it and people can run it as-is in the future?” For this problem virtualenv works (ignoring Konrad’s very valid concern that many users don’t have the ability to set up virtualenvs on their own).

      If you’re talking about writing new software where you need to use library A & B which each depend on a different version of library C then, no, virtualenv can not solve that. That’s not a problem unique to Python or science though.

  4. mangecoeur Says:

    Incompatible library versions have always been a problem for all software developers; that’s why things like pip allow you to pin to specific versions. Your packages should specify which NumPy versions they run with. I don’t think the NumPy approach is unreasonable in this context. Personally, I think the “break a few things slowly” approach is actually a better way to make sure people keep upgrading the library, rather than refusing to upgrade to a new version with big changes and instead sticking with an old version which then needs to be maintained indefinitely. You may grumble, but see it from their end: they are mostly volunteers maintaining scientific software, and keeping that functionality forces them to put effort into maintaining decades-old code that ultimately relatively few people use. I think it’s fair for them to decide at some point that it’s up to people using the library to make the changes or to stay with old versions.

    In any case, you already pointed out the real problem: the lack of reproducibility in scientific code. This is because code is generated ad hoc. Attitudes need to change a lot here, and people need to realise that having opaque code to generate results is just as bad as not documenting your experimental procedure. If scientists learn basic software project management skills, that fends off a lot of problems like the one you describe.

    • I won’t defend bad software practices by scientists, they do need to change. But it is definitely not reasonable to insist that all published code be maintained forever to adapt to changing computing environments. Maintenance is possible only for code that is in continuous use, i.e. libraries and tools of general interest.

  5. Jeff Says:

    If your code requires a specific version of NumPy, then make sure you install that version. Simple as that.

    If your concern is with the end user being competent enough to also install the correct version, then it takes a bit more effort on your part, but this is completely achievable. Want an exact replica of your environment? Use Vagrant, or Docker, or snapshot an EC2 instance.

    Instead of telling your users to “go install numpy and then run from the CLI”, tell them to:
    go install Vagrant and virtualbox…
    vagrant up
    vagrant ssh


    Reproducible environments are something fundamental to science. It’s not numpy’s fault that you don’t have one.

    • I understand your arguments from the point of view of a power user, but you and I are a minority in computational science. My #1 technical support issue is people who can’t get netCDF installed. How would they deal with VirtualBox and Vagrant? Those are tools for the early adopters of reproducible research. They are not ready for “the masses”.

      BTW, I don’t know of any journal accepting a VirtualBox image as supplementary material to a paper. Most do accept Python scripts.

      I think we both agree that the quality standards in scientific computing need to improve, but that won’t happen overnight. In the meantime, an unstable computational infrastructure is a problem.

      • Some computer-science conference proceedings (which are the primary way of getting credit in that discipline, rather than journal articles) now accept “computational artifacts” alongside the papers themselves. These artifacts are typically VirtualBox VMs. See for example recent instances of the top programming languages conferences, POPL and PLDI. I’m certainly not aware of anyone doing this outside of CS though.

      • Interesting. Who hosts the VM images? The conference organizers? There seem to be few data hosting services that accept the typically large VM images. Figshare’s 250 MB limit is too small, and most journals’ supplementary material is subject to similar restrictions. Zenodo’s 2 GB limit should allow depositing VMs that don’t need bloated GUI libraries.

      • I’m not clear on who hosts the images. Here are the submission instructions for a recent conference, which describe what to submit and the peer-review process, but not what happens with the accepted artifacts:

  6. I really love Python for my scientific computing. But I’m leery of keeping long-term programs in Python, for the exact reasons you point out. The probability of breakage scales with the number of ‘import’ statements. To me, Fortran is the best bet. With backwards compatibility all the way back to F77, I feel safe in the investment. And given how much code depends upon widely used libraries, backwards incompatibility is unlikely to be introduced. But I have limited background. Perhaps it’s not so cut and dried once you include many numerical libraries.

  7. We have built Numpy for the JVM: It’s got a versatile n-dimensional array object, integrates with Hadoop and GPUs, obviously works with Java, Clojure and Scala, the latter being an exceptional language for scientific computing.

    • That looks interesting. From a Python point of view, this could be the basis for implementing something NumPy-compatible for Jython. I see you use jblas and CUDA; what’s your experience with the performance issues of the JNI that are often cited as an argument against using the JVM for scientific applications?

  8. FYI, there’s some discussion of this post on the numpy list just now (since it was pointed out to the numpy devs :-)):

  9. […] is not really an answer, but this blog discusses in length the problems of having a numpy ecosystem that evolves fast, at the expense of […]
