The future of the Scientific Python ecosystem

SciPy 2015 is over, meaning that many non-participants like myself are now busy catching up with what happened by watching the videos. Today’s dose for me was Jake VanderPlas’ keynote entitled “State of the Tools”. It’s about the history, current state, and potential future of what is now generally known as the Scientific Python ecosystem: the large number of libraries and tools written in or for Python that scientists from many disciplines use to get their day-to-day computational work done.

History is done, the present status is a fact, but the future is open to both speculation and planning, so that’s what I find most interesting in Jake’s keynote. What struck me is that everything he discussed was about paying back technical debt: refactoring the core libraries, fixing compatibility problems, removing technical obstacles to installation and use of various tools. In fact, 20 years after Python showed up in scientific computing, the ecosystem is in a state that is typical for software projects of that age: a bit of a mess. The future work outlined by Jake would help to make it less of a mess, and I hope that something like this will actually happen. The big question mark for me is how this can be funded, given that it is “only” maintenance work, producing nothing fundamentally new. Fortunately there are people much better than me at thinking about funding, for example everyone involved in the NumFOCUS foundation.

Jake’s approach to outlining the future is basically “how can we fix known problems and introduce some obvious improvements” (but please do watch the video to get the full story!). What I’d like to present here is an alternate approach: imagine an ideal scientific computing environment in 2015, and try to approximate it by an evolution of the current SciPy ecosystem while retaining a sane level of backwards compatibility. Think of it as the equivalent of Python 3 at the level of the core of the scientific ecosystem.

One aspect that has changed quite a bit over 20 years is the interaction between Python and low-level code. Back then, Python had an excellent C interface, which also worked well for Fortran 77 code, and the ease of wrapping C and Fortran libraries was one of the major reasons for Python’s success in scientific computing. We have seen a few generations of wrapper code generators, starting with SWIG, and the idea of a hybrid language called Pyrex that was the ancestor of today’s Cython. LLVM has been a major game changer, because it permits low-level code to be generated and compiled on-the-fly, without explicitly generating wrappers and compiling code. While wrapping C/C++/Fortran libraries still remains important, the equally important task of writing low-level code for performance can be handled much better with such tools. Numba is perhaps the best-known LLVM-based code generator in the Python world, providing JIT compilation for a language that is very similar to a subset of Python. But Numba is also an example of the mindset that has led to the current mess: take the existing ecosystem as given, and add a piece to it that solves a specific problem.
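
To make this concrete, here is a minimal sketch of what using Numba looks like (my own toy example, built around Numba’s documented @jit decorator): a plain Python function with explicit loops is compiled to machine code via LLVM the first time it is called, with no wrapper files to generate or compile separately.

    # A naive pairwise-distance computation: slow as interpreted Python,
    # fast once Numba compiles it to machine code through LLVM.
    import numpy as np
    from numba import jit

    @jit(nopython=True)        # compile on first call, bypassing the interpreter
    def pairwise_distances(points):
        n, d = points.shape
        out = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(d):
                    diff = points[i, k] - points[j, k]
                    s += diff * diff
                out[i, j] = np.sqrt(s)
        return out

    distances = pairwise_distances(np.random.rand(100, 3))  # JIT compilation happens here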

So how would one approach the high-/low-level interface today, having gained experience with LLVM and PyPy? Some claim that the distinction doesn’t make sense any more. The authors of the Julia language, for example, claim that it “avoids the two-language problem”. However, as I have pointed out on this blog, Julia is fundamentally a performance-oriented low-level language, in spite of having two features, interactivity and automatic memory management, that are traditionally associated with high-level languages. By the way, I don’t believe the idea of a both-high-and-low-level language is worth pursuing for scientific computing. The closest realization of that idea is Common Lisp, which is as high-level as Python, perhaps more so, and also as low-level as Julia, but at the cost of being a very complex language with a very steep learning curve, especially for mastering the low-level aspects. Having two clearly distinct language levels makes it possible to keep both of them manageable, and the separation line serves as a clear warning sign to scientists, who should not attempt to cross it without first acquiring some serious knowledge about software development.

The model to follow, in my opinion, is that of Lush and Terra. They embed a low-level language into a high-level language in such a way that the low-level code is a data structure at the high level. You can use literals for this data structure and get the equivalent of Numba. But you can also write code generators that specialize low-level code for a given problem. Specialization allows both optimization and simplification, both of which are desirable. The low-level language would have arrays as a primitive data structure, and both NumPy and Pandas, or evolutions such as xray, would become shallow Python APIs to such low-level array functionality. I think this is much more powerful than today’s Numba building on NumPy. Moreover, wrapper generators become plain Python code, making the construction of interfaces to complex libraries (think of h5py) much easier than it is today. Think of it as ctypes on steroids. For more examples of what one could do with such a system, look at metaprogramming in Julia, which is exactly the same idea.
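
To make the idea more tangible, here is a toy sketch (entirely made up for illustration, not any existing library) of low-level code as a high-level data structure: expressions are plain Python tuples, one ordinary Python function specializes them by folding away known constants, and another emits low-level source.

    # Low-level expressions as Python data: (operator, left, right) tuples.

    def specialize(expr, bindings):
        # Substitute known constants and simplify; this is where
        # specialization yields both optimization and simplification.
        if isinstance(expr, tuple):
            op = expr[0]
            left = specialize(expr[1], bindings)
            right = specialize(expr[2], bindings)
            if isinstance(left, float) and isinstance(right, float):
                return {"+": left + right, "*": left * right}[op]  # constant folding
            if op == "+" and right == 0.0:
                return left        # x + 0  ->  x
            if op == "*" and left == 1.0:
                return right       # 1 * x  ->  x
            return (op, left, right)
        return bindings.get(expr, expr)

    def emit(expr):
        # Translate an expression tree into C-like source text.
        if isinstance(expr, tuple):
            op, left, right = expr
            return "(%s %s %s)" % (emit(left), op, emit(right))
        return str(expr)

    # The expression a*x + b as data, specialized for fixed coefficients:
    poly = ("+", ("*", "a", "x"), "b")
    print(emit(specialize(poly, {"a": 1.0, "b": 0.0})))   # prints: x

A real system would of course emit and compile genuine low-level code (e.g., through LLVM) rather than strings, but the principle is the same: because the low-level program is just data, arbitrary Python code can construct, inspect, and transform it.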

Another aspect that Jake talks about in some detail is visualization. There again, two decades of code written by people scratching their own itches have led to a mess of different libraries with a lot of overlap and no clear distinctive features. For cleaning it up, I propose the same approach: what are the needs and the available technologies for scientific visualization in 2015? We clearly want to profit from all the Web-based technologies, both for portability (think of mobile platforms) and for integration with Jupyter notebooks. But we also need to be able to integrate visualization into GUI applications. From the API point of view, we need something simple for simple plots (Toyplot looks promising), but also more sophisticated APIs for high-volume data visualization. The main barrier to overcome, in my opinion, is the current dominance of Matplotlib, which isn’t particularly good in any of the categories I have outlined. Personally, I don’t believe that any evolution of Matplotlib can lead to something pleasant to use, but I’d of course be happy to be proven wrong.
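
As an illustration of the “simple for simple plots” end of the spectrum, this is roughly what Toyplot’s convenience API looks like (this sketch follows the style of its documentation; I have not verified it against every version):

    import numpy as np
    import toyplot

    y = np.linspace(0, 1, 50) ** 2
    canvas, axes, mark = toyplot.plot(y)   # renders to HTML5, so it works in browsers and notebooks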

Perhaps the nastiest problem that Jake addresses is packaging. He seems to believe that conda is the solution, but I don’t quite agree with that. Unless I missed some recent developments, a Python package prepared for installation through conda can only be used easily with a Python distribution built on conda as well. And that means Anaconda, because it’s the only one. Since Anaconda is not Open Source, there is no way one can build a Python installation from scratch using conda. Of course, Anaconda is perfectly fine for many users. But if you need something that Anaconda does not provide, you may not be able to add it yourself. On the Mac, for example, I cannot compile C extensions compatible with Anaconda, because Mac Anaconda is built for compatibility with ancient OSX versions that are not supported by a standard XCode installation. Presumably that can be fixed, but I suspect that would be a major headache. And then, what about platforms unsupported by Anaconda?

Unfortunately I will have to leave this at the rant level, because I have no better proposal to make. Packaging has always been a mess, and will likely remain a mess, because the underlying platforms on which Python builds are already a mess. And it’s becoming more and more of a problem as scientific Python packages grow in size and features. It’s gotten to the point where I am not motivated to figure out how to install the latest version of nMOLDYN on my Mac, although I am a co-author of that program. The previous version is good enough for my own needs, and much simpler to install, though already a bit tricky. That’s how you get to love the command line… in 2015.


13 Comments on “The future of the Scientific Python ecosystem”

  1. Josh Says:

    Unless I’m misunderstanding your argument: while packages developed for Conda can’t currently be installed by other package management systems, Conda itself is totally open source (https://github.com/conda) and Continuum provides the recipes for most of the packages it distributes (https://github.com/conda/conda-recipes). I’ve also built C extensions on OSX that just use the standard gcc distributed as part of the command line tools, which you get either via XCode or as a standalone installer.

    Or is there a subtler issue that I am not fully appreciating?


    • Conda is indeed open source, but that’s not very relevant in my opinion. You need a Python installation based on conda if you want to distribute add-ons through conda.

      If you got gcc with XCode, you have a much older version of MacOSX than I have. With both 10.9 and 10.10, clang is the only compiler in XCode. I haven’t yet managed to compile C extension modules on these two systems. Any attempt ends with a linker error message concerning the shared libraries in Anaconda.

      • Josh Says:

        I do most of my development on OSX 10.9.5, so the gcc I’m using is the LLVM-based clang flavor. I tend not to compile C extensions that have been written by hand, but I do use Cython in combination with setuptools to link against other shared libraries almost daily, and that works without issue. Cython is generating a valid Python C-extension, so I’m wondering what the exact difference is. Perhaps building a large Python project with lots of C-extensions via a different build system like Make or CMake would have important differences, but you might be able to crib the correct flags by mimicking what setuptools is doing. I’ll admit to having no experience with such a build system.

        Overall, for me at least, Conda has solved the packaging/sandboxed environments issues that were persistent before.


      • I had another look at my notes. The problem is linking to shared libraries outside of Python itself. For example, I didn’t manage to install a C extension module that uses the netCDF shared library shipped with Anaconda. Pure Cython (or C) modules with no dependencies other than Python work fine for me as well.

      • Aaron Meurer Says:

        This is an issue we are working on. It is irrelevant to the openness/closedness of conda and the Anaconda packages.

  2. Dan Farmer Says:

    Interesting as usual. Funny enough, Torch7 is the successor of Lush; it seems they went closer to the scipy route than the lispy route.

  3. Aaron Meurer Says:

    If you want to, you can build your own Python distribution completely from scratch using conda and conda-build (both completely BSD open source). It’s not super easy (there are some bootstrapping issues to consider), and a complete waste of time for 99% of people. That’s why Continuum provides pre-built binaries in the form of Anaconda and Miniconda. If you want to build your own packages reusing the base conda ecosystem (i.e., not bootstrapping yourself), it is very easy. All you need is a conda recipe (there are already a ton at https://github.com/conda/conda-recipes).

    If something doesn’t work with conda, please open an issue at https://github.com/conda/conda/issues. If something seems to be broken with one of the Continuum built packages, please open an issue at https://github.com/continuumio/anaconda-issues.


    • The problems I described are documented in issue #50 and issue #46 for Anaconda. Both date from December 2013 and are still open.

      It is true that anyone can build a Python distribution based on conda, just like anyone can write a replacement for setuptools. Neither is a realistic project for most scientific Python users.

      Anaconda is a great Python distribution, which has in particular made my teaching much easier. I certainly don’t want to criticize it. But no particular binary distribution of Python can ever be “the answer” to the packaging problem. Sometimes you just need to build from source.

  4. teoliphant Says:

    Konrad, your opinion is always respected, but on the packaging side you are completely overlooking several things and causing FUD where none should exist.

    The basic tools for reproducing your environment with conda packages have been provided and it’s all open source. Actually reproducing an environment will always require effort from *somebody*, but it’s much nicer if we use common standards. Conda has a standard for meta-data and for binary layout that is simple, completely open, and easy for anyone to build tools around.

    Anaconda is proof that it can be done, but it is certainly not the only one that needs to exist. At least two other distributions using conda have been created.

    In addition, with pip install conda; conda create … you can build your own conda-based distribution without using Anaconda at all. There may be some minor issues to work out, but it’s all completely doable.

    Platforms are not difficult to add to conda. We’ve done Power8 and ARM already — patches to conda to support others are welcome and encouraged. In addition, innumerable minor platforms with “features” are already available.

    The answers to your concerns are not to discount or ignore conda, but to augment the open source community around conda. If you want to ignore Anaconda that’s fine, but don’t discount a solution that works.


    • I pretty much agree with all of this. As I already replied to Aaron’s comment, I don’t see anything fundamentally wrong with either conda or Anaconda, but even a small detail can be a show-stopper. As someone who needs development snapshots of h5py, I can’t use a Python distribution for daily work in which I cannot install h5py from source. I understand that this probably doesn’t matter for 99% of scientific Python users. The only point I wanted to make is that there is not “the solution” to the packaging problem. The real problem can probably only be solved at the OS level anyway, and only after everyone agrees on a single OS… so I am not holding my breath.

      To see the full problem, you have to talk to “plain scientists” who use our software without being particularly interested in the technical details. Most of the people I talk to all day long would not even understand my blog post, nor your comment on it, nor the reply I am just writing. They don’t understand the problem to which conda environments are a partial solution, and yet they suffer from its consequences. For them, building their own conda-based distribution is *not* doable, nor would they see if, how, and when this could be a solution to their problem.

      • ijstokes Says:

        I am trying to understand whether we’re discussing a problem that is faced by 99% of scientific Python users, 1% of scientific Python users, or imagining a future solution that will work for 100% of scientific Python users. My background is as a computational scientist, with experience similar to the author’s (protein structure determination and computational molecular biology). Many scientists relying on computational infrastructure are simply faced with the often insurmountable challenge of just getting a piece of software to work. I did a postdoc with SBGrid at Harvard Medical School, a consortium set up to support 5000+ researchers so they can benefit from having a stack of over 200 software tools that “just work”, rather than having to figure out how to download, configure, and compile complex dependencies. They are not trying to run the nightly commit of a particular package. Conda is designed to help those same people, so that a research computing team (either a third party, or the maintainers of the software themselves) can create an easily installable binary software package that a non-software-engineer can get onto their laptop or lab servers without having to learn the nuances of Makefiles, autoconf, pkgconfig, and gcc. And as was already pointed out, anyone can use Conda with a standard Python distribution (“pip install conda”), and it will work alongside other packages that are not conda-based.


      • Please look at the start of this discussion in my blog post: My understanding was that Jake considered conda “the solution” to the Python packaging problem, and I disagreed with that because there are situations, in my own personal experience, where conda is not a solution. I have no idea which percentage of users/installations/whatever are concerned, and I don’t care. I *suspect* (but I’d be happy to be proven wrong) that many people involved in software development run into similar problems as those I have described. I’d also be happy to be proven wrong in claiming that my problem has no solution within conda.

  5. ijstokes Says:

    I missed my disclosure: I am now a Computational Scientist at Continuum.

