The ultimate calculator for Android and iOS

Posted September 7, 2012 by Konrad Hinsen
Categories: Computational science, Programming

Calculators are among the most popular applications for smartphones, and therefore it is not surprising that the Google Play Store has more than 1000 calculators for the Android platform. Having used HP’s scientific calculators for more than 20 years, I picked RealCalc when I got my Android phone and set it to RPN mode. It works fine, I have no complaints about it. But I no longer use it because I found something much more powerful.

It’s called “J”, which isn’t exactly a very descriptive name. And that’s probably a good idea, because describing it is not so easy. J is much more than a calculator, but it does the calculator job very well. It’s actually a full programming language, but one that differs substantially from everything else that goes by that label. The best description for J I can come up with is “executable mathematical notation”. You type an expression, and you get the result. That’s in fact not very different from working interactively with Python or Matlab, except that the expressions are very different. You can write traditional programs in J, using loops, conditionals, etc., but you can get a lot of work done without ever using these features.

The basic data structure in J is the array, which can have any number of dimensions. Array elements can be numbers, characters, or other arrays. Numbers (zero-dimensional arrays) and text strings (one-dimensional arrays of characters) are just special cases. In J jargon, which takes its inspiration from linguistics, data items are called “nouns”. Standard mathematical operators (such as + or -) are called “verbs” and can have one or two arguments (one left, one right). An expression is called a “sentence”. There are no precedence rules, the right argument of any verb being everything to its right. Given the large number of verbs in J, this initially unfamiliar rule makes a lot of sense. A simple example (also showing the use of arrays) is

   2 * 3 + 10 20 30
26 46 66

Up to here, J expressions are not very different from Python or Matlab expressions. What J doesn’t have is functions with the familiar f(x, y, z) syntax, accepting any number of arguments. There are only verbs, with one or two arguments. But what makes J really different from the well-known languages for scientific computing are the “parts of speech” that have no simple equivalent elsewhere: adverbs and conjunctions.

An adverb takes a verb argument and produces a derived verb from it. For example, the adverb ~ takes a two-argument verb (a dyad in J jargon) and turns it into a one-argument verb (a monad) that’s equivalent to using the dyad with two equal arguments. With + standing for plain addition, +~ thus doubles its argument:

   +~ 1 5 10 20
2 10 20 40

meaning it is the same as

   1 5 10 20 + 1 5 10 20
2 10 20 40

A conjunction combines a verb with a noun or another verb to produce a derived verb. An example is ^:, the power conjunction, which applies a verb several times:

   +~(^:2) 1 5 10 20
4 20 40 80
   +~(^:3) 1 5 10 20
8 40 80 160

The parentheses are required to separate the argument of the power conjunction (2 or 3) from the array that is the argument to the resulting derived verb. To see the real power of the power conjunction, consider that it accepts negative arguments as well:

   +~(^:_1) 1 5 10 20
0.5 2.5 5 10

You read that right: J can figure out that the inverse of adding a number to itself is dividing that number by two!
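The same idea can be sketched in Python, with one important difference: Python cannot derive the inverse of a function, so it has to be supplied by hand (all names below are hypothetical, for illustration only):

```python
# A rough Python analogue of J's power conjunction ^: .
# Unlike J, the inverse cannot be derived automatically;
# it must be passed in explicitly.

def power(f, n, inverse=None):
    """Return a function applying f n times, or its inverse |n| times if n < 0."""
    if n < 0:
        if inverse is None:
            raise ValueError("no inverse supplied")
        f, n = inverse, -n
    def iterated(x):
        for _ in range(n):
            x = f(x)
        return x
    return iterated

double = lambda x: x + x                                # J's +~
halve = power(double, -1, inverse=lambda x: x / 2)      # J's +~(^:_1)

print(power(double, 2)(5))   # 20, like +~(^:2)
print(halve(5))              # 2.5, like +~(^:_1)
```

The J version remains far shorter, of course, because the inversion is built into the language.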

Pretty much any programming language permits you to assign values to names for re-use in later expressions. J is no exception:

   data =. 1 5 10 20
   double =. +~
   double data
2 10 20 40
   inv =. ^:_1
   halve =. double inv
   halve data
0.5 2.5 5 10

As you can see, names can be given not just to nouns (i.e. data), but also to verbs, adverbs, and conjunctions. Most J programs are just pieces of expressions that are assigned to names. Which means that the short summary of J that I have given here could well be all you ever need to know about the language – apart from the fact that you will have to acquire a working knowledge of many more verbs, adverbs, and conjunctions.

Before you rush off to the Play Store looking for J, let me add that J is not yet there, although it’s supposed to arrive soon. For now, you have to download the APK and install it yourself, using your preferred Android file manager. I should also point out that J is not just for Android. It’s been around for more than 20 years, and you can get J for all the common computing platforms from Jsoftware. There’s also an iOS version for the iPhone and iPad. J’s extreme terseness is a perfect fit for smartphones, where screen space is a scarce resource and where every character you don’t have to type saves you a lot of time.

The Nix package manager in computational science

Posted May 14, 2012 by Konrad Hinsen
Categories: Reproducible research

In an earlier post, I mentioned the Nix package management system as a candidate for ensuring reproducibility in computational science. What distinguishes Nix from the better known package managers (Debian, RPM, …) is that it permits the installation of different versions of the same package in parallel, with a dependency tracking system that refers to a precise version of everything, including the versions of the development tools (compilers, …) that were used to build the libraries and executables. Nix thus remembers for each package the complete details of how it can be reconstructed, which is what we would like to see for ensuring reproducibility.

There are, however, two caveats. First of all, Nix was designed for software installation management and not for computation. While in principle one could define the results (figures, tables, datasets) of some computation as a Nix package and perform the computation by installing the package, such an approach is quite cumbersome because the Nix support tools were designed with a different task in mind. However, computation-specific support tools would probably suffice to fix this. Second, while the design of Nix looks quite sound, it is a young project with much less manpower behind it than the big package managers of the Linux world. This means there are fewer package definitions and they are overall less reliable. For example, I haven’t yet managed to install my research computing environment (Python, NumPy, matplotlib, plus a few more packages) using Nix under MacOS X, because some packages simply fail to build. Again, this is not an insurmountable problem, but fixing it requires some serious effort.

The Nix documentation is pretty good at describing how to use the package manager and the collection of package definitions for Linux and MacOS X named Nixpkgs. It is not so good at giving a basic understanding of how Nix works, which becomes important when you want to use it for something other than traditional package management. The following overview is the result of my own explorations of Nix. I am not a Nix authority, so be warned that there may be mistakes or misunderstandings.

At the heart of Nix is the “Nix store”, a central database where everything managed by Nix is kept. Its default location is /nix/store, and if you look at it you see an overwhelmingly long list of cryptic filenames. Let’s zoom in on something to see what’s going on. Here is what ls -l /nix/store/*zlib* shows on my machine:

-r--r--r-- 1 hinsen staff 1000 Jan  1  1970
-r--r--r-- 1 hinsen staff 1181 Jan  1  1970
dr-xr-xr-x 5 hinsen staff  170 Jan  1  1970
-r--r--r-- 1 hinsen staff 1000 Jan  1  1970
-r--r--r-- 1 hinsen staff 1686 Jan  1  1970

The single directory in that list actually contains the zlib installation in the familiar Unix file layout that you find under /usr or /usr/local:

~> ls -R /nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6
include  lib  share

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/include:
zconf.h  zlib.h

/nix/store/mrdqnzzr80rkfnm59q6aywdba6776f66-zlib-1.2.6/lib:
libz.1.2.6.dylib  libz.1.dylib  libz.a  libz.dylib  pkgconfig
Note that it contains just zlib, and nothing else, in particular not zlib‘s dependencies. Each library or application has its own directory in the Nix store.

Next, let’s look at all the other files, those with the extension .drv (for “derivation”, a Nix term for any artefact derived from human-provided input). There are three files that end in zlib-1.2.6.drv and one that ends in zlib-1.2.6.tar.gz.drv. Let’s look at the contents of the last one first. I have made it more readable by adding whitespace:




    ("impureEnvVars","http_proxy https_proxy ftp_proxy all_proxy no_proxy NIX_CURL_FLAGS NIX_HASHED_MIRRORS NIX_MIRRORS_apache NIX_MIRRORS_bitlbee NIX_MIRRORS_cpan NIX_MIRRORS_debian NIX_MIRRORS_fedora NIX_MIRRORS_gcc NIX_MIRRORS_gentoo NIX_MIRRORS_gnome NIX_MIRRORS_gnu NIX_MIRRORS_gnupg NIX_MIRRORS_hashedMirrors NIX_MIRRORS_imagemagick NIX_MIRRORS_kde NIX_MIRRORS_kernel NIX_MIRRORS_metalab NIX_MIRRORS_oldsuse NIX_MIRRORS_opensuse NIX_MIRRORS_postgresql NIX_MIRRORS_savannah NIX_MIRRORS_sf NIX_MIRRORS_sourceforge NIX_MIRRORS_ubuntu NIX_MIRRORS_xorg"),
    ("urls"," mirror://sourceforge/libpng/zlib/1.2.6/zlib-1.2.6.tar.gz")])

If that looks like a computational expression in a programming language, that’s because it is. Don’t worry, it’s not something you are expected to write yourself; these expressions are created from package definitions written in a more user-friendly syntax called “Nix expressions”, which is well documented in the Nix manual. The expression shown above defines how to make (or “realise” in Nix jargon) the derivation /nix/store/s9qgdh7g22nx433y3lk62igm5zh48dxj-zlib-1.2.6.tar.gz, which is a rather simple one because the file is simply downloaded and verified against a known checksum. But even such a simple derivation has dependencies: the “standard environment” stdenv and the list of download mirror sites, mirrors-list.

It’s time to say something about those funny 32-character prefixes in all the file names in the Nix store. You may have noticed that the zlib file list above contains two entries for zlib-1.2.6.drv that are identical except for this prefix. It looks as if the prefix is there to distinguish things that would otherwise be identical. This is true, and the information encoded in the prefix (which is a hash code) is the complete set of dependencies. The two zlib derivations differ in the version of the standard environment they were built with. I have both of these in my Nix store because I have played around with different releases of Nixpkgs. Nix really tries to keep track of every single dependency, including the exact versions of the various tools (mainly compilers) that were used in building a binary installation. That means you can keep lots of different versions of every single item on your system at the same time, and trace back exactly how they were built. You can also send a copy of the relevant derivation files (those with the .drv extension) to someone else, who can reproduce the exact same environment by “realising” those derivations again.
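The principle behind those prefixes can be sketched in a few lines of Python. This is not Nix’s actual hashing scheme, just an illustration of the idea that a store path is a function of the complete dependency set:

```python
import hashlib

def store_path(name, version, dependencies):
    """Illustrative sketch only (NOT Nix's real algorithm): derive a
    store-path prefix from the package identity and its full set of
    dependency store paths."""
    key = "\n".join([name, version] + sorted(dependencies))
    digest = hashlib.sha256(key.encode()).hexdigest()[:32]
    return f"/nix/store/{digest}-{name}-{version}"

stdenv_a = store_path("stdenv", "1", [])
stdenv_b = store_path("stdenv", "2", [])

# The same zlib built against two different standard environments
# ends up under two different store paths:
print(store_path("zlib", "1.2.6", [stdenv_a]))
print(store_path("zlib", "1.2.6", [stdenv_b]))
```

Because the dependencies’ paths themselves contain hashes of *their* dependencies, a change anywhere in the build chain propagates all the way up.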

With so many zlibs floating around, which one does Nix use when you ask it to install some application that uses zlib? The one you specify. When some application requires zlib as a dependency, you have to tell Nix exactly which zlib derivation you want to be used. You don’t normally do this manually for every single build (though you could), you’d rather use a coherent set of package definitions (such as Nixpkgs) that specifies all the interdependencies among hundreds of packages. The package definitions take the form of “Nix expressions”, which are written in a language specifically designed for this purpose. Files containing Nix expressions have the extension .nix. Since the language is rather well documented in the Nix manual, I won’t say any more about it here. A good starting point is to explore Nixpkgs. It helps to know that the central file is pkgs/top-level/all-packages.nix. This file imports the definitions of individual packages from their respective files and makes a consistent package collection from them. When you build a particular derivation from Nixpkgs, only the packages listed explicitly as its dependencies are available in the build environment that is set up specifically for this build operation. No “default library” (such as /usr/lib) is used at all.

There is one more layer to Nix, whose role is twofold: making it convenient for users to work with programs installed through Nix, and permitting the removal of packages that were installed but are no longer needed.
Let’s start with the second aspect because it is the simpler one: packages can be removed as soon as nobody needs them any more. This requires a way to figure out which packages are still needed. Obviously the packages that some user on the system wants to access are “needed”, and that’s why cleanup is related to user profiles which I will cover in a minute. The remaining needed packages are the dependencies of other needed packages. So once we know the packages that all users put together request to use, we can figure out which packages can safely be deleted. This clean-up operation is called “garbage collection” and handled by the command nix-store --gc.
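The reachability computation behind garbage collection is a simple graph traversal. A minimal sketch in Python, with a made-up store:

```python
def live_paths(roots, dependencies):
    """Compute the set of store paths reachable from the GC roots.
    Everything else can safely be deleted (the idea behind
    nix-store --gc, much simplified)."""
    live = set()
    stack = list(roots)
    while stack:
        path = stack.pop()
        if path not in live:
            live.add(path)
            stack.extend(dependencies.get(path, []))
    return live

# A toy store: "app" is requested by some user profile (a GC root),
# "old-zlib" is a leftover from an earlier installation.
deps = {"app": ["zlib", "stdenv"], "zlib": ["stdenv"], "old-zlib": ["stdenv"]}
store = set(deps) | {"stdenv"}

live = live_paths(["app"], deps)
print(sorted(store - live))   # ['old-zlib'] can be garbage-collected
```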

Nix user environments are managed using the command nix-env, and if you don’t care about how Nix works, that command is the only one you may ever need. Each user has his/her own environment, of course, which consists mainly of a directory named $HOME/.nix-profile. That directory contains subdirectories called bin, lib, man etc. whose names should sound familiar. They contain nothing but symbolic links into the Nix store. These links define which package the user actually accesses, by putting $HOME/.nix-profile/bin on the PATH environment variable. When you use nix-env to install a package, Nix builds it and puts it into the Nix store (unless it’s already there), and then creates symbolic links in your Nix profile, which may replace links to some different version of a package. It is important to understand that your user profile never enters into the build process of any Nix derivation. Your profile is exclusively for your own use and has no impact on Nix package management other than protecting the packages you use from being removed during garbage collection.
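Stripped of all detail, “installing into a profile” is nothing but symlink manipulation. A hypothetical sketch (the paths are stand-ins, not real Nix locations):

```python
import os
import tempfile

# Stand-ins for /nix/store and ~/.nix-profile/bin:
store = tempfile.mkdtemp()
profile_bin = tempfile.mkdtemp()

# A binary living in the store, under its hash-prefixed path:
binary = os.path.join(store, "abc123-hello-1.0-hello")
with open(binary, "w") as f:
    f.write("#!/bin/sh\necho hello\n")

# "Installing" hello into the profile is just creating a symlink:
link = os.path.join(profile_bin, "hello")
os.symlink(binary, link)

# Upgrading would merely repoint the link at a different store path;
# the old store path stays intact until garbage collection.
print(os.readlink(link))
```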

So much for a first report on my exploration of Nix. I will continue trying to get my computational environment built with Nix, so that I can start to explore how to use it for reproducible computations. Watch this space for news.

PS: After I published this post initially, the friendly people on the Nix mailing list pointed out some additional material for learning about Nix. First of all, there is Eelco Dolstra’s thesis entitled “The Purely Functional Software Deployment Model”, which is what you should read if you really want to know everything about Nix. There’s also Sander van der Burg’s blog which has some very detailed posts about Nix and what it can be used for. You could start with this introduction.

Unifying version control and dependency management for reproducible research

Posted April 10, 2012 by Konrad Hinsen
Categories: Programming, Reproducible research


When the Greek philosopher Heraclitus pronounced his famous “πάντα ῥεῖ” (everything flows), he most probably was not thinking about software. But it applies to software as much as to other aspects of life: software is in perpetual change, being modified to remove bugs, add features, and adapt it to changing environments. The management of change is now a well-established part of software engineering, with the most emblematic tool being version control. If you are developing software without using version control, stop reading this immediately and learn about Mercurial or Git, the two best version control systems available today. That’s way more important than reading the rest of this post.

Software developers use version control to keep track of the evolution of their software, to coordinate team development, and to manage experimental features. But version control is also of interest for software users: it permits them to refer to a specific version of a piece of software they use in a unique and reproducible way, even if that version is not the current one, nor perhaps even an official numbered release. In fact, official numbered releases are becoming a relic of the past. They make little sense in an Open Source universe where everyone has access to source code repositories under version control. In that situation, an official release is nothing but a bookmark pointing to a specific commit number. There is no need for a release number.

Why would you want to refer to a specific version of a piece of software, rather than always use the latest one? There are many reasons. As software evolves, some bugs get fixed but others sneak in. You may prefer the bugs you know to the ones that could surprise you. Sometimes later versions of some software are not fully compatible with their predecessors, be it by design or by mistake. And even if you want to use the very latest version at any time, you might still want to note which version you used for a specific application. In scientific computing, this is one of the fundamental principles of reproducible research: note carefully, and publish, the exact versions of all pieces of software that were used for obtaining any published research result. It’s the only way for you and others to be able to understand exactly what happened when you look at your work many years later.
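A minimal version of this practice is easy to automate. In Python, for example, the exact interpreter and package versions behind a result can be recorded in a few lines (a sketch using the standard importlib.metadata module; the package names are just examples):

```python
import sys
from importlib import metadata

def environment_record(packages):
    """Record the exact interpreter and package versions used
    to produce a result, for later publication alongside it."""
    record = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            record[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            record[name] = "not installed"
    return record

print(environment_record(["numpy", "matplotlib"]))
```

Storing such a record next to every published figure or table is a small step, but it is exactly the information you will need years later.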

Another undeniable reality of modern software, in particular in the Open Source universe, is that it’s modular. Developers use other people’s software, especially if it’s well written and has the reputation of being reliable, rather than reinventing the wheel. The typical installation instructions of a piece of Open Source software start with a list of dependencies, i.e. packages you have to install before you can install the current one. And of course the packages in the dependency list have their own dependency list. The number of packages to install can be overwhelming. The difficulties of dependency management are so widespread that the term “dependency hell” has been coined to refer to them.

Systems programmers have come up with a solution to that problem as well: dependency management tools, better known as package managers. Such tools keep a database of what is installed and which package depends on which other ones. The well-known Linux distributions are based on such package managers; the ones developed by Debian and RedHat are the most popular and are now used by other distributions as well. For MacOS X, MacPorts and Fink are the two most popular package managers, and I suspect that the Windows world has its own.

One of the major headaches that many computer users face is that version management and dependency management don’t cooperate. While most package managers permit stating a minimal version number for a dependency, they don’t permit prescribing a precise version number. There is a good reason for this: the way software installation is traditionally managed on Unix systems makes it impossible to install multiple versions of the same package in parallel. If packages A and B both depend on C, but require different versions of it, there is no simple solution. Today’s package managers sweep this problem under the rug and pretend that higher version numbers are always at least as good as their predecessors. They will therefore install the higher of the two version numbers required by A and B, forcing one of them to use a version different from its preference.
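The problem is easy to reproduce in a few lines of Python. This toy resolver implements the “higher version wins” policy and silently breaks one of its clients:

```python
# Sketch of the diamond-dependency problem: A and B both need C,
# but in different versions. A traditional package manager keeps a
# single installed copy of C and picks the highest version requested.

requirements = {
    "A": {"C": "1.2"},   # A was tested against C 1.2
    "B": {"C": "2.0"},   # B needs C 2.0
}

def naive_resolve(requirements):
    installed = {}
    for pkg, deps in requirements.items():
        for dep, version in deps.items():
            # "higher is always at least as good" -- the assumption
            # that forces A to run against a version it never asked for
            if dep not in installed or version > installed[dep]:
                installed[dep] = version
    return installed

print(naive_resolve(requirements))   # {'C': '2.0'}: A now gets the wrong C
```

(The string comparison of version numbers is itself a simplification, but it suffices for this example.)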

Anyone who has been using computers intensively for a few years has probably run into such a problem, which manifests itself by some program not working correctly any more after another one, seemingly unrelated, has been installed. Another variant is that an installation fails because some dependency is available in a wrong version. Such problems are part of “dependency hell”.

This situation is particularly problematic for the computational scientist who cares about the reproducibility of computed results. At worst, verifying results from 2005 by comparing to results from 2009 can require two completely separate operating system installations running in separate virtual machines. Under such conditions, it is difficult to convince one’s colleagues to adopt reproducible research practices.

While I can’t propose a ready-to-use solution, I can point out some work that shows that there is hope for the future. One interesting tool is the Nix package manager, which works much like the package managers by Debian or RedHat, but permits installing multiple versions of the same package in parallel, and registers dependencies with precise versions. It could be used as a starting point for managing software for reproducible research, the main advantage being that it should work with all existing software. The next step would be to make each result dataset or figure a separate “package” whose complete dependency list (software and datasets) is managed by Nix with references to precise version numbers. I am currently exploring this approach; watch this space for news about my progress.

For a system even better suited to the needs of reproducible computational science, I refer to my own ActivePapers framework, which combines dependency management and version control for code and data with mechanisms for publishing code+data+documentation packages and for re-using code from other publications in a secure way. I have to admit that it has a major drawback as well: it requires all code to run on the Java Virtual Machine (in order to guarantee portability and secure execution), which unfortunately means that most of today’s scientific programs cannot be used. Time will tell if scientific computing will adopt some virtual machine in the future that will make such a system feasible in real life. Reproducible research might actually become a strong argument in favour of such a development.

Julia: a new language for scientific computing

Posted April 4, 2012 by Konrad Hinsen
Categories: Programming

New programming languages are probably invented every day, and even those that get developed and published are too numerous to mention. New programming languages developed specifically for science and engineering are very rare, however, and that’s why such a rare event deserves some publicity. A while ago, I saw an announcement for Julia, which describes itself as “a fresh approach to technical computing”. I couldn’t resist the temptation to download, install, and test-drive it. Here are my first impressions.

The languages used today for scientific computing can be grouped into four categories:

  • Traditional compiled languages optimized for number crunching. The big player in this category is of course Fortran, but some recent languages such as X10, Chapel, or Fortress are trying to challenge it.
  • Rapid-development domain-specific languages, usually interpreted. Well-known examples are Matlab and R.
  • General-purpose statically compiled languages with libraries for scientific computing. C and C++ come to mind immediately.
  • General-purpose dynamic languages with libraries for scientific computing. The number one here is Python with its vast library ecosystem.

What sets Julia apart is that it sits somewhere between the first two categories. It’s compiled, yet fully interactive: there is no separate compilation phase. It is statically typed, allowing for efficient compilation, but also has the default type “Any” that makes it work just like dynamically typed languages in the absence of type declarations. Type inference makes the mix even better. If that sounds like the best of both worlds, it actually is. It has been made possible by modern code transformation techniques that don’t really fit into the traditional categories of “compilers” and “interpreters”. Like many other recent languages and language implementations, Julia uses LLVM as its infrastructure for these code transformations.

Julia has a well-designed type system with a clear orientation towards maths and number crunching: there is support for complex numbers, and first-class array support. What may seem surprising is that Julia is not object-oriented. This is neither an oversight nor a nostalgic return to the days of Fortran 77, but a clear design decision. Julia has type hierarchies and function polymorphism with dispatch on the types of all arguments. For scientific applications (and arguably for some others), this is more useful than OO style method dispatch on a single value.

Another unusual feature of Julia is a metaprogramming system that is very similar to Lisp macros, although it is slightly more complicated by the fact that Julia has a traditional syntax layer, whereas Lisp represents code by data structures.

So much for a summary of the language. The real question is: does it live up to its promises? Before I try to answer that question, I would like to point out that Julia is a young language that is still in flux and for now has almost no development tool support. For many real-life problems, there is no really good solution at the moment, but it is clear that one could be provided; it just needs to be done. What I am trying to evaluate is not whether Julia is ready for real-life use (it is not), but whether there are any fundamental design problems.

The first question I asked myself is how well Julia can handle non-scientific applications. I just happened to see a blog post by John D. Cook explaining why it’s preferable to write math in a general-purpose language than to write non-math in a math language. My experience is exactly the same, and that’s why I have adopted Python for most of my scientific programming. The point is that any non-trivial program sooner or later requires solving non-math problems (I/O, Web publishing, GUIs, …). If you use a general-purpose language, you can usually just pick a suitable library and go ahead. With math-only languages such as Matlab, your options are limited, with interfacing to C code sometimes being the only way out.

So is it feasible to write Web servers or GUI libraries in Julia? I would say yes. All the features of general-purpose languages are there or under consideration (I am thinking in particular of namespaces there). With the exception of systems programming (device drivers and the like), pretty much every programming problem can be solved in Julia with no more effort than in most other languages. The real question is if it will happen. Julia is clearly aimed at scientists and engineers. It is probably good enough for doing Web development, but it has nothing to offer for Web developers compared to well-established languages. Will scientists and engineers develop their own Web servers in Julia? Will Web developers adopt Julia? I don’t know.

A somewhat related question is that of interfacing to other languages. That’s a quick way to make lots of existing code available. Julia has a C interface (which clearly needs better tool support, but I am confident that it will come), which can be used for other sufficiently C-like languages. It is not clear what effort will be required to interface Julia with languages like Python or Ruby. I don’t see why it couldn’t be done, but I can’t say yet whether the result will be pleasant to work with.

The second question I explored is how well Julia is suited to my application domain, which is molecular simulations and the analysis of experimental data. Doing molecular simulation in Julia looks perfectly feasible, although I didn’t really implement any non-trivial algorithm yet. What I concentrated on first is data analysis, because that’s where I could profit most from Julia’s advantages. The kinds of data I mainly deal with are (1) time series and frequency spectra and (2) volumetric data. For time series, Julia works just fine. My biggest stumbling block so far has been volumetric data.

Volumetric data is usually stored in a 3-dimensional array where each axis corresponds to one spatial dimension. Typical operations on such data are interpolation, selection of a plane (2-d subarray) or line (1-d subarray), element-wise multiplication of volume, plane, or line arrays, and sums over selected regions of the data. Using the general-purpose array systems I am familiar with (languages such as APL, libraries such as NumPy for Python), all of this is easy to handle.
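For reference, here is how these operations look in NumPy, where subarray extraction reduces the rank as one would expect (the array contents are made up):

```python
import numpy as np

# The operations described above, in NumPy: extracting planes and
# lines from volumetric data, elementwise products, and region sums.

volume = np.arange(24.).reshape(2, 3, 4)   # a small 3-d data set

plane = volume[0]        # a 2-d subarray, shape (3, 4)
line = volume[0, 1]      # a 1-d subarray, shape (4,)

weighted = volume * volume               # elementwise multiplication
region_sum = volume[:, 1:, :2].sum()     # sum over a selected region

print(plane.shape, line.shape, weighted.shape, region_sum)
```

Each numeric index removes exactly one dimension, so the rank of the result is always predictable from the indexing expression alone.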

Julia’s arrays are different, however. Apparently the developers’ priority was to make the transition to Julia easy for people coming from Matlab. Matlab is based on the principle that “everything is a matrix”, i.e. a two-dimensional array-like data structure. Matlab vectors come on two flavors, row and column vectors, which are actually matrices with a single row or column, respectively. Matlab scalars are considered 1×1 matrices. Julia is different because it has arrays of arbitrary dimension. However, array literals are made to resemble Matlab literals, and array operations are designed to behave as similar as possible to Matlab operations, in particular for linear algebra functions. In Julia, as in Matlab, matrix multiplication is considered more fundamental than elementwise multiplication of two arrays.

For someone used to arrays that are nothing more than data structures, the result looks a bit messy. Here are some examples:

julia> a = [1; 2]
[1, 2]

julia> size(a)
(2,)

julia> size(transpose(a))
(1,2)

julia> size(transpose(transpose(a)))
(2,1)
I’d expect that the transpose of the transpose is equal to the original array, but that’s not the case. But what does transpose do to a 3d array? Let’s see:

julia> a = [x+y+z | x=1:4, y=1:2, z = 1:3]
4x2x3 Int64 Array:

julia> transpose(a)
no method transpose(Array{Int64,3},)
 in method_missing at base.jl:60

OK, so it seems this was not considered important enough, but of course that can be fixed.

Next comes indexing:

julia> a = [1 2; 3 4]
2x2 Int64 Array:
 1  2
 3  4

julia> size(a)
(2,2)

julia> size(a[1, :])
(1,2)

julia> size(a[:, 1])
(2,1)

julia> size(a[1, 1])
()

Indexing a 2-d array with a single number (all other indices being the all-inclusive range :) yields a 2-d array. Indexing with two number indices yields a scalar. So how do I extract a 1-d array? This generalizes to higher dimensions: if the number of number indices is equal to the rank of the array, the result is a scalar, otherwise it’s an array of the same rank as the original.

Array literals aren’t that frequent in practice, but they are used a lot in development, for quickly testing functions. Here are some experiments:

julia> size([1 2])
(1,2)

julia> size([1; 2])
(2,)

julia> size([[1;2] ; [3;4]])
(4,)

julia> size([[1;2] [3;4]])
(2,2)

julia> size([[1 2] [3 4]])
(1,4)

julia> size([[[1 2] [3 4]] [[5 6] [7 8]]])
(1,8)

Can you guess the rules? Once you have them (or looked them up in the Julia manual), can you figure out how to write a 3-d array literal? I suspect it’s not possible.

Next, summing up array elements:

julia> sum([1; 2])
3

julia> sum([1 2; 3 4])
10
Apparently sum doesn’t care about the shape of my array; it always sums the individual elements. Then how do I do a sum over all the rows?
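For comparison, NumPy makes this choice explicit through an axis argument:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

print(a.sum())          # 10: all elements, like Julia's sum
print(a.sum(axis=0))    # [4 6]: collapse the rows
print(a.sum(axis=1))    # [3 7]: collapse the columns
```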

I have tried to convert some of my basic data manipulation code from Python/NumPy to Julia, but found that I always spent most of the time fighting against the built-in array operations, which are clearly not made for my kind of application. In some cases a change of attitude may be sufficient. It seems natural to me that a plane extracted from volumetric data should be a 2-d array, but maybe if I decide that it should be a 3-d array of “thickness” 1, everything will be easy.

I haven’t tried yet, because I know there are cases that cannot be dealt with in that way. Suppose I have a time series of volumetric data that I store in a 4-d array. Obviously I want to be able to apply functions written for static volumetric data (i.e. 3-d arrays) to an element of such a time series. Which means I do need a way to extract a 3-d array out of a 4-d array.

I hope that what I need is there and I just didn’t find it yet. Any suggestions are welcome. For now, I must conclude that test-driving Julia is a frustrating experience: the language holds so many promises, but fails for my needs due to superficial but practically very important problems.

Binary operators in Python

Posted March 29, 2012 by Konrad Hinsen
Categories: Programming

A two-hour train journey provided the opportunity to watch the video recording of the Panel with Guido van Rossum at the recent PyData Workshop. The lengthy discussion about PEP 225 (which proposes to add new operators to Python that would make it possible to have both elementwise and aggregate operations on the same objects, in particular both matrix and elementwise multiplication on arrays with a nice syntax) motivated me to write up my own thoughts about what’s wrong with operators in Python from my computational scientist’s point of view.

The real problem I see is that operators map to methods. In Python, a*b is just syntactic sugar for a.__mul__(b). This means that it’s the type of a that decides how to do the multiplication. The method implementing this operation can of course check the type of b, and it can even decide to give up and let b handle everything, in which case Python does b.__rmul__(a). But this is just a kludge to work around the real weakness of the operators-map-to-methods approach. Binary operators fundamentally require a dispatch on both types, the type of a and the type of b. What a*b should map to is __builtins__.__mul__(a, b), a global function that would then implement a binary dispatch operation. Implementing that dispatch would in fact be the real problem to solve, as Python currently has no multiple dispatch mechanisms at all.
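The current protocol can be illustrated with a toy class (the names here are of course made up for the example):

```python
class Measure(object):
    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        # Called for  measure * x ; give up on types we don't know about.
        if isinstance(other, Measure):
            return Measure(self.value * other.value)
        if isinstance(other, (int, float)):
            return Measure(self.value * other)
        return NotImplemented   # Python then tries other.__rmul__(self)

    def __rmul__(self, other):
        # Called for  x * measure  when type(x) didn't handle the product.
        return self.__mul__(other)

print((Measure(3) * 2).value)   # dispatch on the left operand: 6
print((2 * Measure(3)).value)   # int gives up, Python falls back to __rmul__: 6
```

The second line only works because `int.__mul__` returns NotImplemented for an unknown type, triggering the `__rmul__` fallback described above.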

But would multiple dispatch solve the issue addressed by PEP 225? Not directly, but it would make some of the alternatives mentioned there feasible. A proper multiple dispatch system would allow NumPy (or any other library) to decide what multiplication of its own objects by a number means, no matter whether the number is the first or the second factor.

More importantly, multiple dispatch would allow a major cleanup of many scientific packages, including NumPy, and even clean up the basic Python language by getting rid of __rmul__ and friends. NumPy’s current aggressive handling of binary operations is actually more of a problem for me than the lack of a nice syntax for matrix multiplication.

There are many details that would need to be discussed before binary dispatch could be proposed as a PEP. Of course the old method-based approach would need to remain in place as a fallback, to ensure compatibility with existing code. But the real work is defining a good multiple dispatch system that integrates well with Python’s dynamic type system and allows the right kind of extensibility. That same multiple dispatch mechanism could then also be made available for use in plain functions.
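A minimal sketch of what such a global binary dispatch could look like, purely illustrative and not a proposal for the actual mechanism:

```python
# A toy dispatch table for a global mul(a, b), keyed on the argument types.
_mul_table = {}

def defmul(type_a, type_b, impl):
    _mul_table[(type_a, type_b)] = impl

def mul(a, b):
    impl = _mul_table.get((type(a), type(b)))
    if impl is None:
        raise TypeError("no mul for %s, %s" % (type(a), type(b)))
    return impl(a, b)

defmul(int, int, lambda a, b: a * b)
defmul(int, list, lambda a, b: [a * x for x in b])
defmul(list, int, lambda a, b: [x * b for x in a])

print(mul(3, 4))          # 12
print(mul(2, [1, 2, 3]))  # [2, 4, 6] -- both orders handled, no __rmul__ kludge
```

A real system would of course have to handle subclasses, ambiguity, and extensibility; the point is only that both argument types participate in the dispatch on an equal footing.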

Python becomes a platform

Posted March 15, 2012 by Konrad Hinsen
Categories: Programming

The recent announcement of clojure-py made some noise in the Clojure community, but not, as far as I can tell, in the Python community. For those who haven’t heard of it before, clojure-py is an implementation of the Clojure language in Python, compiling Clojure code to bytecode for Python’s virtual machine. It’s still incomplete, but already usable if you can live with the subset of Clojure that has been implemented.

I think that this is an important event for the Python community, because it means that Python is no longer just a language, but is becoming a platform. One of the stated motivations of the clojure-py developers is to tap into the rich set of libraries that the Python ecosystem provides, in particular for scientific applications. Python is thus following the path that Java took before it: the Java virtual machine, initially designed only to support the Java language, became the target of many different language implementations which all provide interoperation with Java itself.

It will of course be interesting to see if more languages will follow once people realize it can be done. The prospect of speed through PyPy’s JIT, another stated motivation of the clojure-py community, could also get more language developers interested in Python as a platform.

Should Python programmers care about clojure-py? I’d say yes. Clojure is strong in two areas in which Python isn’t. One of them is metaprogramming, a feature absent from Python which Clojure had from the start through its Lisp heritage. The other feature is persistent immutable data structures, for which clojure-py provides an implementation in Python. Immutable data structures make for more robust code, in particular but not exclusively for concurrent applications.


Teaching parallel computing in Python

Posted February 6, 2012 by Konrad Hinsen
Categories: Programming

Every time I teach a class on parallel computing with Python using the multiprocessing module, I wonder if multiprocessing is really mature enough that I should recommend using it. I end up deciding for it, mostly because of the lack of better alternatives. But I am not happy at all with some features of multiprocessing, which are particularly nasty for non-experts in Python. That category typically includes everyone in my classes.

To illustrate the problem, I’ll start with a simple example script, the kind of example you put on a slide to start explaining how parallel computing works:

from multiprocessing import Pool
import numpy
pool = Pool()
print pool.map(numpy.sqrt, range(100))

Do you see the two bugs in this example? Look again. No, it’s nothing trivial such as a missing comma or inverted arguments in a function call. This is code that I would actually expect to work. But it doesn’t.

Imagine your typical student typing this script and running it. Here’s what happens:

Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 232, in _bootstrap
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 232, in _bootstrap
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 88, in run
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 352, in get
 return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 88, in run
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 352, in get
 return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new

Python experts will immediately see what’s wrong: numpy.sqrt is not picklable. This is mostly a historical accident. Nothing makes it impossible or even difficult to pickle C functions such as numpy.sqrt, but pickling was invented and implemented long before parallel computing, at a time when pickling functions was pretty pointless, so it was never supported. Implementing it today within the framework of Python’s existing pickle protocol is unfortunately not trivial, and that’s why it hasn’t been done.

Now try to explain this to non-experts who have basic Python knowledge and want to do parallel computing. It doesn’t hurt of course if they learn a bit about pickling, since it also has a performance impact on parallel programs. But due to restrictions such as this one, you have to explain this right at the start, although it would be better to leave this for the “advanced topics” part.

OK, you have gotten the message across, and your students fix the script:

from multiprocessing import Pool
import numpy

pool = Pool()

def square_root(x):
    return numpy.sqrt(x)

print pool.map(square_root, range(100))

And then run it:

Process PoolWorker-1:
Traceback (most recent call last):
Process PoolWorker-2:
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 232, in _bootstrap
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 232, in _bootstrap
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 88, in run
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 88, in run
 self._target(*self._args, **self._kwargs)
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 57, in worker
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 352, in get
 return recv()
AttributeError: 'module' object has no attribute 'square_root'
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/", line 352, in get
 return recv()
AttributeError: 'module' object has no attribute 'square_root'

At this point, even many Python experts would start scratching their heads. In order to understand what is going on, you have to know how multiprocessing creates its process pools. And since the answer (on Unix systems) is “fork”, you have to have a pretty good idea of Unix process creation to see the cause of the error. That understanding then leads to a trivial fix:

from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

pool = Pool()

print pool.map(square_root, range(100))

Success! It works! But… how do you explain this to your students?

To make it worse, this script works but is still not correct: it has a portability bug because it doesn’t work under Windows. So you add a section on Windows process management to the section on Unix process management. In the end, you have spent more time explaining the implementation restrictions in multiprocessing than how to use it. A great way to reinforce the popular belief that parallel computing is for experts only.
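For the record, the portable version needs the familiar main guard, because on Windows (which has no fork) the worker processes re-import the script from scratch. A sketch, using a plain Python square root instead of numpy to keep it self-contained:

```python
from multiprocessing import Pool

def square_root(x):
    return x ** 0.5

if __name__ == '__main__':
    # On Windows, worker processes import this module afresh, so everything
    # except the function definitions must be guarded against re-execution.
    pool = Pool()
    print(pool.map(square_root, range(10)))
```

Which is yet another implementation detail to explain before your students can write a correct five-line script.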

These issues with multiprocessing are a classical case of a leaky abstraction: multiprocessing provides a “pool of worker processes” abstraction to the programmer, but in order to use it, the programmer has to understand the implementation. In my opinion, it would be preferable to have a less shiny API, but one which reflects the implementation restrictions. The pickle limitations might well go away one day (see PEP 3154, for example), but until this really happens, I’d prefer an API that does not suggest possibilities that don’t exist.

I have actually thought about this myself a long time ago, when designing the API of my own parallel computing framework for Python (which differs from multiprocessing in being designed for distributed-memory machines). I ended up with an API that forces all functions that implement tasks executed in parallel to be methods of a single class, or functions of a single module. My API also contains an explicit “run parallel job now” call at the end. This is certainly less elegant than the multiprocessing API, but it actually works as expected.

A rant about mail clients

Posted November 4, 2011 by Konrad Hinsen
Categories: Uncategorized

A while ago I described why I migrated my agendas from iCal to orgmode. To sum it up, my main motivation was to gain more freedom in managing my information: where iCal imposes a rigid format for events and insists on storing them in its own database, inaccessible to other programs, orgmode lets me mix agenda information with whatever else I like in plain text files. Today’s story is a similar one, but without the happy ending. I am as fed up with mail clients as I was with iCal, and for much the same reasons, but I haven’t yet found anything I could migrate to.

From an information processing point of view, an e-mail message is not very different from lots of other pieces of data. It’s a sequence of bytes respecting a specific format (defined by a handful of standards) to allow its unambiguous interpretation by various programs in the processing chain. An e-mail message can perfectly well be stored in a file, and in fact most e-mail clients permit saving a message to a file. Unfortunately, the number of e-mail clients able to open such a file and display it correctly is already much smaller. But when it comes to collections of messages, information processing freedom ends completely.

Pretty much every mail client’s point of view is that all of a user’s mail is stored in some database, and that it (the client) is free to handle this database in whatever way it likes. The user’s only access to the messages is the mail client. The one and only. The only exception is server-based mail databases handled via the IMAP protocol, where multiple clients can work with a common database. If you don’t use IMAP, you have no control over how and where your mail is stored, who has access to it, etc.

What I’d like to do is manage mail just like I manage other files. A mailbox should just be a directory containing messages, one per file. Mailboxes could be stored anywhere in the file system. Mailboxes could be shared through the file system, and backed up via the file system. They could be grouped with whatever other information in whatever way that suits me. I would double-click on a message to view it, or double-click on a mailbox directory to view a summary, sorted in the way I like it. Or I would use command-line tools to work on a message or a mailbox. I’d pick the best tool for each job, just like I do when working with any other kind of file.

Why all that isn’t possible remains a mystery to me. The technology has been around for decades. The good old Maildir format would be just fine for storing mailboxes anywhere in the file system, as would the even more venerable mbox format. But even mail clients that use mbox or Maildir internally insist that all such mailboxes must reside in a single master directory. Moreover, they won’t let me open a mailbox from outside: I have to run the mail client and work through its hierarchical presentation of mailboxes to get to my destination.
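Ironically, Python’s standard library already treats a Maildir as just a directory that can live anywhere. A minimal sketch with the `mailbox` module (the path here is a made-up temporary directory):

```python
import email.message
import mailbox
import os
import tempfile

# A Maildir is just a directory; create one wherever we like.
path = os.path.join(tempfile.mkdtemp(), 'my-mail')
md = mailbox.Maildir(path, create=True)

msg = email.message.Message()
msg['Subject'] = 'Hello'
msg.set_payload('One message, one file.')
md.add(msg)

# Each message is an ordinary file under path/new or path/cur.
print(len(md))
print([m['Subject'] for m in md])
```

The storage layer clearly isn’t the obstacle; it’s the clients that refuse to expose it.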

Before I get inundated by comments pointing out that mail client X has feature Y from the list above: Yes, I know, there are small exceptions here and there. But unless I have the complete freedom to put my mail where I want it, the isolated feature won’t do me much good. If someone knows of a mail client that has all the features I am asking for, plus the features we all expect from a modern mail client, then please do leave a comment!

EuroSciPy 2011

Posted August 30, 2011 by Konrad Hinsen
Categories: Uncategorized

Another EuroSciPy conference is over, and like last year it was very interesting. Here is my personal list of highlights and comments.

The two keynote talks were particularly inspiring. On Saturday, Marian Petre reported on her studies of how people in general and scientists in particular develop software. The first part of her presentation was about how “experts” design and implement software, the definition of an expert being someone who produces software that actually works, is finished on time, and doesn’t exceed the planned budget. The second part was about the particularities of software development in science. But perhaps the most memorable quote of the keynote was Marian’s reply to a question from the audience about how to deal with unreasonable decisions coming from technically less competent managers. She recommended learning how to manage management – a phrase that I heard repeated several times in discussions throughout the conference.

The Sunday keynote was given by Fernando Perez. As was to be expected, IPython was his number one topic and there was a lot of new stuff to show off. I won’t mention all the new features in the recently released version 0.11 because they are already discussed in detail elsewhere. What I find even more exciting is the new Web notebook interface, available only directly from the development site at github. A notebook is a trace of an interactive session that can be edited, saved, stored in a repository, or shared with others. It contains the inputs and outputs of all commands. Inputs are cells that can consist of more than one line. Outputs are by default what Python prints to the terminal, but IPython provides a mechanism for displaying specific types of objects in a special way. This makes it possible to show images (in particular plots) inline, but also to turn SymPy expressions into mathematical formulas typeset in LaTeX.

A more alarming aspect of Fernando’s keynote was his statistical analysis of contributions to the major scientific libraries of the Python universe. In summary, the central packages are maintained by a grand total of about 25 people in their spare time. This observation caused a lot of debate, centered around how to encourage more people to contribute to this fundamental work.

Among the other presentations, as usual mostly of high quality, the ones that impressed me most were Andrew Straw’s presentation of ROS, the Robot Operating System, Chris Myers’ presentation about SloppyCell, and Yann Le Du’s talk about large-scale machine learning running on a home-made GPU cluster. Not to forget the numerous posters with lots of more interesting stuff.

For the first time, EuroSciPy was complemented by domain-specific satellite meetings. I attended PyPhy, the Python in Physics meeting. Physicists are traditionally rather slow in accepting new technology, but the meeting showed that a lot of high-quality research is based on Python tools today, and that Python has also found its way into physics education at various universities.

Finally, conferences are good also because of what you learn during discussions with other participants. During EuroSciPy, I discovered a new scientific journal called Open Research Computation, which is all about software for scientific research. Scientific software developers regularly complain about the lack of visibility and recognition that their work receives from the scientific community and in particular from evaluation and grant attribution committees. A dedicated journal might just be what we need to improve the situation. I hope this will be a success.

Executable Papers

Posted June 3, 2011 by Konrad Hinsen
Categories: Programming, Science

The last two days I participated in the “Executable Papers workshop” at this year’s ICCS conference. It was not just another workshop among the many ICCS workshops. The participants had all submitted a proposal to the “Executable Paper Grand Challenge” run by Elsevier, one of the biggest scientific publishers. On the first day, the nine finalists presented their work, and on the second day, the remaining accepted proposals were presented.

The term “executable papers” stands for the expected next revolution in scientific publishing. The move from printed journals to electronic on-line journals (or a combination of both) has changed little for authors and readers. It is the libraries that have seen the largest impact, because they now do little more than pay subscription fees. Readers obtain papers as PDF files directly from the publishers’ Web sites. The one change that does matter to scientists is that most journals now propose to distribute “supplementary material” in addition to the main paper. This can in principle be any kind of file; in practice it is mostly used for additional explanations, images, and tables, i.e. to keep the main paper shorter. Occasionally there are also videos, a first step towards exploring the new possibilities opened up by electronic distribution. The step to executable papers is a much bigger one: the goal is to integrate computer-readable data and executable program code together with the text part of a paper. The goals are a richer reader experience (e.g. interactive visualizations), verifiability of results by both referees and readers (by re-running part of the computations described in the paper), and re-use of data and code in later work by the same or other authors. There is some overlap in these goals with the “reproducible research” movement, whose goal is to make computational research reproducible by providing tools and methods that permit storing a trace of everything that entered into some computational procedure (input data, program code, description of the computing environment) such that someone else (or even the original author a month later) can re-run everything and obtain the same results. The new aspect in executable papers is the packaging and distribution of everything, as well as the handling of bibliographic references.

The proposals’ variety mostly reflected the presenters’ different backgrounds. A mathematician documenting proofs obviously has different needs than an astrophysicist simulating a supernova on a supercomputer. Unfortunately this important aspect was never explicitly discussed. Most presenters did not even mention their field of work, much less what it implies in terms of data handling. This was probably due to the enormous time pressure; 15 to 20 minutes for a presentation plus demonstration of a complex tool was clearly not enough.

The proposals could roughly be grouped into three categories:

  • Web-based tools that permit the author to compose his executable paper by supplying data, code, and text, and permit the reviewer and reader to consult this material and re-run computations.
  • Systems for preserving the author’s computational environment in order to permit reviewers and readers to use the author’s software with little effort and without any security risks.
  • Semantic markup systems that make parts of the written text interpretable by a computer for various kinds of processing.

Some proposals covered two of these categories, but with a clear emphasis on one of them. For the details of each proposal, see the ICCS proceedings, which are freely available.

While it was interesting to see all the different ideas presented, my main impression of the Executable Paper Workshop is that of a missed opportunity. Having all those people who had thought long and hard about the various issues in one room for two days would have been a unique occasion to make progress towards better tools for the future. In fact, none of the solutions presented cover the needs of all the domains of computational science. They make assumptions about the nature of the data and the code that are not universally valid. One or two hours of discussion might have helped a lot to improve everyone’s tools.

The implementation of my own proposal, which addresses the questions of how to store code and data in a flexible, efficient, and future-proof way, is available here. It contains a multi-platform binary (MacOS, Linux, Windows, all on the x86 platform) and requires version 6 of the Java Runtime Environment. The source code is also included, but there is no build system at the moment (I use a collection of scripts that have my home-directory hard-coded in lots of places). There is, however, a tutorial. Feedback is welcome!