Archive for the ‘Uncategorized’ category

This blog is moving!

November 12, 2015

Welcome to the last post on this WordPress blog. I have set up a new blog for all my future writing.

The reason for the move is that the user interface at WordPress is changing all the time without ever getting better. I like to write my posts on my own computer using Emacs, rather than typing into a rudimentary editing window on a Web site. This is not completely impossible with WordPress, but more hassle than it’s worth.

My new blog is hosted on GitHub and powered by Frog, a static Web site generator that mixes my posts written as plain Markdown files with HTML templates based on the Bootstrap framework to produce the pages you can read. This setup gives me much more control over my blog, while at the same time making it easier for me to publish new posts.

The one feature that will disappear is the possibility to subscribe to my blog in order to be informed about new posts by e-mail. If you have a GitHub account, you can get the same effect by following updates to the repository that contains my blog. But the easiest way to learn about new posts is to follow me on Twitter.


The state of NumPy

September 12, 2014

The release of NumPy 1.9 a few days ago was a bit of a revelation for me. For the first time in the combined history of NumPy and its predecessor Numeric, a new release broke my own code so severely that I don’t see any obvious way to fix it, given the limited means I can dedicate to software maintenance. And that makes me wonder for which scientific uses today’s Python ecosystem can still be recommended, since the lack of means for code maintenance is a chronic and endemic problem in science.

I’ll start with a historical review, for which I am particularly well placed as one of the oldtimers in the community: I was a founding member of the Matrix-SIG, a small group of scientists who in 1995 set out to use the still young Python language for computational science, starting with the design and implementation of a module called Numeric. Back then Python was a minority language in a field dominated by Fortran. The number of users started to grow seriously from 2000, and they now form a well-recognized and respected community that spans all domains of scientific research and holds several conferences per year across the globe. The combination of technological change and the needs of new users has caused regular changes in the code base, which has grown as significantly as the user base: the first releases were small packages written and maintained by a single person (Jim Hugunin, who later became famous for Jython and IronPython), whereas today’s NumPy is a complex beast maintained by a team.

My oldest published Python packages, ScientificPython and MMTK, go back to 1997 and are still widely used. They underwent a single major code reorganization, from module collections to packages when Python 1.5 introduced the package system. Other than that, most of the changes to the code base were implementations of new features and the inevitable bug fixes. The two main dependencies of my code, NumPy and Python itself, did sometimes introduce incompatible changes (by design or as consequences of bug fixes) that required changes on my own code base, but they were surprisingly minor and never required more than about a day of work.

However, I now realize that I have simply been lucky. While Python and its standard library have indeed been very stable (not counting the transition to Python 3), NumPy has introduced incompatible changes with almost every new version over the last few years. None of them ever touched functionality that I was using, so I barely noticed them when looking at each new version’s release notes. That changed with release 1.9, which removes the compatibility layer with the old Numeric package, on which all of my code relies because of its early origins.

Backwards-incompatible changes are of course nothing exceptional in the computing world. User needs change, new ideas permit improvements, but existing APIs often prevent a clean or efficient implementation of new features or fundamental code redesigns. This is particularly true for APIs that are not the result of careful design, but of organic growth, which is the case for almost all scientific software. As a result, there is always a tension between improving a piece of software and keeping it compatible with code that depends on it. Several strategies have emerged to deal with this tension, depending on the priorities of each community. The point I want to make in this post is that NumPy has made a bad choice, for several reasons.

The NumPy attitude can be summarized as “introduce incompatible changes slowly but continuously”. Every change goes through several stages. First, the intention of an upcoming change is announced. Next, deprecation warnings are added in the code, which are printed when code relying on the soon-to-disappear feature is executed. Finally, the change becomes effective. Sometimes changes are made in several steps to ease the transition. A good example from the 1.9 release notes is this:

In NumPy 1.8, the diagonal and diag functions returned readonly copies, in NumPy 1.9 they return readonly views, and in 1.10 they will return writeable views.

The idea behind this approach to change is that client code that depends on NumPy is expected to be adapted continuously. The early warnings and the slow but regular rhythm of change help developers of client code to keep up with NumPy.
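For client code that is still maintained, the deprecation stage can serve as an early-warning test. The following is a minimal sketch that assumes only Python's standard `warnings` module; `old_api` is a hypothetical stand-in for a soon-to-disappear library feature, not an actual NumPy function.

```python
import warnings


def old_api():
    """Hypothetical library function that is scheduled for removal."""
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def relies_on_deprecated_feature(func):
    """Return True if calling func() triggers a DeprecationWarning."""
    with warnings.catch_warnings():
        # Turn deprecation warnings into exceptions inside this block only.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            func()
        except DeprecationWarning:
            return True
    return False


print(relies_on_deprecated_feature(old_api))     # True
print(relies_on_deprecated_feature(lambda: 42))  # False
```

The same effect can be had for a whole script by running it with `python -W error::DeprecationWarning`. Either way, somebody has to run the check and act on the result with each new release.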

The main problem with this attitude is that it works only under the assumption that client code is actively maintained. In scientific computing, that’s not a reasonable assumption to make. Anyone who has followed the discussions about the scientific software crisis and the lack of reproducibility in computational science should be well aware of this frequently made point. Much if not most scientific code is written by individuals or small teams for a specific study and then modified only as much as strictly required. One step up on the maintenance ladder, there is scientific code that is published and maintained by computational scientists as a side activity, without any significant means attributed to software development, usually because the work is not sufficiently valued by funding agencies. This is the category that my own libraries belong to. Of course the most visible software packages are those that are actively maintained by a sufficiently strong community, but I doubt they are representative of computational science as a whole.

A secondary problem with the “slow continuous change” philosophy is that client code becomes hard to read and understand. If you get a Python script, say as a reviewer for a submitted article, and see “import numpy”, you don’t know which version of NumPy the authors had in mind. If that script calls array.diagonal() and modifies the return value, does it expect to modify a copy or a view? The result is very different, but there is no way to tell. It is possible, even quite probable, that the code would execute fine with both NumPy 1.8 and the upcoming NumPy 1.10, but yield different results.
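A script can remove this ambiguity by stating its intent explicitly. The sketch below is one possible way to do so, not a recommendation from the NumPy release notes; it uses only `np.diagonal`, `ndarray.copy`, and `np.fill_diagonal`, and the array `a` is made up for illustration.

```python
import numpy as np

# Record the version the script was run with, so a reader can tell.
print("NumPy version:", np.__version__)

a = np.arange(9.0).reshape(3, 3)

# Explicit copy: modifying d never touches a, whether diagonal()
# returns a copy or a view in the installed NumPy version.
d = np.diagonal(a).copy()
d[0] = -1.0

# Explicit in-place update of a diagonal, without relying on a view.
b = a.copy()
np.fill_diagonal(b, 0.0)
```

Written this way, the code says whether it means to change the original array, and its result no longer depends on which behaviour of `diagonal` is installed.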

Given the importance of NumPy in the scientific Python ecosystem (the majority of scientific libraries and applications depend on it), I consider its lack of stability alarming. I would much prefer the NumPy developers to adopt the attitude to change taken by the Python language itself: accumulate ideas for incompatible changes, and apply them in a new version that is clearly labelled and announced as incompatible. Everyone in the Python community knows that there are important differences between Python 2 and Python 3. There’s a good chance that a scientist publishing a Python script will clearly say if it’s for Python 2 or Python 3, but even if not, the answer is often evident from looking at the code, because at least some of the many differences will be visible.

As for my initial question for which scientific uses today’s Python ecosystem can still be recommended, I hesitate to provide an answer. Today’s scientific Python ecosystem is not stable enough for use in small-scale science, in my opinion, although it remains an excellent choice for big communities that can somehow find the resources to maintain their code. What makes me hesitate to recommend not using Python is that there is no better alternative. The only widely used scientific programming language that can be considered stable is Fortran, but anyone who has used Python is unlikely to be willing to switch to an environment with tedious edit-compile-run cycles.

One possible solution would be a long-term-support version of the core libraries of the Python ecosystem, maintained without any functional change by a separate development team. But that development team has to be created and funded. Any volunteers?

A first experience with Open Access publishing

July 4, 2014

Most scientists have found out by now that a lot has been going wrong with scientific publishing over the years. In many fields, scientific journals are no longer fulfilling what used to be their primary role: disseminating and archiving the results of scientific studies. One of the new approaches that were developed to fix the publishing system is Open Access: the principle that published articles should be freely accessible to everyone (under conditions that vary according to which “dialect” of Open Access is used) and that the cost of the publishing procedure should be paid in some other way than subscription fees. The universe of Open Access publishing has become quite complex in itself. For those who want to know more about it, a good starting point is this book, whose electronic form is, of course, Open Access.

While I have been following the developments in Open Access publishing for a few years, I had never published any Open Access article myself. I work at the borderline of theoretical physics and biophysics, which sound like closely related fields but nevertheless have very different publishing traditions. In theoretical physics, the best-known journals are produced by non-commercial publishers, in particular scientific societies. Their prices have not exploded, nor do these publishers put pressure on libraries to subscribe to more than they want to. There is also a strong tradition of making preprints freely available, e.g. on arXiv.org. This combined model continues to work well for theoretical physics, meaning that there is little incentive to look at Open Access publishing models. However, as soon as the “bio” prefix comes into play, the main journals are commercial. Some offer a per-article Open Access option, in exchange for the authors paying a few hundred to a few thousand dollars per article. There are also pure Open Access journals covering this field (e.g. PLOS Computational Biology), whose price range is similar. On the scale of the working budget of a theoretician working in France, these publishing fees are way too high, which is why I never considered Open Access for my “applied” research.

The fact that I have recently published my first Open Access article, in the pure Open Access journal F1000Research, is almost a bit accidental. The topic of the article is the role of computation in science, with a particular emphasis on the necessity to keep scientific models distinct from software tools. I had planned to write such an article for a while, but it didn’t really fit into any of the journals I knew. The subject is computational science, but more its philosophical foundations than the technicalities that journals on computational science specialize in. The audience is scientists applying computations, which is a much larger group than the methodology specialists who subscribe to and read computational science journals. Even if some computational science journal might have accepted my article, it wouldn’t have reached most of its intended audience. A journal on the philosophy of science would have been worse, as almost no practitioner of computational science looks at this literature. Since there was no clear venue where the intended audience would have a chance of finding my article, the best option was some Open Access journal where at least the article would be accessible to everyone. Publicity through social networks could then help potentially interested readers discover it. Two obstacles remained: finding an Open Access journal with a suitable subject domain, and getting around the money problem.

At the January 2014 Community Call of the Mozilla Science Lab, I learned that F1000Research was starting a new section on “science communication”, and was waiving article processing charges for that section in 2014. This was confirmed shortly thereafter on the journal’s blog. Science communication was in fact a very good label for what I wanted to write about. And F1000Research looked like an interesting journal to test because its attitude to openness goes beyond Open Access: the review process is open as well, meaning that reviews are published with the reviewers’ names, and get their own DOI for reference. So there was my opportunity.

For those new to the Open Access world, I will give a quick overview of the submission and publishing process. Everything is handled online, through the journal’s Web site and by e-mail. Since I very much prefer writing LaTeX to using Word, I chose the option of submitting through the writeLaTeX service. The idea of writeLaTeX is that you edit your article using their Web tools, but nothing stops you from downloading the template provided by F1000Research, writing locally, and uploading the final text in the end. I thus wrote my article using my preferred tool (Emacs) and on my laptop even when I didn’t have a network connection. Once you submit your article, it is revised by the editorial staff (concerning language, style, and layout; they don’t touch the contents). Once you approve the revision, the article is published almost instantaneously on the journal Web site. You are then asked to suggest reviewers, and the journal asks some of them (I don’t know how they make their choice) to review the article. Reviews are published as they come in, and you get an e-mail alert. In addition to providing detailed comments, reviewers judge the article as “approved”, “approved with reservations” or “not approved”. As soon as two reviewers “approve”, the article status changes to “indexed”, meaning that it gets a DOI and it is listed in databases such as PubMed or Scopus. Authors can reply to reviewers (again in public), and they are encouraged to revise their article based on the reviewers’ suggestions. All versions of an article remain accessible indefinitely on the journal’s Web site, so its full history stays on record.

Overall I would judge my experience with F1000Research as very positive. The editorial staff replies rapidly and gets problems solved (in my case, technical problems with the Web site). Open review is much more reasonable than the traditional secret peer review process. No more guessing who the reviewers are in order to please them with citations with the hope of getting your revision accepted rapidly. No more lengthy letters to the editor trying to explain diplomatically that the reviewer is incompetent. With open reviewing, authors and reviewers act as equals, as it should always have been.

The only criticism I have concerns a technical point that I hope will be improved in the future. Even if you submit your original article through writeLaTeX, you have to prepare revisions using Microsoft Word: you download a Word file for the initially published version, activate “track changes” mode, make your changes, and send the file back. For someone who doesn’t have Microsoft Word, or is not familiar with its operation, this is an enormous barrier. A journal that encourages authors to revise their articles should also allow them to do so using tools that they have and are familiar with.

Will I publish in F1000Research again? I don’t expect to do so in the near future. With the exception of the science communication section, F1000Research is heavily oriented towards the life sciences, so most of my research doesn’t fit in. And then there is the money problem. Without the waiver mentioned above, I’d have had to pay 500 USD for my manuscript classified as an “opinion article”. Regular research articles cost twice as much. Compared to a theoretician’s budget, which needs to cover mostly travel, these amounts are significant. Moreover, in France’s heavily bureaucratized public research, every euro comes with strings attached that define when, where, and on what you are allowed to spend it. Project-specific research grants often do allow paying publication costs, but research outside of such projects, which is still common in the theoretical sciences, doesn’t have any specific budget to turn to. The idea of the Open Access movement is to re-orient the money currently spent on subscriptions towards paying publishing costs directly, but such decisions are made on a political and administrative level very remote from my daily work. Until they happen, it is rather unlikely that I will publish in Open Access mode again.

Bye bye Address Book, welcome BBDB

June 3, 2013

About two years ago I wrote a post about why and how I abandoned Apple’s iCal for my agenda management and moved to Emacs org-mode instead. Now I am in the process of making the second step in the same direction: I am abandoning Apple’s Address Book and starting to use the “Big Brother DataBase”, the most popular contact management system from the Emacs universe.

What started to annoy me seriously about Address Book is a bug that makes the database and its backups grow over time, even if no contacts are added, because the images for the contacts keep getting copied and never deleted under certain circumstances. I ended up having address book backups of 200 MB for just 500 contacts, which is ridiculous. A quick Web search shows that the problem has been known for years but has not yet been fixed.

When I upgraded from MacOS 10.6 to 10.7 about a year ago (I am certainly not an early adopter of new MacOS versions), I had a second reason to dislike Address Book: the user interface had been completely re-designed and become a mess in the process. Every time I use it I have to figure out again how to navigate groups and contacts.

I had been considering moving to BBDB for a while, but I hadn’t found any good solution for synchronizing contacts with my Android phone. That changed when I discovered ASynK, which does a bi-directional synchronization between a BBDB database and a Google Contacts account. That setup actually works better than anything I ever tried to synchronize Address Book with Google Contacts, so I gained more than I expected in the transition.

At first glance, it may seem weird to move from technology of the 2000’s to technology of the 1970’s. But the progress over that period in managing rather simple data such as contact information has been negligible. The big advantage of the Emacs platform over the MacOS platform is that it doesn’t try to take control over my data. A BBDB database is just a plain text file whose structure is apparent after five minutes of study, whereas an Address Book database is stored in a proprietary format. A second advantage is that the Emacs developer community fixes bugs a lot faster than Apple does. A less shiny (but perfectly usable) user interface is a small price to pay.

Integrating scientific software and datasets into the citation record

November 14, 2012

This morning I read C. Titus Brown’s blog post on how science could be so much better if scientific data and the software used to work with it were openly available for reuse. One problem he mentions, like many others have done before, is the lack of incentive for publishing anything else but standard scientific papers. What matters for a scientist’s career and for grant applications is papers, papers, papers. Any contribution that’s not in a scientific journal with a reputation and an impact factor is usually ignored, even if its real impact exceeds that of many papers that nobody really wants to read.

Ideally, published scientific data and software should be treated just like a paper: it should be citeable and it should appear in the citation databases that are used to calculate impact factors, h factors, and whatever other metrics bibliometrists come up with and evaluation committees appreciate for their ease of use.

Treating text (i.e. papers), data, and code identically also happens to be useful for making scientific publications more useful to the reader, by adding interactive visualization and exploration of procedures (such as varying parameters) to the static presentation of results in a standard paper. This idea of “executable papers” has generated a lot of interest recently, as shown by Elsevier’s Executable Paper Challenge and the Beyond the PDF workshop. For a technical description of how this can be achieved, see my ActivePapers project and/or the paper describing it. In the ActivePapers framework, a reference to code being called, or to a dataset being reused, is exactly identical to a reference to a published paper. It would then be much easier for citation databases to include all references rather than filter out the ones that are “classical” citations. And that’s a good motivation to finally treat all scientific contributions equally.

Since the executable papers idea is much easier to sell than the idea of an updated incentive system, a seemingly innocent choice in technology could end up helping to change the way scientists and research projects are evaluated.

A rant about mail clients

November 4, 2011

A while ago I described why I migrated my agendas from iCal to orgmode. To sum it up, my main motivation was to gain more freedom in managing my information: where iCal imposes a rigid format for events and insists on storing them in its own database, inaccessible to other programs, orgmode lets me mix agenda information with whatever else I like in plain text files. Today’s story is a similar one, but without the happy end. I am as much fed up with mail clients as I was with iCal, and for much the same reasons, but I haven’t yet found anything I could migrate to.

From an information processing point of view, an e-mail message is not very different from lots of other pieces of data. It’s a sequence of bytes respecting a specific format (defined by a handful of standards) to allow its unambiguous interpretation by various programs in the processing chain. An e-mail message can perfectly well be stored in a file and in fact most e-mail clients permit saving a message to a file. Unfortunately, the number of e-mail clients able to open and correctly display such a file is already much smaller. But when it comes to collections of messages, information processing freedom ends completely.
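To illustrate how little machinery a single message needs, here is a minimal sketch using the `email` package from Python's standard library. The addresses and the text of the message are invented for the example; in practice the bytes would be read from a saved message file.

```python
from email import message_from_bytes, policy

# A complete (made-up) message: header lines, a blank line, then the body.
raw = (
    b"From: alice@example.org\r\n"
    b"To: bob@example.org\r\n"
    b"Subject: Meeting notes\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"See you on Monday.\r\n"
)

# Parse the bytes into a message object and pull out two pieces.
msg = message_from_bytes(raw, policy=policy.default)
subject = msg["Subject"]
body = msg.get_content()

print(subject)
print(body.strip())
```

Any program that can read a file can do the same, with no mail client or database involved.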

Pretty much every mail client’s point of view is that all of a user’s mail is stored in some database, and that it (the client) is free to handle this database in whatever way it likes. The user’s only access to the messages is the mail client. The one and only. The only exception is server-based mail databases handled via the IMAP protocol, where multiple clients can work with a common database. If you don’t use IMAP, you have no control over how and where your mail is stored, who has access to it, etc.

What I’d like to do is manage mail just like I manage other files. A mailbox should just be a directory containing messages, one per file. Mailboxes could be stored anywhere in the file system. Mailboxes could be shared through the file system, and backed up via the file system. They could be grouped with whatever other information in whatever way that suits me. I would double-click on a message to view it, or double-click on a mailbox directory to view a summary, sorted in the way I like it. Or I would use command-line tools to work on a message or a mailbox. I’d pick the best tool for each job, just like I do when working with any other kind of file.

Why all that isn’t possible remains a mystery to me. The technology has been around for decades. The good old Maildir format would be just fine for storing mailboxes anywhere in the file system, as would the even more venerable mbox format. But even mail clients that use mbox or Maildir internally insist that all such mailboxes must reside in a single master directory. Moreover, they won’t let me open a mailbox from outside; I have to run the mail client and work through its hierarchical presentation of mailboxes to get to my destination.
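The storage side of this wish list is already covered by Python's standard `mailbox` module, as the following sketch suggests. The directory name `correspondence`, the helper `store_and_list`, and the message itself are hypothetical; a temporary directory stands in for an arbitrary project folder.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def store_and_list(mailbox_dir):
    """Add one message to a Maildir at mailbox_dir and list all subjects."""
    box = mailbox.Maildir(mailbox_dir, create=True)
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["Subject"] = "Draft of the paper"
    msg.set_content("Comments welcome.")
    box.add(msg)  # stored as its own file under mailbox_dir/new
    box.flush()
    subjects = [m["Subject"] for m in box]
    box.close()
    return subjects


# The mailbox lives wherever I choose, here next to other project files.
with tempfile.TemporaryDirectory() as project_dir:
    path = os.path.join(project_dir, "correspondence")
    subjects = store_and_list(path)
    files_in_new = len(os.listdir(os.path.join(path, "new")))
    print(subjects, files_in_new)
```

What is missing is not the format or the library support, but a full-featured mail client willing to work with mailboxes placed like this.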

Before I get inundated by comments pointing out that mail client X has feature Y from the list above: Yes, I know, there are small exceptions here and there. But unless I have the complete freedom to put my mail where I want it, the isolated feature won’t do me much good. If someone knows of a mail client that has all the features I am asking for, plus the features we all expect from a modern mail client, then please do leave a comment!

EuroSciPy 2011

August 30, 2011

Another EuroSciPy conference is over, and like last year it was very interesting. Here is my personal list of highlights and comments.

The two keynote talks were particularly inspiring. On Saturday, Marian Petre reported on her studies of how people in general and scientists in particular develop software. The first part of her presentation was about how “experts” design and implement software, the definition of an expert being someone who produces software that actually works, is finished on time, and doesn’t exceed the planned budget. The second part was about the particularities of software development in science. But perhaps the most memorable quote of the keynote was Marian’s reply to a question from the audience about how to deal with unreasonable decisions coming from technically less competent managers. She recommended learning how to manage management, a phrase that I heard repeated several times in discussions during the conference.

The Sunday keynote was given by Fernando Perez. As was to be expected, IPython was his number one topic and there was a lot of new stuff to show off. I won’t mention all the new features in the recently released version 0.11 because they are already discussed in detail elsewhere. What I find even more exciting is the new Web notebook interface, available only directly from the development site at github. A notebook is a trace of an interactive session that can be edited, saved, stored in a repository, or shared with others. It contains inputs and outputs of all commands. Inputs are cells that can consist of more than one line. Outputs are by default what Python prints to the terminal, but IPython provides a mechanism for displaying specific types of objects in a special way. This makes it possible to show images (in particular plots) inline, but also to turn SymPy expressions into mathematical formulas typeset in LaTeX.
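As a rough illustration of such a display mechanism, the sketch below uses the `_repr_latex_` hook that current IPython and Jupyter front ends look for; the mechanism available in version 0.11 may have differed in its details. The `Ratio` class is a hypothetical example type, and the code runs without IPython installed.

```python
class Ratio:
    """Hypothetical example type: a fraction that knows its LaTeX form."""

    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator

    def __repr__(self):
        # Plain-text fallback, used in an ordinary terminal.
        return f"Ratio({self.numerator}, {self.denominator})"

    def _repr_latex_(self):
        # Picked up by notebook front ends that support rich display.
        return rf"$\frac{{{self.numerator}}}{{{self.denominator}}}$"


half = Ratio(1, 2)
print(repr(half))
print(half._repr_latex_())
```

In a terminal the object shows up through `__repr__`, while a notebook that supports rich display can render the LaTeX string as a typeset fraction.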

A more alarming aspect of Fernando’s keynote was his statistical analysis of contributions to the major scientific libraries of the Python universe. In summary, the central packages are maintained by a grand total of about 25 people in their spare time. This observation caused a lot of debate, centered around how to encourage more people to contribute to this fundamental work.

Among the other presentations, as usual mostly of high quality, the ones that impressed me most were Andrew Straw’s presentation of ROS, the Robot Operating System, Chris Myers’ presentation about SloppyCell, and Yann Le Du’s talk about large-scale machine learning running on a home-made GPU cluster. Not to forget the numerous posters with lots more interesting stuff.

For the first time, EuroSciPy was complemented by domain-specific satellite meetings. I attended PyPhy, the Python in Physics meeting. Physicists are traditionally rather slow in accepting new technology, but the meeting showed that a lot of high-quality research is based on Python tools today, and that Python has also found its way into physics education at various universities.

Finally, conferences are good also because of what you learn during discussions with other participants. During EuroSciPy, I discovered a new scientific journal called Open Research Computation, which is all about software for scientific research. Scientific software developers regularly complain about the lack of visibility and recognition that their work receives from the scientific community and in particular from evaluation and grant attribution committees. A dedicated journal might just be what we need to improve the situation. I hope this will be a success.