Archive for February 2012

Teaching parallel computing in Python

February 6, 2012

Every time I teach a class on parallel computing with Python using the multiprocessing module, I wonder if multiprocessing is really mature enough that I should recommend using it. I end up deciding for it, mostly because of the lack of better alternatives. But I am not happy at all with some features of multiprocessing, which are particularly nasty for non-experts in Python. That category typically includes everyone in my classes.

To illustrate the problem, I’ll start with a simple example script, the kind of example you put on a slide to start explaining how parallel computing works:

from multiprocessing import Pool
import numpy
pool = Pool()
print pool.map(numpy.sqrt, range(100))

Do you see the two bugs in this example? Look again. No, it’s nothing trivial such as a missing comma or inverted arguments in a function call. This is code that I would actually expect to work. But it doesn’t.

Imagine your typical student typing this script and running it. Here’s what happens:

Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new

Python experts will immediately see what’s wrong: numpy.sqrt is not picklable. This is mostly an historical accident. Nothing makes it impossible or even difficult to pickle C functions such as numpy.sqrt, but since pickling was invented and implemented long before parallel computing, at a time when pickling functions was pretty pointless, so it’s not possible. Implementing it today within the framework of Python’s existing pickle protocol is unfortunately not trivial, and that’s why it hasn’t been implemented.

Now try to explain this to non-experts who have basic Python knowledge and want to do parallel computing. It doesn’t hurt of course if they learn a bit about pickling, since it also has a performance impact on parallel programs. But due to restrictions such as this one, you have to explain this right at the start, although it would be better to leave this for the “advanced topics” part.

OK, you have passed the message, and your students fix the script:

from multiprocessing import Pool
import numpy

pool = Pool()

def square_root(x):
    return numpy.sqrt(x)

print pool.map(square_root, range(100))

And then run it:

Process PoolWorker-1:
Traceback (most recent call last):
Process PoolWorker-2:
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 self.run()
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
AttributeError: 'module' object has no attribute 'square_root'
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
AttributeError: 'module' object has no attribute 'square_root'

At this point, even many Python experts would start scratching their heads. In order to understand what is going on, you have to know how multiprocessing creates its processor pools. And since the answer (on Unix systems) is “fork”, you have to have a pretty good idea of Unix process creation to see the cause of the error. Which then allows to find a trivial fix:

from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

pool = Pool()

print pool.map(square_root, range(100))

Success! It works! But… how do you explain this to your students?

To make it worse, this script works but is still not correct: it has a portability bug because it doesn’t work under Windows. So you add a section on Windows process management to the section on Unix process management. In the end, you have spent more time explaining the implementation restrictions in multiprocessing than how to use it. A great way to reinforce the popular belief that parallel computing is for experts only.

These issues with multiprocessing are a classical case of a leaky abstraction: multiprocessing provides a “pool of worker processes” abstraction to the programmer, but in order to use it, the programmer has to understand the implementation. In my opinion, it would be preferable to have a less shiny API, but one which reflects the implementation restrictions. The pickle limitations might well go away one day (see PEP 3154, for example), but until this really happens, I’d prefer an API that does not suggest possibilities that don’t exist.

I have actually thought about this myself a long time ago, when designing the API of my own parallel computing framework for Python (which differs from multiprocessing in being designed for distributed-memory machines). I ended up with an API that forces all functions that implement tasks executed in parallel to be methods of a single class, or functions of a single module. My API also contains an explicit “run parallel job now” call at the end. This is certainly less elegant than the multiprocessing API, but it actually works as expected.