Teaching parallel computing in Python

Every time I teach a class on parallel computing in Python with the multiprocessing module, I wonder whether multiprocessing is really mature enough that I should recommend using it. I always end up deciding in its favor, mostly for lack of a better alternative. But I am not at all happy with some features of multiprocessing, which are particularly nasty for Python non-experts. That category typically includes everyone in my classes.

To illustrate the problem, I’ll start with a simple example script, the kind of example you put on a slide to start explaining how parallel computing works:

from multiprocessing import Pool
import numpy
pool = Pool()
print pool.map(numpy.sqrt, range(100))

Do you see the two bugs in this example? Look again. No, it’s nothing trivial such as a missing comma or inverted arguments in a function call. This is code that I would actually expect to work. But it doesn’t.

Imagine your typical student typing this script and running it. Here’s what happens:

Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
UnpicklingError: NEWOBJ class argument has NULL tp_new

Python experts will immediately see what’s wrong: numpy.sqrt is not picklable. This is mostly a historical accident. Nothing makes it impossible or even difficult to pickle C functions such as numpy.sqrt, but pickling was invented and implemented long before parallel computing, at a time when pickling functions was pretty pointless, and so it was never supported. Retrofitting it within the framework of Python’s existing pickle protocol is unfortunately not trivial, and that’s why it hasn’t been done.
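
The rule at work here can be seen with pickle directly: a function is normally pickled by its module-qualified name, and objects that do not fit this scheme fail either at pickling time or, as in the traceback above, at unpickling time. A minimal demonstration with a plain Python function and a lambda:

import pickle

def f(x):
    return x * x

pickle.dumps(f)                # works: stored as the reference __main__.f

try:
    pickle.dumps(lambda x: x)  # fails: a lambda has no importable name
except pickle.PicklingError as exc:
    print exc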

Now try to explain this to non-experts who have basic Python knowledge and want to do parallel computing. It doesn’t hurt, of course, if they learn a bit about pickling, since it also has a performance impact on parallel programs. But because of restrictions such as this one, you have to explain pickling right at the start, although it would be better left for the “advanced topics” part.

OK, you have gotten the message across, and your students fix the script:

from multiprocessing import Pool
import numpy

pool = Pool()

def square_root(x):
    return numpy.sqrt(x)

print pool.map(square_root, range(100))

And then run it:

Process PoolWorker-1:
Traceback (most recent call last):
Process PoolWorker-2:
Traceback (most recent call last):
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
 self.run()
 self.run()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
 self._target(*self._args, **self._kwargs)
 self._target(*self._args, **self._kwargs)
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 57, in worker
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
AttributeError: 'module' object has no attribute 'square_root'
 task = get()
 File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/queues.py", line 352, in get
 return recv()
AttributeError: 'module' object has no attribute 'square_root'

At this point, even many Python experts would start scratching their heads. In order to understand what is going on, you have to know how multiprocessing creates its process pools. And since the answer (on Unix systems) is “fork”, you have to have a pretty good idea of Unix process creation to see the cause of the error. That knowledge then leads to a trivial fix:

from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

# square_root is defined before the pool is created, so the forked
# worker processes inherit its definition.
pool = Pool()

print pool.map(square_root, range(100))

Success! It works! But… how do you explain this to your students?

To make it worse, this script works but is still not correct: it has a portability bug, because it doesn’t work under Windows. Windows has no fork, so multiprocessing starts fresh Python processes that re-import the main script; without an “if __name__ == '__main__':” guard around the pool creation, every child would try to create a pool of its own. So you add a section on Windows process management to the section on Unix process management. In the end, you have spent more time explaining the implementation restrictions of multiprocessing than how to use it. A great way to reinforce the popular belief that parallel computing is for experts only.
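
The portable version, for the record, protects the pool creation with the test that the multiprocessing documentation prescribes:

from multiprocessing import Pool
import numpy

def square_root(x):
    return numpy.sqrt(x)

if __name__ == '__main__':
    # Under Windows, worker processes re-import this module instead of
    # inheriting its state through a fork; the guard keeps them from
    # creating pools of their own.
    pool = Pool()
    print pool.map(square_root, range(100))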

These issues with multiprocessing are a classic case of a leaky abstraction: multiprocessing presents a “pool of worker processes” abstraction to the programmer, but in order to use it, the programmer has to understand its implementation. In my opinion, it would be preferable to have a less shiny API, but one that reflects the implementation restrictions. The pickle limitations might well go away one day (see PEP 3154, for example), but until that really happens, I’d prefer an API that does not suggest possibilities that don’t exist.

I actually thought about this issue a long time ago, when designing the API of my own parallel computing framework for Python (which differs from multiprocessing in being designed for distributed-memory machines). I ended up with an API that forces all functions implementing parallel tasks to be methods of a single class, or functions of a single module. The API also contains an explicit “run parallel job now” call at the end. This is certainly less elegant than the multiprocessing API, but it actually works as expected.
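
To give an idea of the shape of such an interface, here is a purely illustrative sketch with invented names; it is not the actual API of my framework, and it executes the tasks serially rather than distributing them. The point is the structure: all task definitions are methods of one class, and nothing runs before the explicit run() call.

import numpy

class ParallelJob(object):

    def __init__(self):
        self._tasks = []

    def map(self, method_name, data):
        # Only records the request; no computation happens yet.
        self._tasks.append((method_name, data))

    def run(self):
        # A real implementation would create worker processes here,
        # after all definitions exist. This sketch runs serially.
        return [[getattr(self, name)(x) for x in data]
                for name, data in self._tasks]

class SquareRootJob(ParallelJob):

    def square_root(self, x):
        return numpy.sqrt(x)

job = SquareRootJob()
job.map('square_root', range(100))
print job.run()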


8 Comments on “Teaching parallel computing in Python”


  1. […] Hinsen recently wrote a blog post that explains why teaching parallel computing with Python is hard. To make a long story short, Python’s multiprocessing module can fail on simple problems in a […]

  2. Hari Koduvely Says:

    Python seems to be lagging behind in parallel (distributed) computing using the MapReduce framework. Though in principle it exists even on Amazon EC2, I don’t see many developers using it, unlike Mahout, which is written in Java. Any thoughts, Konrad?

    • khinsen Says:

      Distributed parallel computing is a different category. Actually, it’s at least two categories: one contains the tightly coupled and rather homogeneous computations typical of scientific computing (think of climate models or biomolecular simulations), which involve lots of computation but little input data; the other contains MapReduce-style massive data analysis. Python is doing rather well in the first category, even though mostly in the role of steering and supervising simulations whose core is written in C or Fortran. As you noticed, Python is not very present in the data-crunching world. I don’t see any obvious technical reason, so it’s perhaps just a matter of the data-crunching community coming predominantly from a Java background.


  3. Very illuminating post. I was completely puzzled by the last change of behaviour. My knowledge of forks is obviously insufficient, because I don’t see how the order of creating the pool object and the function makes a difference…
    Could you point me to a resource that explains this?

    • khinsen Says:

      A Unix ‘fork’ creates a complete copy of the running process. After the fork, there are two processes that have the same memory contents and continue execution at the same point in the code.

      When you create an instance of Pool(), the multiprocessing module creates additional copies of Python using fork. All copies share the state that the original process had when it reached the pool creation point. Only the master process executes the code after Pool(), whereas the slave processes enter a loop waiting for tasks to execute.

      In the original example script, the function square_root is created after the pool creation, meaning that it exists only in the master process. In the corrected version, it is created before pool creation, and thus exists in all the slave processes as well.

      This means that you have to consider pool creation as a clear borderline in your program: all definitions used in the computation must happen before, whereas the creation of the computational tasks happens later.
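
      Here is a minimal illustration of fork semantics (Unix only; nothing multiprocessing-specific, just the standard os module):

      import os

      x = "defined before the fork"

      pid = os.fork()   # from here on, two identical processes run
      if pid == 0:      # child: sees a copy of everything defined above
          print "child sees:", x
          os._exit(0)
      else:             # parent: waits for the child to finish
          os.wait()
          print "parent sees:", x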

