Parallel Python in classes…now you are in a pickle

In the past, I discussed how to create a python script which runs your calculations in parallel.  Using the multiprocessing library, you can circumvent the GIL and employing the async version of the multiprocessing functions, calculations are even performed in parallel. This works quite well, however, when using this within a python class you may run into some unexpected behaviour and errors due to the pickling performed by the multiprocessing library.

For example, if the doOneRun function is a class function defined as

class MyClass:
...
    def doOneRun(self, id:int):
       return id**3
...

and you perform some parallel calculation in another function of your class as

class MyClass:
...
    def ParallelF(self, NRuns:int):
       import multiprocessing as mp

       nproc=10
       pool=mp.Pool(processes=nprocs) 
       drones=[pool.apply_async(self.doOneRun, args=(nr,)) for nr in range(NRuns)] 

       for drone in drones: 
           Results.collectData(drone.get()) 
       pool.close() 
       pool.join() 
       
...

you may run into a runtime error complaining that a function totally unrelated to the parallel work (or even to the class itself) can not be pickled. 😯

So what is going on? In the above setup, you would expect the pool.apply_async function to take just a function pointer to the doOneRun function. However, as it is provided by a the call self.doOneRun, the pool-function grabs the entire class and everything it contains, and tries to pickle it to distribute it to all the processes.  In addition to the fact that such an approach is hugely inefficient, it has the side-effect that any part associated to your class needs to be pickleable, even if it is a class-function of a class used to generate an object which is just a property of the MyClass Class above.

So both for reasons of efficiency and to avoid such side-effects, it is best to make the doOneRun function independent of a class, and even placing it outside the class.

def doOneRun(id:int):
    return id**3
  
class MyClass:
...
    def ParallelF(self, NRuns:int):
       import multiprocessing as mp

       nproc=10
       pool=mp.Pool(processes=nprocs) 
       drones=[pool.apply_async(doOneRun, args=nr) for nr in range(NRuns)] 

       for drone in drones: 
           Results.collectData(drone.get()) 
       pool.close() 
       pool.join() 
       
...

This way you avoid pickling the entire class, reducing initialization times of the processes and the  unnecessary communication-overhead between processes. As a bonus, you also reduce the risk of unexpected crashes unrelated to the calculation performed.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.