As everybody knows, the current implementation of CPython is pretty much incapable of doing any serious multithreaded computation, because of the GIL. Python's most current solution to this problem is using separate processes instead of threads - in most of the cases, this actually suffices.
A problem I'm trying to solve at work now is embarrassingly parallel. It was natural to try to make use of my quad-core Xeon desktop to decrease my 1-hr runtime down to 15 minutes; however, the programs use inputs from a vast (4+GB) numpy array which is common to all processes. Sending an array of this size through IPC is simply ridiculous. Yes, there are ways to 'share' a single copy of numpy array across processes - but it required me to make drastic changes to the program's structure. I tried numpy.memmap which is essentially an array using memory-mapped files. Single-thread performance was degraded 10x... dude, I was just trying to achieve 3+x speedup. Is this too much???
What I actually ended up doing is writing performance-critical piece of code in C++, using scipy.weave (it basically lets us to inline C++ code in Python, compiling the C++ code as a Python package). Yes, it's not pleasant, but when the C++ code snippet is small, it's doable. Yes, it's not a game-changer - it doesn't allow true multithreading. However: weave is compiled as a C Python module - and they can release the GIL if a macro is used: Py_BEGIN_ALLOW_THREADS.
So, simply adding the two macros at the beginning and the end of the inlined code can make it (almost) truly multithreading.
I did a simple experiment to verify this.
I wrote a program that keeps an array of 1 million integers in memory. I wrote a function that given n, scan the array to find all the integers which are less than or equal to n, and return the sum of number of divisors of such integers. (Yes, it's contrived, but it closely resembles what I'm doing right now.)
For example, suppose the array is [4,7,6,8,11,9] and n is 7. There are 3 numbers less than or equal to 7; and they respectively have 3, 2, 4 divisors. The function will return 3+2+4 = 9.
So, the code follows.
lang:py
import numpy as np
from scipy import weave
from random import randint
from handythread import *
import time
HT = 100*1000 # hundread thousand
inputs = [HT*i for i in [1,4,1,2,4,5,1,1,2,3,4,1,2]]
m = 1000000
common = np.array([randint(0,10*HT) for i in xrange(m)])
code = """
int ret = 0;
for(int i = 0; i < m; ++i)
{
if(common[i] <= n)
{
++ret; // 1
if(common[i] > 1) ++ret; // common[i]
for(int j = 2; j*j <= common[i]; ++j)
{
if(common[i] % j == 0)
{
++ret;
if(j*j < common[i])
++ret;
}
}
}
}
return_val = ret;
"""
def count_divisors_under(n):
return weave.inline(code, ["n", "common", "m"])
def count_divisors_under_release(n):
return weave.inline("Py_BEGIN_ALLOW_THREADS\n" + code + "Py_END_ALLOW_THREADS\n", ["n", "common", "m"])
st = time.time()
print map(count_divisors_under, inputs)
print "Singlethread %.2lf" % (time.time() - st)
st = time.time()
print parallel_map(count_divisors_under, inputs, threads=4)
print "Naive Multithread %.2lf" % (time.time() - st)
st = time.time()
print parallel_map(count_divisors_under_release, inputs, threads=4)
print "ReleaseGIL Multithread %.2lf" % (time.time() - st)
I used handythread library from this cookbook recipe for easier parallelization. It's nothing fancy; it's just a less than a hundred lines of code.
Anyway, the results:
[1177114, 5228072, 1177114, .... 2475507]
Singlethread 4.78
[1177114, 5228072, 1177114, .... 2475507]
Naive Multithread 4.77
[1177114, 5228072, 1177114, .... 2475507]
ReleaseGIL Multithread 1.54
Hooray, 3x speedup! My machine is an i7, which is a quad core. (Don't ask me why it's not 4x... I don't know.)


