Thursday, December 17, 2009

Python: the GIL, unladen swallow, reference counting, and atomic operations.

The unladen swallow project comments on the global interpreter lock and suggests, long term, dropping reference counting in favour of GC. http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Global_Interpreter_Lock. They are also now favouring an incremental removal of the GIL from python.

However, conflating removal of the GIL with a requirement to switch to garbage collection is wrong. This mini-essay will explain why reference counting is good for python, and why you do not need to drop it to support multiple CPU systems. Then there will be descriptions of other improvements that can be made to python for multiple CPU systems.

Reference counting has a number of advantages over various GC schemes; see http://en.wikipedia.org/wiki/Reference_counting#Advantages_and_disadvantages. As a python programmer, optimizing for the programmer is what is most important... and reference counting makes things easier for the programmer. It makes programs much more deterministic, meaning that reference counting code is much easier to reason about in our human heads.
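For example, on CPython an object's __del__ runs the moment its reference count hits zero, so cleanup happens at a predictable point. (This is a CPython implementation detail, not a language guarantee; a tracing GC gives no such timing promise.) A tiny sketch:

```python
closed_log = []

class Resource:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        # With reference counting this runs as soon as the last
        # reference disappears -- not at some later GC pass.
        closed_log.append(self.name)

r = Resource("db-handle")
del r                    # refcount hits zero: __del__ runs immediately
print(closed_log)        # ['db-handle'] on CPython
```

This determinism is exactly what makes refcounted code easy to think about: you know when destructors fire just by reading the code.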

Reference counting can be made thread safe by using atomic operations. Here are two mainstream open source object systems that have implemented atomic reference counting in real world code:

Qt: from Qt 4, released June 2005:
http://doc.trolltech.com/qq/qq14-threading.html#atomicreferencecounting

GLib: released August 2005.
http://mail.gnome.org/archives/gnome-announce-list/2005-August/msg00048.html

It is possible to make reference counting thread safe in a performant, cross platform manner. Arguing against reference counting just because of the GIL and thread safety is a bad argument - as demonstrated by at least these two highly portable, highly performant, open source pieces of code. Not just research papers, but real life code that has been in use for a number of years now.

There has been a lot of research, and a lot of real life code, that improves reference counting - which python could take advantage of rather than moving to a GC.

Simple Direct Media Layer also has a cross platform atomic operations API in SDL 1.3.

Locality of memory matters much more than the choice between GC and reference counting. That is, when operating on memory, what matters is whether it is in the caches (L1, L2, L3, etc.), and its locality to particular cpus/cores. This is why modern memory allocators have a per thread heap... so that memory allocated by a thread stays close to that thread. See NUMA for details. NUMA factors already play a BIG part in modern cpus... even on single cpu systems. Designing for cache friendliness can give big speedups.

Reference counting can be a problem for cache friendly systems... but it doesn't need to be. There are techniques for reducing, or removing NUMA issues with reference counting memory management.

Reducing the memory usage of objects will also make for faster threaded python programs.
From the pypy project: "The total amount of RAM used on a 32-bit[ED:pypy] Linux is 247 MB, completing in 10.3 seconds. On CPython, it consumes 684 MB and takes 89 seconds to complete... This nicely shows that our GCs are much faster at allocating objects, and that our objects can be much smaller than CPython's."
More details here: http://morepypy.blogspot.com/2009/10/gc-improvements.html. As you can see reducing the memory size of objects can have a massive effect on runtime.
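One easy way to shrink CPython objects today is __slots__, which drops the per-instance __dict__ entirely. A small sketch (exact byte counts vary by python version and platform, so the numbers below are only illustrative):

```python
import sys

class Plain:
    def __init__(self, x, y):
        self.x, self.y = x, y

class Slotted:
    __slots__ = ("x", "y")       # no per-instance __dict__ is created
    def __init__(self, x, y):
        self.x, self.y = x, y

p, s = Plain(1, 2), Slotted(1, 2)

# The plain instance carries a whole dict; the slotted one does not.
plain_size = sys.getsizeof(p) + sys.getsizeof(p.__dict__)
slot_size = sys.getsizeof(s)
print(plain_size, slot_size)
```

With millions of small objects, that per-instance dict is where much of CPython's memory bloat comes from.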

Another memory optimization for python is reducing the number of allocations done at run time. Many real time systems have a policy of zero run time memory allocations. For python this is hard, but there are probably a number of places where memory allocations could be removed or reduced. The pypy project worked on reducing runtime memory allocations and found a number of them, which sped things up.

Intelligent sizing of the stack is another memory optimization python could make. Each new thread gets its own stack, so with a very large stack size python can run fewer threads than is optimal. An optimization could be to start with a very small stack, and resize it if necessary. See http://docs.python.org/library/thread.html for the thread.stack_size([size]) function.
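A sketch of using that API (via the threading module) to run threads with a deliberately small stack. The minimum size and granularity are platform dependent; the 256 KiB figure here is just an illustration:

```python
import threading

# Threads only need a big stack if they recurse deeply; a smaller
# stack lets one process host many more threads.
old = threading.stack_size(256 * 1024)   # applies to threads started after this

results = []
t = threading.Thread(target=lambda: results.append(sum(range(1000))))
t.start()
t.join()

threading.stack_size(old)                # restore the previous setting
print(results[0])
```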

Also, speeding up GIL acquisition/release will be good for those parts of python that already release the GIL - especially on single CPU systems. Currently grabbing the GIL is a fairly costly operation. Using a method like the atomic operations Qt uses would be MUCH more efficient.

An incremental technique of giant-lock removal was taken by kernels like linux and the *bsds (eg freebsd, dragonfly bsd), and by libraries like glib and qt. The idea is that you add more fine grained locks to individual subsystems, then slowly remove the global lock from parts incrementally... and you can test each step. Removing the GIL all at once is silly... removing it piece by piece is a much better idea. I think the unladen swallow project is now preferring an incremental approach to GIL removal... which is nice.

A plan which approaches GIL removal incrementally is much more sane overall.

Designing APIs to work better for threaded programs is another area where python can gain performance on multi cpu systems. For example, by finding calls which block and making sure they release the GIL. Also, batching APIs not only avoid the high cost of python function calls, but also make that code embarrassingly easy to parallelise.

One API that could easily be turned into a batching API is the import command. When you import 10 modules at the top of your program, much of that work could be shared or done in parallel. Likewise, libraries could be designed so that they can easily make use of such batching.

Documenting which calls release the GIL, or have fine grained locking/lock free programming can also help programmers more easily take advantage of multiple cpus.

Documenting which objects are safe to pickle will help people use the multiprocessing module, which only works with objects that can be pickled. Likewise, forking lets you use multiple processes on data that can not be pickled, since when you fork, the OS does the memory copying for you. The advantage over the multiprocessing library is that with forking only the results need to be picklable... not the inputs.
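A minimal sketch of that forking approach, assuming a Unix platform: the child inherits the data via the fork, does the work, and pickles only the result back through a pipe. The helper name work_in_child is made up for illustration:

```python
import os
import pickle

def work_in_child(data, compute):
    """Fork a child that runs compute(data); only the *result*
    has to be picklable, not the inputs. Unix only."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child process
        os.close(r)
        result = compute(data)
        with os.fdopen(w, "wb") as f:
            pickle.dump(result, f)        # ship only the result back
        os._exit(0)
    os.close(w)                           # parent process
    with os.fdopen(r, "rb") as f:
        result = pickle.load(f)
    os.waitpid(pid, 0)
    return result

# The input holds a lambda, which pickle refuses -- but that's fine,
# because only the integer result crosses the process boundary.
big_unpicklable = {"f": lambda x: x * 2, "numbers": list(range(100))}
total = work_in_child(big_unpicklable, lambda d: sum(d["numbers"]))
print(total)
```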

Using madvise calls wisely with mmap can also mark data as read only... making multi processing much more efficient. Used well, madvise can mean much better copy on write behaviour. Improving copy on write behaviour is good for multiple processes... not just multiple threads. Likewise, telling the OS what the program needs in other ways will also help. For example, if you know your program is going to use 1GB of data... then tell the OS that at the start. If you know it will eventually need 50 threads... then tell it at the start. If you know it needs a very small stack size... then tell it at the start.
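As a sketch of the idea: python later grew a direct interface to this as mmap.madvise (python 3.8+, Unix only - at the time of writing you would make the call from C). The MADV_* hints below assume a Linux system:

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "madvise_example.bin")
with open(path, "wb") as f:
    f.write(b"x" * mmap.PAGESIZE * 4)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    mm.madvise(mmap.MADV_SEQUENTIAL)   # we will read straight through
    mm.madvise(mmap.MADV_WILLNEED)     # ask the kernel to fault pages in early
    data = mm.read()
    mm.close()

os.remove(path)
print(len(data))
```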

Modern CPUs have memory prefetching instructions, which let you tell the CPU which memory you are likely to access next. Likewise, telling python, and the CPU, that you only want to read - or only want to write - some memory can help a lot with performance.

Apart from memory, there are many other OS level tweaks that can make your multi process python code go quicker. For example, setting process/thread priorities. You can even ask for your code to run with real time priority if it needs it. This can reduce the latency of processing and also speed up the throughput - easily gaining a 20% increase in throughput, or a very big drop in latency. Conversely, if you do not care how fast your python program goes, then telling the OS so will help the other processes in the system go faster.
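For example, on Unix you can lower your own process priority with os.nice (raising it, i.e. a negative increment, normally needs elevated privileges):

```python
import os

current = os.nice(0)     # an increment of 0 just reads the current niceness
lowered = os.nice(1)     # be one step "nicer" to the other processes
print(current, lowered)
```

Real time priorities and per-thread scheduling need platform specific calls (e.g. sched_setscheduler on linux), which python does not wrap directly.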

Improving the performance of built in primitives like the Queue/Mutex can also help python a lot. I've found that avoiding python's thread safe Queue can give you pretty good performance increases - instead, just use a list and rely on its python level atomic operations (documenting which python primitives are atomic would also help programmers a lot). A multi thread safe event queue can be quite nice, especially if you can pass things into the queue at the C level... avoiding the GIL entirely. There are a number of these available for python already (eg pygame.fastevent).
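For instance, list.append is effectively atomic under the GIL, so a simple multi-producer pattern needs no explicit lock. (This is a CPython implementation detail, not a language guarantee.)

```python
import threading

shared = []   # no lock: each append is a single atomic operation under the GIL

def producer(n):
    for i in range(n):
        shared.append(i)

threads = [threading.Thread(target=producer, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))   # 40000: no appends were lost
```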

In fact, for python 3.2 the GIL has been reworked: http://mail.python.org/pipermail/python-dev/2009-October/093321.html, and the RLock primitive has been rewritten in C. This 'newgil' work has already been put into the py3k trunk. However I think there is still quite a lot of room for improvement in the python GIL.

Another API design improvement is the set of optimizations available for worker queues. One is sending similar tasks and data to the same cpus/cores, to take advantage of NUMA/cache effects (eg, always sending the url "/dojob1" to the same cpu).
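A tiny sketch of such sticky dispatch: hash the task key to pick a worker, so the same job always lands on the same cpu/core and finds its data warm in cache. pick_worker is a made-up helper name; crc32 is used here just as a stable hash:

```python
import zlib

def pick_worker(key, n_workers):
    # Same key -> same worker, every time, on every run.
    return zlib.crc32(key.encode()) % n_workers

a = pick_worker("/dojob1", 4)
b = pick_worker("/dojob1", 4)
print(a == b)   # True: "/dojob1" is always routed to the same worker
```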

Another worker queue optimisation is to reduce contention on the worker queue by sending batches of tasks/data to the worker threads/processes. Rather than one task/piece of data at a time... divide the work up into a few large pieces and send those to the worker queues. eg, sending a list of 1000 images to each of the cpus to process, rather than having each cpu ask for a single image. Dividing 16000 images into 16 batches of 1000 each requires only 16 interactions with the worker queue. Having workers take one image at a time means 16000 interactions - that is 1000 times more opportunities for GIL locking and thread contention.
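The batching idea can be sketched like this; the doubling step stands in for real image processing:

```python
import queue
import threading

def chunk(seq, n_parts):
    """Split seq into n_parts roughly equal batches."""
    k = (len(seq) + n_parts - 1) // n_parts
    return [seq[i:i + k] for i in range(0, len(seq), k)]

tasks = list(range(16_000))
work_q = queue.Queue()
for batch in chunk(tasks, 16):       # 16 queue interactions, not 16000
    work_q.put(batch)

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            batch = work_q.get_nowait()
        except queue.Empty:
            return
        processed = [x * 2 for x in batch]   # stand-in for real work
        with lock:                           # one lock round-trip per *batch*
            results.extend(processed)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))
```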

Many operations in the python stdlib could be made to use multiple threads/processes internally, at the library level. Then people wouldn't need to know whether the library is using async IO, threads, processes, or alien magic. eg, a url downloader might be given a list of URLs to GET... and return the responses. Under the covers it could use an async library, or even multiple threads, to get the job done. This library level parallelism is possible if the APIs have a batching design... that is, they take multiple inputs and produce multiple outputs. Then it is possible for library authors to try and choose the best technique. Taking single inputs and giving single outputs does not let library authors implement parallel algorithms as easily.
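A sketch of what such a batching API might look like. fetch_batch and fake_get are hypothetical names; a real implementation could swap the thread pool for async IO (or processes) without changing the interface:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_get(url):
    # Stand-in for a real HTTP GET, so the sketch runs offline.
    return "response for %s" % url

def fetch_batch(urls, max_workers=8):
    """Batching API: all URLs in, all responses out. How the work is
    parallelised is the library's business, not the caller's."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fake_get, urls)))

pages = fetch_batch(["http://a.example/", "http://b.example/"])
print(sorted(pages))
```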


In conclusion, there is a lot we can do to improve python on multi cpu systems without removing the GIL entirely. An incremental approach to removing it from parts of python is a good idea as shown by other projects doing the same. Removing reference counting from python is not needed, as there are good, proven ways of using reference counting in multi CPU systems.

6 comments:

DanielDittmar said...

The "high performance" of the mentioned thread safe reference counting schemes is probably high only in comparison with other synchronization schemes, not with the simple increment / decrement. Because of read/write memory barriers, performance will take quite a hit.

ferringb said...

I think it's a bit of a stretch to say reference counting is easier on devs- it's one of the most common gotchas in writing an extension (does this function steal a reference, or return one? etc).

As for making it easier for devs at the python level (rather than c), the only potential benefit I could see there is in deterministic destruction of an object when you immediately stop using it (realistically in python, that being when the frame goes away), which was never guaranteed (hence the addition of 'with').

I'd be curious about the perf. hit for using atomic incref/decref vs non atomic on a sidenote- across all platforms specifically (I'd presume it's Fast Enough on the mainline hardware).

As for parallelizing import, doubt that's an option. Import has side effects, thus applying a map to it isn't much of an option. Consider importing a module that monkeypatches another module, and you'll see where this gets problematic. Fun notion though. Yes, said modules have thread safety issues of their own, but what you're proposing is going parallel even if there is one thread of execution- it would break any such monkeypatching folks have deployed in single thread executions.

As for multiprocessing/forking, punting refcounting is actually a bit beneficial there- you tout COW, but ignore that the refcounting is done on the object itself. End result, pretty quickly the child python process grows its own copies of shared data just by passing it around. The literal PyObject definition kind of fights your proposal there and argues for GC (or pushing the refcounts out of PyObject if the ~16% overhead can be addressed per recent u-s discussions).

The suggestion of madvise usage is also ignoring the syscall overhead, and that there actually aren't that many objects that would benefit from a RO; code objects possibly come to mind. Your normal string technically is mutable if you own the ref... cpython itself abuses this for optimizing `x += 'asdf'` for example.

Personally I'd suggest trying out some of your ideas- suspect they'll be harder than you might think or may not work as well as one might expect. I'd point out in addition that a lot of the bits you're proposing (parallelized fetcher) don't need to live in stdlib- realistically they should be proven useful outside of stdlib before being allowed in anyways.

Michael Foord said...

Interesting you quote the PyPy performance figures. One of the big areas where they are faster than CPython even without the JIT is code that allocates and deallocates a lot of objects.

The main reason they are faster? They don't use reference counting for garbage collection...

Antoine P said...

Simple Direct Media Layer also has a cross platform atomic operations API in SDL 1.3.

See here if you think atomic instructions have no performance impact:
http://mail.python.org/pipermail/python-ideas/2009-November/006599.html

Reference counting is everywhere in the interpreter. Even if you add two integers ("1+1") or simply call a function, you will be touching reference counts. They are in almost all critical paths. Making them fast enough while thread-safe is very difficult (impossible?).

Now PyPy has optimised the memory layout of their objects a lot, and it is very nice, but the benchmark which was used in the quoted snippet is really an artificial microbenchmark. I agree it would be good to improve Python's memory consumption, though, but it's not easy to do so.

Antoine P said...

As for the performance of the Queue object, most of it is probably due to it being written in Python (while the built-in list object is in C).

As for the Queue releasing the GIL when you call get(), it's obviously a feature, since it's meant to wait until something is available in queue. True, if you know you'll never have to wait, using a list (or a deque, or whatever written in C) will be faster.

I'm not sure why you think grabbing the GIL is costly. Sure, it's not totally cheap, but it's also not done that often (at least in 3.2). Also, modern OSes (which might exclude Mac OS X) have optimized their concurrency primitives. For example, a normal uncontended lock should be very fast.

kayschluehr said...

As for parallelizing import, doubt that's an option. Import has side effects, thus applying a map to it isn't much of an option. Consider importing a module that monkeypatches another module, and you'll see where this gets problematic.

At worst this degenerates to linear time import and deadlocks if dependency resolution is not careful enough.