Friday, December 25, 2009

python3.2 (svn), python2.7 (svn), and the new GIL, with cherrypy as a test.

I've done some quick benchmarks of the unreleased python3.2 running the unreleased cherrypy webserver for python 3, and also of the unreleased python2.7.

Here are the results...

python3.1

Client Thread Report (1000 requests, 14 byte response body, 10 server threads):

threads | Completed | Failed | req/sec | msec/req | KB/sec |
25 | 1000.0 | 0.0 | 533.32 | 1.875 | 93.75 |
50 | 1000.0 | 0.0 | 525.86 | 1.902 | 92.69 |
100 | 1000.0 | 0.0 | 522.96 | 1.912 | 92.1 |
200 | 1000.0 | 0.0 | 523.83 | 1.909 | 92.25 |
400 | 1000.0 | 0.0 | 506.92 | 1.973 | 89.27 |
Average | 1000.0 | 0.0 | 522.578 | 1.9142 | 92.012 |

python3.2

Client Thread Report (1000 requests, 14 byte response body, 10 server threads):

threads | Completed | Failed | req/sec | msec/req | KB/sec |
25 | 1000.0 | 0.0 | 555.72 | 1.799 | 97.78 |
50 | 1000.0 | 0.0 | 558.86 | 1.789 | 98.52 |
100 | 1000.0 | 0.0 | 552.87 | 1.809 | 97.45 |
200 | 1000.0 | 0.0 | 546.09 | 1.831 | 96.27 |
400 | 1000.0 | 0.0 | 548.64 | 1.823 | 96.53 |
Average | 1000.0 | 0.0 | 552.436 | 1.8102 | 97.31 |

So here you can see a small improvement at 400 threads with the new GIL in python3.2.

Python 3.2 threads seem more scalable in this benchmark than python 3.1's, and also faster overall (20-40 requests per second faster).

python2.6

Python2.6 still beats python3.2 in the cherrypy benchmarks. These benchmarks are network IO heavy and string processing heavy, with heavy python thread usage. So an ok benchmark for the new GIL work, in my opinion.

Note that both python3.2, and cherrypy for python 3 are not released.
Client Thread Report (1000 requests, 14 byte response body, 10 server threads):

threads | Completed | Failed | req/sec | msec/req | KB/sec |
25 | 1000.0 | 0.0 | 660.54 | 1.514 | 116.43 |
50 | 1000.0 | 0.0 | 671.01 | 1.49 | 118.28 |
100 | 1000.0 | 0.0 | 663.84 | 1.506 | 117.12 |
200 | 1000.0 | 0.0 | 664.85 | 1.504 | 117.19 |
400 | 1000.0 | 0.0 | 651.9 | 1.534 | 114.8 |
Average | 1000.0 | 0.0 | 662.428 | 1.5096 | 116.764 |

python2.7

Python2.7 is faster still.
Client Thread Report (1000 requests, 14 byte response body, 10 server threads):

threads | Completed | Failed | req/sec | msec/req | KB/sec |
25 | 1000.0 | 0.0 | 695.33 | 1.438 | 122.79 |
50 | 1000.0 | 0.0 | 684.6 | 1.461 | 121.12 |
100 | 1000.0 | 0.0 | 688.99 | 1.451 | 121.67 |
200 | 1000.0 | 0.0 | 682.94 | 1.464 | 120.49 |
400 | 1000.0 | 0.0 | 641.01 | 1.56 | 112.78 |
Average | 1000.0 | 0.0 | 678.574 | 1.4748 | 119.77 |

It's also worth noting that only 100-120% (out of 200%) of the two cpu cores is used during the run of each version of python tested. Even though the GIL is released for many parts of python and cherrypy, and even though the benchmark is very IO heavy, both cores are not fully loaded. It's generally a good thing for a webserver to not load up the CPUs - but in a benchmark you want them to go full speed.

Also, during the tests the benchmarking tool 'ab' was run on the same machine, which skews results. However, ab seems to only use 1% of the CPU during the tests (according to top).
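For reference, the shape of the benchmark (many client threads hammering a small threaded server that returns a 14 byte body) can be approximated with nothing but the modern python 3 stdlib. This is a hypothetical sketch, not the actual cherrypy/ab setup used above:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BODY = b"Hello, world!\n"  # 14 bytes, matching the benchmark's response size

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):
        pass  # silence per-request logging so it doesn't skew timing

# One thread per connection, bound to an ephemeral port.
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

completed = []  # list.append is thread-safe under the GIL

def client(requests):
    for _ in range(requests):
        with urllib.request.urlopen(url) as response:
            completed.append(len(response.read()))

# 5 client threads x 10 requests each, all on the same machine (as with ab).
clients = [threading.Thread(target=client, args=(10,)) for _ in range(5)]
for t in clients:
    t.start()
for t in clients:
    t.join()
server.shutdown()

print(len(completed), completed[0])  # 50 completed requests, 14 byte bodies
```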

Test machine: 1.6ghz core 2 duo, ubuntu (save the) karmic koala, 32bit version.

Shrinking the stack to save some memory.


So how do we reduce the memory usage of threaded programs? Reducing the stack size is one idea. Since threads do not share a stack with each other, each thread gets its own nice big stack. On ubuntu karmic koala the default seems to be 8MB or so.

Note, this is dangerous and can segfault your interpreter if you do not have enough stack space for some operations. So make sure you test things well before doing this.


In 2.6 you do:

import thread
thread.stack_size(32768 * 2)  # 64KB per thread

Whereas in python 3.x you do:

import threading
threading.stack_size(32768)  # 32KB per thread
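The two snippets above differ only in the module name. `threading.stack_size()` raises `ValueError` if you ask for less than the platform minimum, so a hypothetical helper (not from the original scripts) can request a small stack and fall back to the platform default if the size is rejected:

```python
import threading

def set_small_stack(size=256 * 1024):
    """Request a smaller per-thread stack for threads created after this
    call; fall back to the platform default if the size is rejected."""
    try:
        threading.stack_size(size)  # ValueError if below the platform minimum
    except ValueError:
        threading.stack_size(0)     # 0 restores the platform default
    return threading.stack_size()

chosen = set_small_stack()

# Sanity-check that a thread still runs with the reduced stack.
result = []
t = threading.Thread(target=lambda: result.append(sum(range(100))))
t.start()
t.join()
print(chosen, result)
```

As the warning above says, test this well: a deeply recursive workload that fits in the default 8MB stack can segfault with a 64KB one.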

3.2
----
(default stack) * 110MB virt, 10MB resident
(adjusted stack) * 15MB virt, 10MB resident

2.6
----
(default stack) * 107MB virt, 8MB resident
(adjusted stack) * 12MB virt, 8MB resident

Note, the python3.x versions end up using 2 gigabytes of memory during the benchmark, compared to a peak of 600MB or so with python2.6. This suggests some sort of memory leak in either python or cherrypy (note, both are pre-release).

It seems python3.2 worked with a smaller stack than python2.6: python2.6 segfaulted with 32768, but worked with a stack size of 32768*2. However python3.x used more resident memory and virtual memory.
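A Linux-only snippet along these lines can read the same counters (VmSize is roughly top's virt, VmRSS its resident) from inside the process. This is a hypothetical helper for reproducing the comparison, not how the numbers above were collected:

```python
def mem_usage():
    """Return this process's VmSize and VmRSS in kB (Linux only),
    parsed from /proc/self/status."""
    sizes = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                sizes[key] = int(value.split()[0])  # value is "<n> kB"
    return sizes

print(mem_usage())
```

Calling it before and after adjusting the stack size and spawning the server threads would show the virt drop directly.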

update:
I've uploaded the profile (python cProfile) results and scripts here: http://rene.f0o.com/~rene/stuff/cherrypy_benchmarks. Strangely, running the benchmark under cProfile with python2.7 did not finish (it was still going 30 minutes later). So I guess that is a bug somewhere.
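One caveat with profiling this kind of workload: cProfile only instruments the thread it is enabled in, so work done in server threads never shows up, and the main thread's profile is dominated by waiting. A small sketch (hypothetical, not the uploaded benchmark script) illustrating the effect:

```python
import cProfile
import io
import pstats
import threading

def worker():
    # CPU work done here, in other threads, is invisible to the profiler.
    for _ in range(5):
        sum(range(10000))

def main():
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

profiler = cProfile.Profile()
profiler.enable()   # profiles the current (main) thread only
main()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()
print(report)  # main() and join() appear; worker() does not
```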


7 comments:

pitrou said...

Thank you, interesting benchmarks.

If you have a bit more time it would be nice to profile your test case, in order to know what makes 3.2 slower than 2.7.

Regards

Antoine Pitrou.

illume said...

ah, good idea. I'll upload the code used to make them too.

cheers,

illume said...

hello again Antoine,

I've uploaded the profile(python cProfile), results and scripts here: http://rene.f0o.com/~rene/stuff/cherrypy_benchmarks

So you should be able to see profile details, and also repeat the tests yourself if you wish.

Strangely using python2.7 with the benchmark and cProfile did not finish (well it was still going 30 minutes later). So I guess that is a bug somewhere. I'll leave it on for much longer and see if it completes at all.

I haven't had time to look through the profiling results so far. If there is a particular profile view you would like to see please let me know?


cheers,

pitrou said...

Hmm, the profile dump states that most of the time is spent in the sleep() method... which probably means that either you profiled the wrong script, or (more likely) profiling doesn't cope very well with multi-threaded workloads. Aw, too bad.

Antoine.

cats2ndlife said...

I'm not sure what this benchmark is trying to find out. Are you testing CherryPy or Python's new GIL?

If you are trying to find out how the latest CherryPy scales with the different versions of Python, why don't you use a more realistic response size like a 20k string with about 1k of random data sprinkled on fixed but non-continuous parts of the response strings to simulate dynamic responses' caching and serializing behavior?

For all of the benchmark numbers posted, the request numbers served barely went down at all from 25 threads to 400 threads, so there's not much meaning here for both CP and Python. The only thing this benchmark shows is Python 2.x is about 1/6 faster than 3.x.

Also, it would be very interesting to see 2 CP processes running on your box. 1 process per core, and see if they make any difference.

Daniel Skinner said...

600megs and 2gigs memory consumption seems extreme. If you're running the tests on a cherrypy app with sessions turned on, make sure to set it to be file based, otherwise you're just going to build up a ton of _RLock objects in memory

hueso said...

Does Cherrypy also perform well in a Windows environment when deployed as a server? CherryPy 3.1.2 & Windows Server 2003
thanks!
Martin