Friday, January 08, 2010

Unladen swallow review. Still needs work.

Tried out unladen swallow on two workloads today, after the announcement that they are going to try to bring it into core. So I finally got around to trying it (last time I tried, the build failed). An hour or so later the build finished and I could try it out. The C++ LLVM takes ages to compile, which is what took most of the extra time. What follows is a review of unladen swallow as it stands today.

The good part? Extensions (mostly) work! w00t. I could compile the pygame C extension and run things with it.

Now to run the code I care about, my workloads, to see if their numbers hold true for me:
cherrypy webserver benchmark: crash
pygame tests: some crashes, mostly work.
pygame.examples.testsprite : random pauses in the animation.


The crashes I've found so far seem to be thread related. The cherrypy benchmark and some of the pygame tests both use threads, so I'm guessing that's it.

Random pauses in applications are a big FAIL. Animations fail to work, and user interactions pause or stutter. Web requests can take longer for unknown reasons, etc. I'm not sure what causes the pauses, but they be there (arrrr, pirate noise).

LLVM is a big, fast-moving dependency written in another language, with a whole other runtime (C++). Unladen swallow uses a bundled version of it, since they often need the latest fixes. This might make it difficult for OS packagers. Or LLVM might stabilise soon, and it could be a non-issue. Depending on C++ is a big issue for some people, since some applications and environments cannot use C++.

The speed of unladen swallow? Slower than normal python for *my* benchmarks. Well, I couldn't benchmark some things because they crash with unladen... so eh. I might be able to track these problems down, but I just can't see the benefit so far. The programs I've tried do not go faster, so I'm not going to bother.

Python 3 seems to run at 80% of the speed for IO-type programs like web servers (cherrypy); see the benchmarks in my previous post. Unladen-swallow only seems to be 10-20% slower for pygame games, but the random pauses make it unusable.

Python 2.x + psyco is way faster still on both these workloads: 20%-100% faster than python2.6 alone. Psyco and Stackless are both still being developed, and both seem to be giving better results than unladen swallow. Selective optimisation with tools like shedskin, tinypyC++, rpython, or cython can give you 20x speedups. So for many, writing code in a subset of python to get the speedups is worth it. Other people will be happy to write the 1% of their program that needs the speed in C. That is the good thing about unladen swallow: you should be able to keep using any C/C++/Fortran extensions.

Unladen-swallow has a google reality distortion bubble around it. They only benchmark programs they care about, and others are ignored. There are other people's reports of their programs going slower, or not faster. However the response seems to be 'that is not our goal'. This is fine for them, as they are doing the work, and they want their own workloads to go faster. However, I'm not sure if ignoring the rest of the python community's workloads is a good idea if they are considering moving it into trunk.

It's too early to declare unladen-swallow done and good, imho. Better research needs to go into it before declaring it an overall win at all. Outside review should be done to see if it actually makes things quicker/better for people. For my workloads, and for other people's workloads, it is worse. It also adds dependencies on C++ libraries, which is a no-no for some python uses. Extra dependencies also increase the startup time. Startup time with unladen swallow is 33% slower compared to python for me (time python -c "import time").
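For anyone who wants to repeat this on their own machine, a rough sketch of the startup-time measurement is below. It just times how long each interpreter takes to start, import a module, and exit; the interpreter path is whatever build you want to compare (I'm using the current interpreter as a stand-in here).

```python
import subprocess
import sys
import time

def startup_time(interpreter, runs=5):
    """Average wall-clock seconds to start an interpreter and exit."""
    total = 0.0
    for _ in range(runs):
        start = time.time()
        # Mirrors `time python -c "import time"` from the post.
        subprocess.call([interpreter, "-c", "import time"])
        total += time.time() - start
    return total / runs

# Compare e.g. a stock python against an unladen-swallow build by
# passing each binary's path in turn.
print(startup_time(sys.executable))
```

Swap in the path to each interpreter build and compare the averages.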

Let's look at one of their benchmarks: html5lib. See the issue 'html5lib no quicker or slower than CPython'. They arranged the benchmark so unladen-swallow is run 10 times, to allow unladen swallow to warm up, since CPython is faster the first time through.

(Chart: blue = unladen-swallow, red = CPython 2.6. Time (y) for 10 runs (x).)

Notice how jumpy unladen's performance is on the other runs? This might be related to the random pauses unladen swallow has. I don't like this style of benchmark which does not account for the first run. Many times you only want to run code on a set of data once.
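To make the point concrete, here's a minimal benchmark harness in the style I'd prefer: it keeps the cold first run in the results instead of throwing it away, so a slow warm-up shows up in the numbers. The workload is just a placeholder list comprehension.

```python
import time

def bench(fn, runs=10):
    """Time fn once per run, keeping the cold first run in the results."""
    times = []
    for _ in range(runs):
        start = time.time()
        fn()
        times.append(time.time() - start)
    return times

# Placeholder workload; substitute the code you actually care about.
times = bench(lambda: [i * i for i in range(100000)])
print("first (cold) run: %.6f" % times[0])
print("average of all runs: %.6f" % (sum(times) / len(times)))
```

If the first run dominates the average, a JIT's warm-up cost matters for your workload; if the runs are jumpy, so does compilation kicking in mid-run.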

When looking at their benchmark numbers, consider how they structure their benchmarks. It's always good to try benchmarking on your own workloads, rather than believing benchmarks from vendors.

Memory usage is higher with unladen swallow. It takes around twice as much memory just to start the interpreter. The extra C++ memory management libraries, the extra set of byte code, and then extra machine code for everything all take their toll. Memory usage is very important for servers and for embedded systems. It is also important for most other types of programs. The main bottleneck is often not the cpu, but memory, disk, and other IO. So they are trading (theoretically) better cpu speed for worse memory usage, and since memory is often the bottleneck, the runtimes will often be slower for lots of workloads.
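A quick way to compare interpreter startup memory is to ask each interpreter to report its own peak resident set size right after starting. This is a Unix-only sketch (the `resource` module), and note that `ru_maxrss` is kilobytes on Linux but bytes on OS X, so check your platform's getrusage(2) man page before comparing numbers.

```python
import subprocess
import sys

# One-liner each interpreter runs to report its own peak RSS at startup.
SNIPPET = ("import resource;"
           "print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)")

def startup_rss(interpreter):
    """Peak resident set size of a freshly started interpreter."""
    out = subprocess.check_output([interpreter, "-c", SNIPPET])
    return int(out)

# Compare e.g. a stock python against an unladen-swallow build by
# passing each binary's path in turn.
print(startup_rss(sys.executable))
```

Run it once per interpreter build and compare; the units only need to be consistent across the two builds on the same machine.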

It seems python2.6 will still be faster than unladen swallow for many people's workloads. If they do not get other people's programs and workloads working faster, or working at all, it will not be a carrot. As people's programs work, and go faster, with python2.6/2.7, it will be a stick*.

Unladen swallow has not (yet) got to its 5x-faster goal, and for many workloads it is still slower or the same speed. For these reasons, I think it's too early to think about incorporating unladen swallow into python.

* (ps... ok, that made no sense, sorry. Sticks and carrots?!?... donkeys like carrots, but so do ponies. I don't think we should hit people with sticks. Also people don't like carrots as much as perhaps chocolate or beer. Perhaps all this time hitting people with sticks and trying to get them to do things with carrots is the problem. Python 3 has heaps of cool things in it already... but more cool things always helps! Beer and chocolate would probably work best.)

22 comments:

Jean-Paul Calderone said...

Thanks for digging into this. I've been getting curious about how U-S does on real applications.

pitrou said...

I'm not sure what you call "half the speed". 550 isn't half of 670 in my book (from your own CherryPy benchmarks).

matt harrison said...

Yeah! A post about Unladen Swallow with actual numbers! Very informative. Thanks much for this.

illume said...

@pitrou: updated it to say 80% of the speed. That's probably more accurate :) However, as py3k has very large memory usage in those benchmarks (2 gigs vs 600MB), it can be way slower if you do not have enough ram and it swaps to disk. Memory usage is the main determinant of speed for lots of applications, as many apps are memory bound.

cya.

pitrou said...

@illume: That's right, the sheer memory consumption looks a bit disturbing... Are you sure they were both compiled with the same options? (and especially the same pointer width: either 32 bits or 64 bits)

Chris Leary said...

Hi Illume,

"Depending on C++ is a big issue for some people. Since some applications and environments can not use C++."

Can you cite sources or provide an example to this effect? I'm curious as to what kind of environments these are.

Alex said...

Chris, I imagine those are embedded environments. Luckily unladen swallow provides a compile-time --without-llvm flag that removes the LLVM (and C++) dependency.

Chris Leary said...

@Alex: right, I'm just curious if there are any particular examples that have widespread use. I did some Atmel microcontroller programming and they had g++ available as a front-end since the GCC backend for that family of microcontrollers was already in place.

Good to know about --without-llvm though!

Steve said...

Of course "it's too early to declare unladen swallow done". That's why they haven't.

Quantitative information is always welcome. Have you considered setting up a buildbot or similar continuous integration system to continuously monitor progress?

As to the "Google reality distortion bubble", I don't think Google as a company has a huge amount to gain by propagating misinformation about performance. They appear to already be using swallow where it helps (YouTube), and have plenty of other options where it doesn't.

illume said...

Hi Steve,

People are preparing a pep to bring it into the py3k tree. However, the main conclusion of this review was that it is not ready yet to do so. Hopefully the article as a whole reflects that position.

My position is that it is not ready for outside use. Their goals are to speed it up for their own uses, and it only makes sense that they work on things for themselves. However, by bringing it into the main cpython, it is only fair to evaluate it by different criteria, instead of the criteria they have set for themselves.

There are a number of bits of quantitative information in the review already of areas and issues to improve. No, I am not going to set up a buildbot for it, or spend any more effort with unladen swallow for now.

They have framed the conversation on measuring U-S performance. That is the reality distortion bubble that I'm talking about.

cheers,

illume said...

hello again,

@pitrou: haven't had a chance to look into the py3k memory usage yet. But it's on my todo list :) I've brought it up with the cherrypy developer(s) too. Hopefully they'll be able to track it down quicker than I can.

Jesse said...

Rene;

First: a PEP to bring it into Py3k will happen, but when, and if, it is merged in, it will need to meet compatibility and performance/stability criteria. A PEP is a proposal, not the final step in inclusion.

Saying "it's not ready for a PEP" is akin to saying "anything that has bugs can never be proposed for inclusion into core". U-S has bugs, and they will need to be fixed, like any other piece of software.

Also, you'll note in the wiki pages, they note the increased memory usage. So, that's not new news.

Other people have used U-S outside of your supposed "reality distortion field" to much success; just check out the mailing list. U-S is faster, and will continue to get faster. So you're hand-waving about a trunk/semi-unstable, in-development version having bugs.

"Their goals are to speed it up for their own uses" - is simply wrong. See http://code.google.com/p/unladen-swallow/wiki/Benchmarks - they include a fairly robust number of real-world benchmarks. If you want to help; spin a patch for trunk and submit yours.

There are plenty of non google engineers involved in the U-S work, like any other open source project. Accusing them of altering reality, outright lies, etc is a little extreme.

illume said...

@Jesse: I understand why you are so defensive.

I did not say unladen swallow is not ready for a pep, as you falsely claim. I've made some arguments that it is not ready to go into py3k and that it should have more outside review.

Secondly I didn't say U-S was not faster. In fact I showed some places where it was faster. I did however show cases where it was not faster, trying to balance up the only benchmarks I've seen from the project itself.

Framing the conversation of what their goals are for performance is the bubble I speak of.

Should we only be able to talk about the plus sides of their alpha development software? Are no criticisms allowed? Should people only look at benchmarks and claims from vendors?

There are two things that make it sound like the inclusion is a done deal. Firstly your blog post is titled: 'Unladen Swallow: Python 3’s Best Feature'. That makes it sound like it's already in python 3.

Secondly, Guido said "Merging Py3k and Unladen Swallow? SGTM!" on his twitter account. For those not hip with the cool kids slang... SGTM is twitter language for 'Sounds Good To Me'.

From those statements, it sounds like the process was going ahead rather quickly.

From the unladen swallow project:
"Beyond these benchmarks, there are also a variety of workloads we're explicitly not interested in benchmarking" They mention extensions like numpy not being one of the goals of the project.

They also say 'Similarly, workloads that involve a lot of IO like GUIs, databases or socket-heavy apps would, we feel, be inappropriate'. So I feel safe in keeping the claim that they are interested in their own preferred workloads.

I also feel confident in the claim that there are a number of workloads that are not faster with unladen swallow.

StyXman said...

the graph looks strange. the points don't match the markings in the x axis. I guess it's because the markings are spatially equally separated but their numbers are not.

Jesse said...

In your response to Steve: "People are preparing a pep to bring it into the py3k tree. However, the main conclusion of this review was that it is not ready yet to do so. Hopefully the article as a whole reflects that position."

Maybe it's language semantics; and can be taken either way. Either you mean it's not ready for a PEP, or the merge back. If the former - the sooner the PEP, the better so that Python-Dev can discuss it ad-nauseum. If the latter; it has bugs - I was a little tweaked that you didn't bother filing them, or posting questions to the mailing list. Later on, I saw you at least filed them, and as you can see here:

http://code.google.com/p/unladen-swallow/issues/detail?id=110

The jitter you noticed on the pygame test is known/intentional. A simple question on the mailing list would have cleared it up instead of unilaterally declaring it "not ready based on my benchmarks".

And of course we can discuss the negatives; I'm not suffering some hangup where only positives can be discussed. What I am asking is that the project itself is given feedback, so it can improve, and correct information is given out.

Some negatives? LLVM is C++, which is a jump from the comfortable C code core development is used to. The speed increases are focused on longer-running processes (which is intentional) so your startup time test is largely irrelevant, but startup *is* slower.

The fact that there are workloads which are not sped up, is again: Known. Not surprising, but I think what got me was the tone of your post, for instance:

"Random pauses for applications is a big FAIL. Animations fail to work, and user interactions pause or stutter. Web requests can take longer for unknown reasons etc. I'm not sure what causes the pauses, but they be there(arrrr, pirate noise)."

That's because it's the JIT compilation kicking in. A simple question in the IRC room or on the mailing list would have answered that for you. It's not "a big FAIL".

Then this:

"Unladen-swallow has a google reality distortion bubble around it. "

The term "reality distortion bubble" is not a compliment. It's an insult. There is no bubble, there is a group of people doing semi-targetted work, some of whom work for google, some of whom do not.

There is no bubble, you are right though: there are things which are not helped - but there are a lot of workloads which *are* helped. So it seems you're throwing the baby out with the bathwater.

Then this:

"They arranged the benchmark so unladen-swallow is run 10 times, to allow unladen swallow to warm up. Since Cpython is faster the first time through."

They didn't arrange anything. You make it sound like there is some mass conspiracy going on. The project has a JIT, the jit compiles hot functions over a period of time/multiple runs. Again; nothing to see here, move along.

end of part 1

Jesse said...

part 2:

Also this:

"The speed of unladen swallow? Slower than normal python for *my* benchmarks. Well, I couldn't benchmark some things because they crash with unladen... so eh. I might be able to track these problems down, but I just can't see the benefit so far. My programs I've tried do not go faster, so I'm not going to bother."

That's a terrible attitude. You threw your pet benchmarks at it, had a problem, and instead of asking why, you threw your hands up and posted this.

No, U-S going in is not a done deal, the opinions of Guido and I notwithstanding. You, and just about anyone interested in core development should know well enough that no changes go into core without extensive, and sometimes painful debate and discussion.

The only vested interest I have is making python better, faster. U-S makes no promises that it will speed up every, single workload. If they did, they would have long ago been called out. The information is sitting there for anyone to read.

The bug tracker and IRC room are open to all, like any other project. There's a difference between a pointed "hey, there are problems and here they are", after even giving the project participants a chance to explain something, and posting hyperbole.

illume said...

Jesse: wow, that's a lot of writing. I'm not going to read it all... but may I suggest you move on to something else?

Collin Winter said...

Hey Rene, I'm the tech lead for Unladen Swallow. I'd like to address some of the points you raised in your post:

- First, I feel your criticism of the html5lib benchmark is unfounded. You said, "I don't like this style of benchmark which does not account for the first run. Many times you only want to run code on a set of data once." That is simply false. The html5lib benchmark *does* take the first run into consideration; if you look at the source code (http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_html5lib.py), you'll see that it does no priming runs, and that all 10 runs are counted toward the final score. Indeed, as I said in http://code.google.com/p/unladen-swallow/issues/detail?id=105#c4 (the same bug report you cited), "I want us to be penalized appropriately for that first slow run, since this isn't a daemon process we're talking about."

Full benchmark results follow. Note that the change is considered statistically "not significant" due to the high first run time, which increases the standard deviation.

### html5lib ###
Min: 13.185609 -> 12.256459: 1.0758x faster
Avg: 13.876981 -> 12.997757: 1.0676x faster
Not significant
Stddev: 1.15007 -> 1.41818: 1.2331x larger
Timeline: http://tinyurl.com/y8aldmr

I do feel that the subsequent runs give a good indication of Unladen's potential, though, and investigating the differences between the first run and the n-1 subsequent runs indicates interesting areas for improvement. The first run is not all that counts, after all.

- The pauses you're seeing in your animations come from the fact that Unladen blocks execution to compile functions down to machine code. This totally sucks, and we have a development branch (http://code.google.com/p/unladen-swallow/source/browse/branches/background-thread) dedicated to fixing it. Working within the limitations of CPython's underlying thread model has made this difficult, however, and has required more time/care than we had anticipated, which is why it's not yet merged to trunk. Merging this support will block merger with Py3k, however, since I feel that it is -- rightly -- a show-stopper for applications like yours.

- "It's always good to try benchmarking on your own workloads, rather than believing benchmarks from vendors." I completely agree. If you (or anyone else) find workloads that Unladen does not perform well on, please tell us, don't keep them to yourself! We're far more interested in real-world benchmarks that we fail at, since that gives us concrete areas to target our attention to.

I would hope that if your benchmark found a performance regression in CPython, that you would send a report to python-dev, rather than merely posting about it on your blog and hoping the right people see it.

- "Memory usage is higher with unladen swallow." We have never made a secret of this: our two LLVM-based releases have both counted "increased memory usage" among the lowlights (http://code.google.com/p/unladen-swallow/wiki/Release2009Q2, http://code.google.com/p/unladen-swallow/wiki/Release2009Q3). As you can see from the Release2009Q3 page, however, we have made significant progress toward reducing memory usage in our fastest configuration, but as the release notes acknowledge, we're still 2-3x higher than CPython. Part of this is inevitable. The necessary analysis/code generation facilities *will* increase memory usage. If memory usage is your primary bottleneck, Unladen Swallow's ./configure has a --without-llvm option, which cuts out any LLVM-facing parts of the system and dramatically lowers memory usage.

We've improved the situation some more in Q4; memory benchmarks will be included in the release notes next week. I'll link to them here once I have memory usage data for all benchmarks (being generated now).

Collin Winter said...

Cont'd.

- "Depending on C++ is a big issue for some people. Since some applications and environments can not use C++....It also adds dependencies to C++ libraries - which is a nono for some python uses." Can you elaborate on this? Which uses will prevent the use of C++ libraries? Which applications? Which environments? Merely saying "some" is not very helpful.

- "They only benchmark programs they care about, and others are ignored. There are other peoples reports of their programs going slower, or not faster. However the response seems to be 'that is not our goal'. This is fine for them, as they are doing the work, and they want their own work loads to go faster. However, I'm not sure if ignoring the rest of the python communities work loads is a good idea if they are considering moving it into trunk."

I presume you are referring to our conscious decision to avoid working on optimizing numeric code. Yes, we have prioritized the workloads we see most commonly in the Python community (of which Google is a member): text-processing and web-apps. We have not said, "no"; our answer has been, "not yet". We feel that given the presence of high-quality libraries like NumPy to support high-performance numeric computing in Python, our time would be better-spent working on other areas of the interpreter.

Also, there *are* reports of real-world workloads going faster when using Unladen Swallow: http://groups.google.com/group/unladen-swallow/browse_thread/thread/289f25d6f3fed454/8e88d8b189669afb is one such report.

I would also point you to the recent thread about html5lib being slower on Unladen Swallow (http://code.google.com/p/unladen-swallow/issues/detail?id=105): the user reported a workload that Google, frankly, cares nothing about, but we investigated, found areas for improvement, and have already implemented patches to address some of the shortcomings. Please do not say that we only care about our own workloads, since that is obviously incorrect.

- "pygame tests: some crashes, mostly work." I installed CPython 2.6.1 (Unladen Swallow's baseline) and installed Pygame 1.9.1 from SVN (svn://seul.org/svn/pygame/tags/release_1_9_1release), compiled against libsdl 1.2.13_6 from Macports on Darwin (OS X 10.5.8; Apple gcc 4.0.1). Running `/tmp/cpython2.6/bin/python setup.py test` results in 17 failed tests, the exact same number as failed by Unladen Swallow on the same system.

Which version of Pygame are you using? What operating system/compiler/libsdl? Does CPython 2.6.1 pass all of Pygame's tests on your system? What options was your CPython configured with? If CPython 2.6.1 doesn't pass all of Pygame's tests, I think it's unrealistic and unfair to hold Unladen Swallow to a different standard.


Lastly, I would completely agree that Unladen Swallow still needs work. What we have sought to create is not a finished product, but a stable platform that will yield additional performance gains for years to come. We have done the work of adding a JIT compiler to CPython, but a JIT compiler is an extremely flexible tool, and we have by no means exhausted its full potential. I'm open to creating a branch of py3k in the main CPython repository and fixing the punchlist of critical issues there, rather than immediate, outright merger with py3k. Unladen Swallow has always been a branch, and this is a branch we want to land eventually. The sooner we expose the technology to the full diversity of the Python community, the faster the implementation will mature. Google has been an informative test bed, but Google's Python usage is far more homogeneous than that outside the company.

I look forward to discussing this with you further.

Thanks,
Collin Winter

Jeffrey Yasskin said...

Re "Unladen swallow uses a bundled version of it, since they often need the latest... and they need the latest fixes to it. This might make it difficult for OS packagers.":

As one of the Unladen Swallow developers and a core Python developer, I completely agree. I'd never want to merge something to CPython that required building LLVM as part of the Python build. That's why my main task for the next 2-3 months is to make sure LLVM 2.7 is good enough that we can just specify a dependency on it, and not have to include it in CPython.

manuelg said...

illume: wow, that's a lot of writing. I'm not going to read it all... but may I suggest you move on to something else?

illume said...

@manuelg: done :)