Comments on making apps, making webs.: Unladen swallow review. Still needs work.

@manuelg: done :)

2010-01-11T17:18:06.023+00:00

@manuelg: done :)

illume: wow, that's a lot of writing. I'm ...

2010-01-11T17:15:40.416+00:00

illume: wow, that's a lot of writing. I'm not going to read it all... but may I suggest you move on to something else?

Re "Unladen swallow uses a bundled version of...

2010-01-11T16:59:53.028+00:00

Re "Unladen swallow uses a bundled version of it, since they often need the latest... and they need the latest fixes to it. This might make it difficult for OS packagers.":

As one of the Unladen Swallow developers and a core Python developer, I completely agree. I'd never want to merge something to CPython that required building LLVM as part of the Python build. That's why my main task for the next 2-3 months is to make sure LLVM 2.7 is good enough that we can just specify a dependency on it, and not have to include it in CPython.

Cont'd. - "Depending on C++ is a big iss...

2010-01-11T16:24:15.590+00:00

Cont'd.

- "Depending on C++ is a big issue for some people. Since some applications and environments can not use C++....It also adds dependencies to C++ libraries - which is a nono for some python uses." Can you elaborate on this? Which uses will prevent the use of C++ libraries? Which applications? Which environments? Merely saying "some" is not very helpful.

- "They only benchmark programs they care about, and others are ignored. There are other peoples reports of their programs going slower, or not faster. However the response seems to be 'that is not our goal'. This is fine for them, as they are doing the work, and they want their own work loads to go faster. However, I'm not sure if ignoring the rest of the python communities work loads is a good idea if they are considering moving it into trunk."

I presume you are referring to our conscious decision to avoid working on optimizing numeric code. Yes, we have prioritized the workloads we see most commonly in the Python community (of which Google is a member): text-processing and web-apps. We have not said, "no"; our answer has been, "not yet". We feel that given the presence of high-quality libraries like NumPy to support high-performance numeric computing in Python, our time would be better-spent working on other areas of the interpreter.

Also, there *are* reports of real-world workloads going faster when using Unladen Swallow: http://groups.google.com/group/unladen-swallow/browse_thread/thread/289f25d6f3fed454/8e88d8b189669afb is one such report.

I would also point you to the recent thread about html5lib being slower on Unladen Swallow (http://code.google.com/p/unladen-swallow/issues/detail?id=105): the user reported a workload that Google, frankly, cares nothing about, but we investigated, found areas for improvement, and have already implemented patches to address some of the shortcomings. Please do not say that we only care about our own workloads, since that is obviously incorrect.

- "pygame tests: some crashes, mostly work." I installed CPython 2.6.1 (Unladen Swallow's baseline) and installed Pygame 1.9.1 from SVN (svn://seul.org/svn/pygame/tags/release_1_9_1release), compiled against libsdl 1.2.13_6 from Macports on Darwin (OS X 10.5.8; Apple gcc 4.0.1). Running `/tmp/cpython2.6/bin/python setup.py test` results in 17 failed tests, the exact same number as failed by Unladen Swallow on the same system.

Which version of Pygame are you using? What operating system/compiler/libsdl? Does CPython 2.6.1 pass all of Pygame's tests on your system? What options was your CPython configured with? If CPython 2.6.1 doesn't pass all of Pygame's tests, I think it's unrealistic and unfair to hold Unladen Swallow to a different standard.

Lastly, I would completely agree that Unladen Swallow still needs work. What we have sought to create is not a finished product, but a stable platform that will yield additional performance gains for years to come. We have done the work of adding a JIT compiler to CPython, but a JIT compiler is an extremely flexible tool, and we have by no means exhausted its full potential. I'm open to creating a branch of py3k in the main CPython repository and fixing the punchlist of critical issues there, rather than immediate, outright merger with py3k. Unladen Swallow has always been a branch, and this is a branch we want to land eventually. The sooner we expose the technology to the full diversity of the Python community, the faster the implementation will mature. Google has been an informative test bed, but Google's Python usage is far more homogeneous than that outside the company.

I look forward to discussing this with you further.

Thanks,
Collin Winter

Hey Rene, I'm the tech lead for Unladen Swallo...

2010-01-11T16:22:14.509+00:00

Hey Rene, I'm the tech lead for Unladen Swallow. I'd like to address some of the points you raised in your post:

- First, I feel your criticism of the html5lib benchmark is unfounded. You said, "I don't like this style of benchmark which does not account for the first run. Many times you only want to run code on a set of data once." That is simply false. The html5lib benchmark *does* take the first run into consideration; if you look at the source code (http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_html5lib.py), you'll see that it does no priming runs, and that all 10 runs are counted toward the final score. Indeed, as I said in http://code.google.com/p/unladen-swallow/issues/detail?id=105#c4 (the same bug report you cited), "I want us to be penalized appropriately for that first slow run, since this isn't a daemon process we're talking about."

Full benchmark results follow. Note that the change is considered statistically "not significant" due to the high first run time, which increases the standard deviation.

### html5lib ###
Min: 13.185609 -> 12.256459: 1.0758x faster
Avg: 13.876981 -> 12.997757: 1.0676x faster
Not significant
Stddev: 1.15007 -> 1.41818: 1.2331x larger
Timeline: http://tinyurl.com/y8aldmr
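
As a quick sanity check, the ratios in the results above can be recomputed from the reported figures. This is only a sketch in Python 3: the `results` dictionary and the `ratio` helper are illustrative names, not part of the benchmark suite.

```python
# Figures copied from the html5lib results above (before, after), in seconds.
results = {
    "Min": (13.185609, 12.256459),
    "Avg": (13.876981, 12.997757),
    "Stddev": (1.15007, 1.41818),
}


def ratio(before, after):
    """Return how many times larger the bigger value is than the smaller."""
    return max(before, after) / min(before, after)


# Round to four decimal places, matching the precision of the report.
ratios = {name: round(ratio(b, a), 4) for name, (b, a) in results.items()}

for name, value in ratios.items():
    print(name, value)
```

For Min and Avg the ratio reads as "times faster"; for Stddev it reads as "times larger", since the spread grew rather than shrank.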

I do feel that the subsequent runs give a good indication of Unladen's potential, though, and investigating the differences between the first run and the n-1 subsequent runs indicates interesting areas for improvement. The first run still counts, after all.
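
To illustrate why one slow first run can push the standard deviation up, here is a minimal Python 3 sketch. The timings are hypothetical values chosen for illustration, not measured data, and `summarize` is an assumed helper name rather than anything from the benchmark suite.

```python
import statistics

# Hypothetical per-run timings in seconds (NOT measured data): one slow
# first run while hot functions are compiled, then nine warmed-up runs.
timings = [17.0, 12.6, 12.5, 12.4, 12.6, 12.5, 12.3, 12.5, 12.4, 12.6]


def summarize(samples):
    """Return (min, mean, sample standard deviation) for a list of timings."""
    return min(samples), statistics.mean(samples), statistics.stdev(samples)


all_runs = summarize(timings)        # first run counted, no priming runs
warm_runs = summarize(timings[1:])   # first run dropped, for comparison only

print("all runs : min=%.2f avg=%.2f stddev=%.2f" % all_runs)
print("warm only: min=%.2f avg=%.2f stddev=%.2f" % warm_runs)
```

With the first run counted, the minimum is unchanged but both the average and the standard deviation rise, and a larger spread makes it harder for a significance test to separate two sets of timings.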

- The pauses you're seeing in your animations come from the fact that Unladen blocks execution to compile functions down to machine code. This totally sucks, and we have a development branch (http://code.google.com/p/unladen-swallow/source/browse/branches/background-thread) dedicated to fixing it. Working within the limitations of CPython's underlying thread model has made this difficult, however, and has required more time/care than we had anticipated, which is why it's not yet merged to trunk. The lack of this support will block merger with Py3k, however, since I feel that it is -- rightly -- a show-stopper for applications like yours.

- "It's always good to try benchmarking on your own workloads, rather than believing benchmarks from vendors." I completely agree. If you (or anyone else) find workloads that Unladen does not perform well on, please tell us, don't keep them to yourself! We're far more interested in real-world benchmarks that we fail at, since that gives us concrete areas to target our attention to.

I would hope that if your benchmark found a performance regression in CPython, that you would send a report to python-dev, rather than merely posting about it on your blog and hoping the right people see it.

- "Memory usage is higher with unladen swallow." We have never made a secret of this: our two LLVM-based releases have both counted "increased memory usage" among the lowlights (http://code.google.com/p/unladen-swallow/wiki/Release2009Q2, http://code.google.com/p/unladen-swallow/wiki/Release2009Q3). As you can see from the Release2009Q3 page, we have made significant progress toward reducing memory usage in our fastest configuration, though, as the release notes acknowledge, we're still 2-3x higher than CPython. Part of this is inevitable: the necessary analysis/code generation facilities *will* increase memory usage. If memory usage is your primary bottleneck, Unladen Swallow's ./configure has a --without-llvm option, which cuts out any LLVM-facing parts of the system and dramatically lowers memory usage.

We've improved the situation some more in Q4; memory benchmarks will be included in the release notes next week. I'll link to them here once I have memory usage data for all benchmarks (being generated now).

Jessie: wow, that's a lot of writing. I'm...

2010-01-11T15:30:41.316+00:00

Jessie: wow, that's a lot of writing. I'm not going to read it all... but may I suggest you move on to something else?

part 2: Also this: "The speed of unladen sw...

2010-01-11T15:24:00.126+00:00

part 2:

Also this:

"The speed of unladen swallow? Slower than normal python for *my* benchmarks. Well, I couldn't benchmark some things because they crash with unladen... so eh. I might be able to track these problems down, but I just can't see the benefit so far. My programs I've tried do not go faster, so I'm not going to bother."

That's a terrible attitude. You threw your pet benchmarks at it, had a problem, and instead of asking why, you threw your hands up and posted this.

No, U-S going in is not a done deal, the opinions of Guido and me notwithstanding. You, and just about anyone interested in core development, should know well enough that no changes go into core without extensive, and sometimes painful, debate and discussion.

The only vested interest I have is making python better, faster. U-S makes no promises that it will speed up every single workload. If they did, they would have long ago been called out. The information is sitting there for anyone to read.

The bug tracker and IRC room are open to all, like any other project. There's a difference between a pointed "hey, there are problems and here they are, after I even gave the project participants a chance to explain something" and posting hyperbole.

In your response to Steve: "People are prepar...

2010-01-11T15:23:46.828+00:00

In your response to Steve: "People are preparing a pep to bring it into the py3k tree. However, the main conclusion of this review was that it is not ready yet to do so. Hopefully the article as a whole reflects that position."

Maybe it's language semantics, and it can be taken either way: either you mean it's not ready for a PEP, or not ready for the merge back. If the former - the sooner the PEP, the better, so that Python-Dev can discuss it ad nauseam. If the latter: it has bugs - I was a little tweaked that you didn't bother filing them, or posting questions to the mailing list. Later on, I saw you at least filed them, and as you can see here:

http://code.google.com/p/unladen-swallow/issues/detail?id=110

The jitter you noticed on the pygame test is known/intentional. A simple question on the mailing list would have cleared it up instead of unilaterally declaring it "not ready based on my benchmarks".

And of course we can discuss the negatives; I'm not suffering some hangup where only positives can be discussed. What I am asking is that the project itself is given feedback, so it can improve, and correct information is given out.

Some negatives? LLVM is C++, which is a jump from the comfortable C code core development is used to. The speed increases are focused on longer-running processes (which is intentional) so your startup time test is largely irrelevant, but startup *is* slower.

The fact that there are workloads which are not sped up, is again: Known. Not surprising, but I think what got me was the tone of your post, for instance:

"Random pauses for applications is a big FAIL. Animations fail to work, and user interactions pause or stutter. Web requests can take longer for unknown reasons etc. I'm not sure what causes the pauses, but they be there(arrrr, pirate noise)."

That's because it's the JIT compilation kicking in. A simple question in the IRC room or on the mailing list would have answered that for you. It's not "a big FAIL".

Then this:

"Unladen-swallow has a google reality distortion bubble around it. "

The term "reality distortion bubble" is not a compliment. It's an insult. There is no bubble, there is a group of people doing semi-targeted work, some of whom work for google, some of whom do not.

There is no bubble, you are right though: there are things which are not helped - but there are a lot of workloads which *are* helped. So it seems you're throwing the baby out with the bathwater.

Then this:

"They arranged the benchmark so unladen-swallow is run 10 times, to allow unladen swallow to warm up. Since Cpython is faster the first time through."

They didn't arrange anything. You make it sound like there is some mass conspiracy going on. The project has a JIT, the jit compiles hot functions over a period of time/multiple runs. Again; nothing to see here, move along.

end of part 1

the graph looks strange. the points don't matc...

2010-01-11T13:01:46.002+00:00

the graph looks strange. the points don't match the markings on the x axis. I guess it's because the markings are equally spaced but their numbers are not.

@Jesse: I understand why you are so defensive. I ...

2010-01-11T09:41:31.116+00:00

@Jesse: I understand why you are so defensive.

I did not say unladen swallow is not ready for a pep, as you falsely claim. I've made some arguments that it is not ready to go into py3k and that it should have more outside review.

Secondly I didn't say U-S was not faster. In fact I showed some places where it was faster. I did however show cases where it was not faster, trying to balance up the only benchmarks I've seen from the project itself.

Framing the conversation of what their goals are for performance is the bubble I speak of.

Should we only be able to talk about the plus sides of their alpha development software? Are no criticisms allowed? Should people only look at benchmarks and claims from vendors?

There are two things that make it sound like the inclusion is a done deal. Firstly your blog post is titled: 'Unladen Swallow: Python 3’s Best Feature'. That makes it sound like it's already in python 3.

Secondly, Guido said "Merging Py3k and Unladen Swallow? SGTM!" on his twitter account. For those not hip with the cool kids slang... SGTM is twitter language for 'Sounds Good To Me'.

From those statements, it sounds like the process was going ahead rather quickly.

From the unladen swallow project:
"Beyond these benchmarks, there are also a variety of workloads we're explicitly not interested in benchmarking" They mention extensions like numpy not being one of the goals of the project.

They also say 'Similarly, workloads that involve a lot of IO like GUIs, databases or socket-heavy apps would, we feel, be inappropriate'. So I feel safe in keeping the claim that they are interested in their own preferred workloads.

I also feel confident in the claim that there are a number of workloads that are not faster with unladen swallow.

Rene; First; a PEP bringing it into Py3k will hap...

2010-01-10T14:16:07.244+00:00

Rene;

First; a PEP bringing it into Py3k will happen, but if and when it is merged in, it will need to meet compatibility and performance/stability criteria. A PEP is a proposal, not the final step in inclusion.

Saying "it's not ready for a PEP" is akin to saying "anything that has bugs can never be proposed for inclusion into core". U-S has bugs, and they will need to be fixed, like any other piece of software.

Also, you'll note in the wiki pages, they note the increased memory usage. So, that's not new news.

Other people have used U-S outside of your supposed "reality distortion field" to much success, just check out the mailing list. U-S is faster, and will continue to get faster. So you're hand-waving about the trunk/semi-unstable version of an in-development project having bugs.

"Their goals are to speed it up for their own uses" - is simply wrong. See http://code.google.com/p/unladen-swallow/wiki/Benchmarks - they include a fairly robust number of real-world benchmarks. If you want to help; spin a patch for trunk and submit yours.

There are plenty of non google engineers involved in the U-S work, like any other open source project. Accusing them of altering reality, outright lies, etc is a little extreme.

hello again, @pitrou: haven't had a chance to...

2010-01-10T10:54:06.536+00:00

hello again,

@pitrou: haven't had a chance to look into the py3k memory usage yet. But it's on my todo list :) I've brought it up with the cherrypy developer(s) too. Hopefully they'll be able to track it down quicker than I can.

Hi Steve, People are preparing a pep to bring it ...

2010-01-10T10:50:42.321+00:00

Hi Steve,

People are preparing a pep to bring it into the py3k tree. However, the main conclusion of this review was that it is not ready yet to do so. Hopefully the article as a whole reflects that position.

My position is that it is not ready for outside use. Their goals are to speed it up for their own uses, and it only makes sense that they work on things for themselves. However, by bringing it into the main cpython, it is only fair to evaluate it by different criteria - instead of the criteria they have set for themselves.

There are a number of bits of quantitative information in the review already of areas and issues to improve. No, I am not going to set up a buildbot for it, or spend any more effort with unladen swallow for now.

They have framed the conversation on measuring U-S performance. That is the reality distortion bubble that I'm talking about.

cheers,

Of course "it's too early to declare unla...

2010-01-10T04:28:30.277+00:00

Of course "it's too early to declare unladen swallow done". That's why they haven't.

Quantitative information is always welcome. Have you considered setting up a buildbot or similar continuous integration system to continuously monitor progress?

As to the "Google reality distortion bubble", I don't think Google as a company has a huge amount to gain by propagating misinformation about performance. They appear to already be using swallow where it helps (YouTube), and have plenty of other options where it doesn't.

@Alex: right, I'm just curious if there are an...

2010-01-10T04:11:04.922+00:00

@Alex: right, I'm just curious if there are any particular examples that have widespread use. I did some Atmel microcontroller programming and they had g++ available as a front-end since the GCC backend for that family of microcontrollers was already in place.

Good to know about --without-llvm though!

Chris, I imagine those are embedded environments. ...

2010-01-10T03:54:48.403+00:00

Chris, I imagine those are embedded environments. Luckily unladen swallow provides a compile time --without-llvm flag that removes the LLVM (and C++) dependency.

Hi Illume, "Depending on C++ is a big issue ...

2010-01-10T03:30:40.374+00:00

Hi Illume,

"Depending on C++ is a big issue for some people. Since some applications and environments can not use C++."

Can you cite sources or provide an example to this effect? I'm curious as to what kind of environments these are.

@illume: That's right, the sheer memory consum...

2010-01-09T20:50:50.379+00:00

@illume: That's right, the sheer memory consumption looks a bit disturbing... Are you sure they were both compiled with the same options? (and especially the same pointer width: either 32 bits or 64 bits)

@pitrou: updated it to say 80% of the speed. That...

2010-01-09T18:43:58.778+00:00

@pitrou: updated it to say 80% of the speed. That's probably more accurate :) However, as py3k has very large memory usage in those benchmarks - 2 GB vs. 600 MB - it can be way slower if you do not have enough RAM and it swaps to disk. Memory usage is the main measurement for speed for lots of applications, as many apps are memory bound.

cya.

Yeah! A post about Unladen Swallow with actual num...

2010-01-09T16:40:35.713+00:00

Yeah! A post about Unladen Swallow with actual numbers! Very informative. Thanks much for this.

I'm not sure what you call "half the spee...

2010-01-09T16:26:36.843+00:00

I'm not sure what you call "half the speed". 550 isn't half of 670 in my book (from your own CherryPy benchmarks).

Thanks for digging into this. I've been getti...

2010-01-09T14:40:17.271+00:00

Thanks for digging into this. I've been getting curious about how U-S does on real applications.