The best and easiest approach for CPython speed-ups: processor-specific C modules via distutils (MMX, SSE, 3DNow!, etc.)
pygame uses SDL, and some parts of it are written in assembler. Those parts detect which CPU they are running on and use MMX/SSE/3DNow!-optimized assembly routines if those processor features are available.

However, much of pygame and SDL is still in C, not ASM. So compiling with processor-specific features gives a very nice speedup for the C parts.

For some parts a 33% speedup or more can be gained just by changing compilation flags. I think this is the best and easiest approach to speeding up parts of CPython. Below are my notes and thoughts on the topic, mostly based on my experimentation with pygame compilation. This doesn't take into consideration compiling Python itself, or any of its modules, but a similar methodology could be applied to speed up Python's modules too. Note that recompiling Python itself to match your processor can give a large speedup as well.
So I have started experimenting with changing distutils to compile modules multiple times with different processor-specific flags, e.g. amodule_mmx.so, amodule_sse.so, etc. This is just for pygame at the moment, not for Python modules in general.
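Here's a minimal sketch of the distutils side, assuming GCC; the build_ext subclass, the "transform" module name, and the exact flag sets are my own illustration, not pygame's actual build code:

```python
# Sketch: build each Extension several times with different
# processor-specific GCC flags, producing e.g. transform_mmx.so and
# transform_sse.so alongside the plain transform.so.
from distutils.core import setup, Extension
from distutils.command.build_ext import build_ext

CPU_VARIANTS = {
    "mmx":   ["-march=pentium-mmx", "-mmmx"],
    "sse":   ["-march=pentium3", "-msse", "-mfpmath=sse"],
    "3dnow": ["-march=athlon", "-m3dnow"],
}

class build_ext_variants(build_ext):
    def finalize_options(self):
        build_ext.finalize_options(self)
        extra = []
        for ext in self.extensions:
            for suffix, flags in CPU_VARIANTS.items():
                extra.append(Extension(
                    ext.name + "_" + suffix,
                    sources=list(ext.sources),
                    extra_compile_args=(ext.extra_compile_args or []) + flags,
                    # Caveat: each renamed module also needs a matching
                    # init function in the C source, e.g. selected via a
                    # -D define, or the renamed .so will fail to import.
                ))
        self.extensions.extend(extra)

setup(
    name="amodule",
    ext_modules=[Extension("transform", ["transform.c"])],
    cmdclass={"build_ext": build_ext_variants},
)
```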
I am thinking of doing three to six sets of modules, to keep the disk usage down: an Athlon 3DNow! version, a P3 SSE version, a Pentium MMX version, etc. All of the pygame C modules together are around 430KB, but if I only rebuild the ones that are generally CPU intensive, that drops to 230KB per set. So three to six extra sets add 690KB to 1380KB uncompressed, or about 348KB to 696KB compressed.

If I add that to the Windows installer, or to the .deb packages, it's not too bad. For people who compile pygame for their own machine, I'm thinking of compiling only the variant which matches that machine. That means I'll need to put CPU detection code into the compile phase.
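As a rough illustration of compile-phase detection, something like this would work on Linux by parsing /proc/cpuinfo; this is my own sketch, and the real version would borrow SDL's CPUID-based code so it's portable:

```python
# Sketch: detect CPU features at build time by reading /proc/cpuinfo.
# Linux-only and purely illustrative.
def detect_cpu_features():
    features = set()
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    features.update(line.split(":", 1)[1].split())
                    break
    except IOError:
        pass  # not Linux, or /proc unavailable: build the generic module only
    return features

if __name__ == "__main__":
    features = detect_cpu_features()
    # Pick the single best variant this machine supports, best first.
    for variant in ("sse2", "sse", "3dnow", "mmx"):
        if variant in features:
            print("build variant: " + variant)
            break
    else:
        print("build generic only")
```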
People who distribute their own py2exe'd version will have the option of including extra CPU-specific modules, or of leaving some out to keep their download a little smaller.

However, I still need to finish the CPU detection code. I'm going to base it on the SDL code, since pygame requires SDL anyway and that code is widely tested. I'm thinking of having a Python wrapper function which detects the CPU features, then tries to import the relevant processor-specific .so, working through the candidates best-first until one imports. If no processor-specific module is found, it falls back to the default one.
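A minimal sketch of that wrapper, assuming a hypothetical get_cpu_features() helper wrapping the SDL detection, and the illustrative module name "transform":

```python
# Sketch: import the best available processor-specific build of a
# module, falling back to the generic build.
def import_best(base_name, features):
    preference = ["sse2", "sse", "3dnow", "mmx"]  # most capable first
    for suffix in preference:
        if suffix in features:
            try:
                return __import__(base_name + "_" + suffix)
            except ImportError:
                pass  # that variant wasn't shipped; try the next one
    return __import__(base_name)  # the generic build always exists

# Usage (get_cpu_features is assumed, see above):
# transform = import_best("transform", get_cpu_features())
```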
Another issue to think about is whether specific CPU instruction sets give enough of a performance boost to be worth shipping. If Pentium MMX code is almost as fast as, or faster than, P4 SSE2 code, then I might as well not include the P4 SSE2 version of the module. This will require profiling to figure out.

So more profiling of pygame is needed to get better results: for example, a script which loads lots of JPG and PNG files, something which blits lots of stuff, and so on. Each of these profiling tests should output timing data in a standard way, so they can be run automatically and submit their data. I think I'll set up a web page to collect this data, so people who are able to help can choose to submit results from their machine.
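For example, a blit benchmark along these lines; the tab-separated output format is something I've made up here for illustration:

```python
# Sketch: a blit benchmark that prints its result as one standard,
# machine-parseable line so results can be collected centrally.
import time
import platform
import pygame

def bench_blit(n=10000):
    pygame.init()
    screen = pygame.display.set_mode((640, 480))
    sprite = pygame.Surface((64, 64))
    sprite.fill((255, 0, 0))
    start = time.time()
    for i in range(n):
        screen.blit(sprite, (i % 576, i % 416))
    elapsed = time.time() - start
    pygame.quit()
    return n / elapsed

if __name__ == "__main__":
    # One line per test: test name, metric, machine tag.
    print("blit\t%.1f blits/sec\t%s" % (bench_blit(), platform.machine()))
```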
Also needed is better automated test coverage of pygame, so I can verify that the recompiled modules run correctly. This is especially needed since Python uses experimental compilation flags on some platforms, i.e. gcc -O3, which includes some optimizations known to be potentially buggy. I have come across situations where Python extension modules miscompiled because of this. Also, some of the processor-specific optimizations are not as widely used or tested.
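Even a simple smoke test over each variant would catch a miscompiled module. Here's a sketch, with a hypothetical "transform" module and checksum() function standing in for real pygame code:

```python
# Sketch: smoke-test every compiled variant of a module by importing
# it and checking one deterministic operation against the generic build.
import unittest

SAMPLE = list(range(256))

class VariantSmokeTest(unittest.TestCase):
    def test_variants_agree_with_generic(self):
        generic = __import__("transform")
        for suffix in ("mmx", "sse", "3dnow"):
            try:
                variant = __import__("transform_" + suffix)
            except ImportError:
                continue  # this variant wasn't built for this machine
            # A deterministic function must return identical results no
            # matter which instruction set computed them.
            self.assertEqual(variant.checksum(SAMPLE),
                             generic.checksum(SAMPLE))

if __name__ == "__main__":
    unittest.main()
```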
Another downside is that distutils changes between Python versions, so any 'monkey patches' to it need to be retested with each new Python version. This needs to be done for Windows pygame anyway, since it already patches distutils for the MinGW/MSYS compilation environment. Eventually, once this technique is perfected, the patches will hopefully make it back into Python's distutils, meaning less work for each Python release. However, since pygame needs to be recompiled and tested for new Python versions anyway, this isn't a major issue.

The funny thing is, though... I'm not even sure the Python distribution on Windows works with older CPUs, since it is compiled against the newer C runtime library. This means it won't run on some older computers unless they get the C runtime. That's with py2exe versions of the programs, anyway.

This is just going to be for x86 Windows/*nix/*BSD machines using GCC for now. Not for Macs, because I don't have one to test with, and the binary situation there is already weird enough. I also won't use the Intel, Microsoft, or VectorC compilers to start with, even though they are better at some optimizations than GCC. That's an exercise I'll leave until later.

Hopefully this will let pygame users worry less about optimization. Or at least they'll be able to put more sprites on the screen before worrying ;) It will also make many existing games run faster, or use fewer resources whilst running.