Experiments with new low latency PyPy garbage collector in a thread.

TLDR; less performance spikes, faster average performance, PyPy is a good Python for games and multimedia apps on desktop systems, running GC in a thread is a good idea for GIL releasing apps.

In March 2018 at the PyPy winter sprint I spent some days working on improving pygame support on PyPy. Part of this was investigating pypy frame drop. In there I found PyPy had both lower average performance in this pygame-using-CPython-API benchmark, and had spikes which caused frame drops. Remember that in latency sensitive applications performance spikes are not allowed. Note that not very much of this code is pure python (where PyPy is fastest), so it's almost a worst case benchmark for PyPy performance.

Thanks to Antonio, the PyPy team and their supporters things have much improved since then. One improvement in particular is that PyPy now allows more control of when the Garbage Collection does its work. Read more about it in the blog post "PyPy for low latency systems".

In this post I'm going to compare my experiments with pygame with the experiments done last year. Additionally I'm going to try a new thing to improve performance which wasn't possible last year.

Running garbage collection in a thread.

Now that we can call garbage collection when we want, how about we do it when we release the GIL and do some other processing in C? PyPy doesn't run the garbage collector in a thread by default, but we can do that ourselves.

Remember that python has a GIL that can be released when you want to do IO or CPU intensive work. Here is an example program which has a thread waiting to be told when to run the incremental garbage collection step.

# Run pypy GC in background.
from Queue import Queue
from threading import Thread
import pygame

def gc_collect_step(q):
    """ this calls the gc in a thread.
    """
    while True:
        print '5) GC THREAD: Waiting to do gc.'
        ready_to_go = q.get()
        print '4) GC THREAD: calling gc_collect_step'
        if hasattr(gc, "collect_step"):
            gc.collect_step()
        q.task_done()

gc_queue = Queue()
worker = Thread(target=gc_collect_step, args=(gc_queue,))
worker.setDaemon(True)
worker.start()

pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()
while True:
    print('1) at top of loop')
    pygame.event.get()
    gc_queue.put(True)

    print('2) tell GC thread to collect')
    print('3) flipping display. GIL released')
    pygame.display.flip()
    print ('6) finished flipping. GIL grabbed')
    clock.tick()
print(clock.get_fps())


This is how the program executes. Notice how it does display.flip and waits on the OS to do VSYNC at some refresh rate (usually 60FPS). In this spot many games and multimedia apps have some spare time to do other work.

Waiting for VSYNC is the perfect spot to do GC!

Trace of program execution when doing GC in background thread.
1) at top of loop
2) tell GC thread to collect
3) flipping display. GIL released
4) GC THREAD: calling gc_collect_step
5) GC THREAD: Waiting to do gc.
6) finished flipping. GIL grabbed
1) at top of loop
2) tell GC thread to collect
3) flipping display. GIL released
4) GC THREAD: calling gc_collect_step
5) GC THREAD: Waiting to do gc.
6) finished flipping. GIL grabbed
1) at top of loop
2) tell GC thread to collect
3) flipping display. GIL released
4) GC THREAD: calling gc_collect_step
5) GC THREAD: Waiting to do gc.
6) finished flipping. GIL grabbed
On my bench-marking setup this gives an increase of 11% frames per second on average. From 42 FPS to 47 FPS on average. Note, for this benchmark python2.7 gets 51 FPS on average.

Running garbage collection in a thread should be able to improve the performance on pypy of many apps which wait on the GIL.


Our performance goal.


Our performance goal is very little variance between frames and no spikes above 0.013. 0.013 is about the amount of time we have to do our work before we ask the frame to be flipped at 60 times per second.

Python 2.7 baseline measurement.


So we have something to compare against, here is a measurement of Python 2.7.

Note various spikes in performance, 51 FPS on average with many frames falling in the 0.005 to 0.007 range (below the required 0.013). The the funny spike at the start. I'm not sure why this is happening now compared to when I measured last year. The underlying OS has changed, or perhaps some other software is running on my computer.

Remember: Our performance goal is very little variance between frames and no spikes above 0.013. 0.013 is about the amount of time we have to do our work before we ask the frame to be flipped.

Python 2.7 isn't that bad here. It doesn't spike over 0.013 but it does have quite a lot of variance in some frames.


PyPy 6.0.0 (last release, does not include new GC stuff).

What about PyPy 6.0.0 which came out some months after my experiments last year. Here we see 0.005 to 0.010 with some spikes of around 0.0125 and higher. This older version of PyPy fails the performance goal.

PyPy 6.0.0 gc.disable().


What if we just disable the garbage collector entirely? (Apart from memory growing until the program starts swapping or crashes). After a few seconds it just starts to slow down a lot.


Time per frame on PyPy nightly (6.1.0 alpha0). gc.disable(), gc.collect_step().


This is using gc.disable() at the top of the main loop, and calling the new gc.collect_step() just before display.flip when most of the work has been done already. Remember that display.flip releases the GIL and waits for VSYNC at 60FPS.

Here we see the average FPS at around 42 with times mostly between 0.006 and 0.008. At the beginning and end there are some spikes. But mostly things stay below the magical 0.015 seconds required to get below 60 FPS. Here is another measurement.

With this second measurement we see that something is going on at the start and at the end again.  Again, mostly there are few spikes above 0.015 and things mostly stay between 0.006 and 0.008. This capture does show one spike, but I found this hard to reproduce.

This is a great improvement over CPython in all except average FPS. The timings in the graphs are the work that is done before display.flip is called.

PyPy nightly (6.1.0 alpha) GC in a thread, gc.disable().


Finally we have the GC running in a thread. This gives us 47.5 FPS on average. Most times stay between 0.6 and 0.8, however now there is a large number of spikes. None of them go above 0.015, but quite a few go above our goal of 0.013.

Why are these spikes here? I'm not sure. It needs to be investigated some more.
I ran the benchmark a few more times, and it seemed to spike less.




So then I switched to using threading.Event instead of threading.Queue, and there seemed to be less spikes over multiple runs. Probably using a lower level synchronization primitive (like threading.Lock) could improve things more. Also using a lower level primitive at C level would probably work even better. Additionally I'm not using the pypy36 which has some (hopefully) improved threading.Event and threading.Queue implementations.



Here's the example using Event instead of Queue.

from threading import Event, Thread, Lock
import pygame

def gc_collect_step(event):
    """ this calls the gc in a thread.
    """
    while True:
        print '5) GC THREAD: Waiting to do gc.'
        event.wait()
        print '4) GC THREAD: calling gc_collect_step'
        if hasattr(gc, "collect_step"):
            gc.collect_step()
        event.clear()

gc_event = Event()
worker = Thread(target=gc_collect_step, args=(gc_event,))
worker.setDaemon(True)
worker.start()


pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()
while True:
    print('1) at top of loop')
    pygame.event.get()
    gc_event.set()

    print('2) tell GC thread to collect')
    print('3) flipping display. GIL released')
    pygame.display.flip()
    print ('6) finished flipping. GIL grabbed')
    clock.tick()
print(clock.get_fps())


Also, you can find the pypy-bench-testsprite.py in a gist. It needs to be run within the pygame examples folder.

PyPy meets performance goals for games

In conclusion:
/Happy dance/
The results that Antonio posted in "PyPy for low latency systems" also appear in the experiments I made. Worst case performance can be increased, and the performance spikes can be removed.

PyPy meets the performance goals for games now if you disable the gc at the top of the main loop, and call the gc collection every frame. The spikes that were there before are gone. The variance in performance is actually less than CPython.

Calling GC in a thread is a viable option which increases the average FPS very close to CPython levels. However some more work needs to be done to try and find a way that avoids the increased performance jitter. Note, that this jitter is still at acceptable levels.

Running the GC in a thread can probably be used for a peak performance increase by many apps  which release the GIL for IO or CPU intensive operations.

I'd like to remind people that this benchmark is a worst case for PyPy and games. In that there is also CPython API used rather than CFFI(fasted ffi on PyPy), and the benchmark contained not much pure python code. If you were also doing physics, collision detection and other stuff in pure python PyPy would pull ahead. Also this was using pygame with SDL1 so that I could sort-of compare performance to the previous time. Pygame 2 with SDL2 runs this benchmark much faster because of hardware acceleration. Another reminder that PyPy 6.1.0 isn't released yet, so the results will differ compared to the nightly version.

The combination of PyPy's JIT with the Cython ahead of time compiler makes a formidable combination. I'm looking forward to using them all with the hardware acceleration that pygame 2 provides (Vulkan, Metal, OpenGL, Direct3d, ...).

Comments

Popular posts from this blog

Is PostgreSQL good enough?

🐱‍🏍 — the first pygame 2 community game. Starting now! Are you in?