Friday, July 07, 2017

SA battery thoughts

There's this "world's largest battery" project going on in South Australia.

Tesla sold a financial product. The government needed to cover some of the risk of power lines going down, or of something going wrong in another part of the network. It needed to cover this risk quickly, for political reasons.
The chance of another similar storm knocking out the power lines again, which now have bigger maintenance crews, is very small. But if there is another blackout and they didn't do anything? They'd be in big trouble with the newspapers.

So Tesla really sold a risk product, since SA could have spent the money on more generation instead. But not everyone understands that.

The 100MW stage of that wind farm cost $250 million, took some years to reach agreement, and some more years to build. By promising to build the battery much more quickly, they have covered that risk during the window when they need it covered.

They could have installed another wind farm the same size in a different part of the state in order to reduce the risk, and fill up the valleys of power generation. That has been proven to work too, and the benefit is you have more power generation in the peaks.

The blackout was estimated to have cost business $367 million. 12% of businesses had their own backup generators, and about a third of businesses had bought insurance for such situations. Life critical systems are required to have independent backup power.

By the time the battery is built there will probably be a similar amount of solar power installed as the battery (going by current rates of installation). There's 2,034 MW of industrial solar being constructed in Australia for 2017. This doesn't include panels going onto the roofs of houses, of which millions are already covered and more are being done. Solar in Australia can be installed for $5,000 AUD or less for a 5kW system on a house (1.25% of the median house price in SA, or 75% of the average monthly household income in SA). That's $100 million AUD for 100MW across 20,000 homes, installed at retail prices. Since the blackout happened, well over 100MW of solar power has already come online.

The group that runs the grid predicts that by 2023 the entire state could be powered by rooftop solar.

There's also a lead smelter which is being upgraded, so it will have modern equipment that lets it use power more dynamically... effectively making it a battery. It can take in power, or not, as it needs. They can also shut down the power hungry desalination plant if needed (which isn't really needed when there isn't a drought). Additionally, the extra interstate power connector is running again; it was down for upgrades during the storm.

So now they have a backup battery in the works, a backup gas power plant in the works, backup power lines to another state, more efficient industrial power users, and hundreds of thousands of small independent solar power generators.

They've definitely covered their arses.

Friday, March 31, 2017

Data aware jit/blit - drawing 1.25 to 1.45 times faster.

Drawing can be quicker if you know something about the image you are drawing, and use specialised blitters for the parts of the image they are best at.

A good example is an image where 25% of the area is large blocks of either white or black. Using a specialised fill routine to draw those big blocks of colour is much quicker, because there is usually an optimised, hardware accelerated fill routine available.

See all this whitespace? (all the white parts on your screen) These can be drawn really quickly with ASIC fill hardware rather than a slow GPU running a general purpose image blitter.

Another example is this Alien image. The edges of the image are transparent, but the middle has no transparency. Since drawing transparent images is slow, using a different drawing routine for the middle part than for the edges turns out to be faster.

Alien graphic used in Pygame Zero teaching framework documentation.
Here is a proof of concept which draws an image used by pygame zero in 80% of the time it normally takes. That is about 1.25 times quicker.

Alien sectioned up, drawn with 5 different blitters, each perfect for the section.

The results vary dramatically depending on the image itself. But the 1.25 times faster is fairly representative of transparent images whose middle parts aren't transparent. If it finds sections where the image is a plain colour, it can be 1.42 times faster. Or more. Larger images give you different results, as does different hardware. Obviously a platform with fast path hardware accelerated image fills, or fast 16 bit image rendering but slow 32 bit alpha transparency, is going to get a lot bigger speedups with this technique.

Further work is to develop a range of image classifiers for common situations like this, which return custom blitters depending on the image data, and the hardware which it is running on.
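A rough sketch of the idea in pygame-level Python (illustrative only, not the proof of concept code - the band size, helper names, and the use of convert() are my own assumptions, and it assumes a display has already been set up):

import pygame

def band_needs_alpha(image, band):
    """True if any pixel inside the band rect is not fully opaque."""
    for y in range(band.top, band.bottom):
        for x in range(band.left, band.right):
            if image.get_at((x, y)).a != 255:
                return True
    return False

def classify_bands(image, band_height=8):
    """Split the image into horizontal bands, noting which ones need alpha blitting."""
    width, height = image.get_size()
    for top in range(0, height, band_height):
        band = pygame.Rect(0, top, width, min(band_height, height - top))
        yield band, band_needs_alpha(image, band)

def draw_sectioned(screen, image, dest, bands):
    """Draw each band with the cheapest blit that is still correct."""
    opaque = image.convert()  # drops the alpha channel, so it takes the fast path
    for band, needs_alpha in bands:
        source = image if needs_alpha else opaque
        screen.blit(source, (dest[0] + band.x, dest[1] + band.y), band)

The classification would be done once when the image is loaded, not every frame.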

(this is one of several techniques I'm working on for drawing things more quickly on slow computers)

Thursday, March 30, 2017

Four new pygame things for slow computers.

There are four things I'd like to work on for faster pygame apps on slower computers (like the Raspberry Pi/Orange Pi).
  • Dirty rect optimizations, to speed up naive drawing.
  • SDL2, for better hardware support.
  • C/Cython versions of pygame.sprite
  • Surface.blits, a batching API for drawing many images at once.
The general idea with all of these is to take techniques which have been proven to improve the performance of real apps on slower computers. There are still open questions for some of them, but I think they are worth pursuing. More advanced techniques can already be used to work around these issues, but things should be fast even when people do them the easy way on slower computers.

Anyway... this is a summary of the discussions and research on what to do, and a declaration of intent.

Dirty Rect optimizations.

First a couple of definitions. "Dirty Rect" is a technique in graphics where you only update the parts that changed (the dirty parts), and rect is a rectangle encompassing the area that is drawn.

We already have plenty of code for doing this in pretty fantastic ways. LayeredDirty is a particularly good example, which includes overlapping layers and automatically decides on a rendering technique based on what sort of things you're drawing. However, when people just update the whole screen, like in pygame zero (the project which has a mission to make things simple for newbies), then there can be performance issues. This change is aimed at making those apps work more quickly without them needing to do anything extra themselves.

So, on to the technique...

Because rectangles can overlap, it's possible to reduce the amount of drawing done when using them for dirty rect updating. If two rectangles overlap, then we don't need to draw the overlapping area twice. Normally the dirty rect update algorithm just makes a bigger rectangle which contains the two smaller rectangles. But that's not optimal.

But, as with all overdraw reduction algorithms, can it be done fast enough to be worth it?

Here's an article on the topic:

jmm0 made some code here:

DR0ID, also did some with tests and faster code...

So far DR0ID says it's not really fast enough to be worth it in python. However, there are opportunities to improve it. Perhaps a more optimal algorithm, or one which uses C or Cython.

"worst case szenario with 2000 rects it takes ~0.31299996376 seconds"
If it's 20x faster in C, then that gets down to about 0.016 seconds. Still too slow perhaps, but maybe not.

For our use case, there might only be 2-10 things drawn, which is way under that worst case scenario. Additionally, we can use some heuristics to turn it off when it is not worth doing - like when the whole screen is going to be updated anyway.

Below is one of the cases that benefits the most.
Currently the update method combines two rects like this:
|     |
|   __|____
|___|_|   |
    |     |
    |     |

Into a big rect like this:
|     ####|
|     ####|
|   XX    | # - updated, but no need.
|###      | X - over drawn.
|###      |

Note in these crude diagrams the # area is drawn even though it doesn't need to be, and the XX is overdraw. When there are some large overlapping rects like this, that can be quite a saving of pixels which don't need to be drawn. What we want instead is three rects which give us no overdraw, and don't needlessly update pixels.

[crude diagram: the same two rects split into three non overlapping rects - the first rect kept whole, plus two rects covering the parts of the second rect which stick out]

This can easily save having to draw millions of pixels. But can it be done fast enough, with the right heuristics to be useful?
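As a sketch of what that could look like (illustrative Python using pygame.Rect, not the proposed implementation), here is one way to split two overlapping rects into non overlapping pieces:

import pygame

def split_dirty(a, b):
    """Return rects covering a and b with no overlap between them."""
    overlap = a.clip(b)
    if overlap.width == 0 or overlap.height == 0:
        return [a, b]  # no overlap, nothing to split
    pieces = [a]
    # Keep only the parts of b which stick out above, below, left and right of a.
    if b.top < a.top:
        pieces.append(pygame.Rect(b.left, b.top, b.width, a.top - b.top))
    if b.bottom > a.bottom:
        pieces.append(pygame.Rect(b.left, a.bottom, b.width, b.bottom - a.bottom))
    top = max(a.top, b.top)
    height = min(a.bottom, b.bottom) - top
    if b.left < a.left:
        pieces.append(pygame.Rect(b.left, top, a.left - b.left, height))
    if b.right > a.right:
        pieces.append(pygame.Rect(a.right, top, b.right - a.right, height))
    return pieces

Whether doing this for every pair of dirty rects is a net win is exactly the open performance question above.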

For a lot of people, and apps, this won't be useful. But I hope it will for those trying to draw things for the first time on slow computers.

There's some more discussion on the pygame mailing list about it, drawbacks, and links to old school Apple region rendering techniques.

SDL 2 hardware support

Version 2 of SDL contains hardware support which makes it faster on some platforms (for some types of apps). This includes the Raspberry PI.

There are actually a few different modules that already have implemented a pygame API subset using SDL2. Which shows me compatibility is important to some.

The approach I want to try going forward is to use a single source SDL1, and SDL2 code base with a compile time option. (like we have single source py2 and py3). There are new SDL2 APIs for hardware acceleration, which can be added later in a way which fits in nicely with our existing API.

Lenard Lindstrom has made patches for pygame using SDL2 available here for some years:

The first step is to do an ifdef build for linux, and then do some more testing to confirm areas of compatibility and that the approach is ok.

Additionally we agreed that using Cython is a good idea.

There's quite a long discussion on the pygame mailing list, with more to it than this. But this is the general idea.

C/Cython versions of pygame.sprite

Sprites are high level objects which represent game objects. Graphics with logic.

I've already mentioned things like LayeredDirty, which does all sorts of background optimizations and scene management for you.

A lot of this code is in python, and could do with some speed ups. Things like collision detection, and drawing lots of objects are not always bottlenecked by blitting. We've known this ever since the psyco python jit came out. There are other techniques to work around these things, like using something like pymunk for physics, or using a tile based renderer, or using a fast particle renderer. However people still use sprites, for ease of use mainly.

So the plan is to compile them with Pyrex...  err... I mean Cython, and see what sort of improvements we can get that way. I expect naive collision detection, and drawing, will get speed ups.

Surface.blits, a batching API for drawing.

Another area that has been proven to speed up games is by creating a batching API for drawing images. Rather than draw each one at a time, draw them all at once. This way you can avoid the overhead of repeatedly locking the display surface, and you can avoid lots of expensive python function calls and memory allocations.

The proposed names over the years have included blit_list, blits, blit_mult, blit_many...

I feel we should call it "Surface.blits". To go with drawline/drawlines.
It would take a sequence of tuples which match up with the blit arguments.

The original Surface.blit API.

  blit(source, dest, area=None, special_flags = 0) -> Rect

The new blits API.
    blits(args) -> [rects]
    args = [(source: Surface, dest: Rect, area: Rect = None, special_flags: int = 0), ...]
    Draws a sequence of Surfaces onto this Surface...

    >>> surf.blits([(source, dest),
    ...             (source, dest),
    ...             (source, dest, area),
    ...             (source, dest, area, BLEND_ADD)])
    [Rect(), Rect(), Rect(), Rect()]

One potential option...
  • Have a return_rects=False argument, where if you pass it, then it can return None instead of a list of rects. This way you avoid allocating a list, and all the rects inside it. I'll benchmark this to see if it's worth it -- but I have a feeling all those allocations will be significant. But some people don't track updates, so allocating the rects is not worth it for them. eg. the implementation from Leif doesn't return rects.

It can handle these use cases:
  • blit many different surfaces to one surface (like the screen)
  • blit one surface many times to one surface.
  • when you don't care about rects, it doesn't allocate them.
  • when you do care about update tracking, it can track them.
It can *not* handle (but would anyone care?):
  • blit many surfaces, to many other surfaces.
Areas not included in the scope of this:
  • This could be used by two sprite groups quite easily (Group, RenderUpdates). But I think it's worth trying to compile the sprite groups with Cython instead, as a separate piece of work.
  • Multi processing. It should be possible to use this API to build a multi process blitter. However, this is not addressed in this work. The Surface we are blitting onto could be split up into numberOfCore tiles, and rendered that way. This is classic tile rendering, and nothing in this API stops an implementation of this later.
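To make the shape of the API concrete, here is a small sketch of how a render loop might use it compared to blitting one at a time (Surface.blits here is the proposed API above; `sprites` is an assumed list of (surface, position) pairs):

def draw_one_at_a_time(screen, sprites):
    dirty = []
    for surface, position in sprites:
        dirty.append(screen.blit(surface, position))
    return dirty

def draw_batched(screen, sprites):
    # One call: the destination surface is locked once, and there are far
    # fewer Python level function calls and allocations.
    return screen.blits([(surface, position) for surface, position in sprites])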

There's an example implementation by Leif Theden here:

There is always a trade off between making the API fast, and simple enough to be used in lots of different use cases. I feel this API does that. But more benchmarking and research needs to be done first.

There's some discussion of the blits proposal on the pygame mailing list.

Other performance related things.

There was also some discussion of using the quad core CPUs on the newer Raspberry Pis, which are faster than the video hardware on there for some tasks. Unfortunately that won't help the millions of older ones, and even some newer ones like the new zero model. I've seen those CPUs outperform the hardware jpeg decoding, and also do well when used for image classification. However, like taking advantage of METH_FASTCALL in python 3.6, that will have to wait for another time. Separately, there's also work going on by other people on other optimization topics. An interesting one is the libjit based graphics blit accelerator.

Thursday, March 23, 2017

pip is broken


Since asking people to use pip to install things, I get a lot of feedback on pip not working. Feedback like this.

"Our fun packaging Jargon"

What is a pip? What's it for? It's not built into python?  It's the almost-default and almost-standard tool for installing python code. Pip almost works a lot of the time. You install things from pypi. I should download pypy? No, pee why, pee eye. The cheeseshop. You're weird. Just call it pee why pee eye. But why is it called pip? I don't know.

"Feedback like this."

pip is broken on Raspbian

pip3 doesn't exist on windows

People have an old pip. Old pip doesn't support wheels. What are wheels? It's a cute bit of jargon to mean a zip file with python code in it structured in a nice way. I heard about eggs... tell me about eggs? Well, eggs are another zip file with python code in it. Used mainly by easy_install. Easy install? Let's use that, this is all too much.

The pip executable or script is for python 2, and they are using python 3.

pip is for a system python, and they have another python installed. How did they install that python? Which of the several pythons did they install? Maybe if they install another python it will work this time.

It's not working one time and they think that sudo will fix things. And now certain files can't be updated without sudo. However, now they have forgotten that sudo exists.

"pip lets you run it with sudo, without warning."

pip doesn't tell them which python it is installing for. But I installed it! Yes you did. But which version of python, and into which virtualenv? Let's use these cryptic commands to try and find out...

pip doesn't install things atomically, so if there is a failed install, things break. If pip was a database (it is)...

Virtual environments work if you use python -m venv, but not virtualenv. Or sometimes it's the other way around. They work if you have the right packages installed on Debian and Ubuntu... because they don't install virtualenv by default.

What do you mean I can't rename my virtualenv folder? I can't move it to another place on my Desktop?

pip installs things into global places by default.

"Globals by default."

Why are packages still installed globally by default?

"So what works currently most of the time?"

python3 -m venv anenv
. ./anenv/bin/activate
pip install pip --upgrade
pip install pygame

This is not ideal. It doesn't work on windows. It doesn't work on Ubuntu. It makes some text editors crash (because virtualenvs have so many files they get sick). It confuses test discovery (because for some reason they don't know about virtual environments still and try to test random packages you have installed). You have to know about virtualenv, about pip, about running things with modules, about environment variables, and system paths. You have to know that at the beginning. Before you know anything at all.

Is there even one set of instructions where people can have a new environment, and install something? Install something in a way that it might not break their other applications? In a way which won't cause them harm? Please let me know the magic words?

I just tell people `pip install pygame`. Even though I know it doesn't work. And can't work. By design. I tell them to do that, because it's probably the best we got. And pip keeps getting better. And one day it will be even better.

Help? Let's fix this.

Tuesday, March 14, 2017

Comments on community

Some notes about the current state of comments, and thoughts about future plans are below.

0) Spam.

So far there hasn't been comment spam on the new comment system (yet!)... but eventually some will get through the different layers of defense. Those layers are: a web app firewall (through Cloudflare, which helps block bots and abusive proxy servers), required user signups, limits on the number of accounts and comments that can be created per hour, making spam useless for SEO (nofollow on links), and then a spam classifier.

The spam classifier is pretty basic. It only uses the message text so far, and not other features like 'how old is the poster account', or 'does the user account have any other accounts linked'. Despite that, and having only been trained on a few thousand messages, classification seems to work pretty well. It scores 0.97 for ham, and 0.76 for spam when cross validated on the test set.
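For illustration, a message-text-only classifier along these lines can be put together in a few lines with scikit-learn (a hedged sketch, not the actual code the website uses, and the example messages are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["check out my pygame project", "BUY CHEAP WATCHES NOW"]
labels = ["ham", "spam"]

# Bag of words features from the message text, then a linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(messages, labels)
print(classifier.predict(["win free money today"]))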

It's sort of weird having source code available for a website, and also trying to ward off spammers. Because if they looked, they can see exactly the measures taken to prevent spam. People who are dedicated to it will be able to easily spam, but casual and automated spam should be able to be stopped.

We used to have a 'grey listing' style account signup, where people could only sign up with a secret link. Whilst this worked ok, it also made it quite challenging. You needed to reach out to the community, or know someone who was already in it. This really did reduce the amount of spam though.

Disqus (a service) commenting was removed, and comments were imported from there. This was because they added advertising without getting consent (which I received a lot of complaints about). Additionally, we didn't have much control over managing the comments in a way which suited our community (more on this below).
Gravatars are being used for avatars. There's no profile image for the website itself.

What's left to do with comments...

1) Doc comments

The old "doccomments" need to be moved into the new comment system. This is because the documentation lives in static files and is not produced by our website. Additionally, you can have multiple comments on a single page. Then they need quite some moderation for spam and abuse.

2) Better moderator tools

Adding spam/unspam links for moderators to quickly classify something. Also a list of recent comments that need moderation. The current system for this is really quite clunky.
The aim is to really reduce the work needed to be done by moderators.
Moderating internet comments is soul destroying work... so let's make robots do it.


3) Optionally disabling comments on projects

After some discussions I've decided to add an option for projects to disable comments. This way people don't have to deal with unwanted silly criticism if they don't want to. So, if someone is ready to get feedback for a project they can turn on comments. If they just want to show people what they've done (and perhaps get feedback from their own circle of friends) then they can leave the comments off. There have been a number of people who got some really weird demands, abuse, and other unsavoury comment behaviour... and just quit their projects. eg. this is one project which quit. Another person removed more than 20 projects after getting some nasty comments from anonymous strangers on the internet.

4) Comments only from other makers

Additionally, I think it might be a good idea to only allow people who have posted a project to post comments on other peoples projects. This will stop the drive by trolls, and make comments more a discussion amongst peers. Gathering useful feedback, and having constructive criticism is a great thing. I guess this will be an optional thing as well, since whilst feedback from peers is often of a higher quality, hearing from others is also very useful.
[comments allowed options]
   - no comments on this project
   - comments from other project owners only
   - comments from everyone with an account.

This will also be a nice signal that the pygame community is about making things, and that we place importance on making things.
"If you want to comment on this project, you first have to share a project of your own".

5) Reactions, ratings, awards, and stars/favourites

The ludumdare, and pyweek systems have multiple ratings for different aspects of a project. They ask for feedback on particular things. Sound, fun, innovation, production... etc. So I'd like to store those for comments.

Again, this will be optional for projects. Each will have a [Seek feedback.] option.
Feedback like this will make giving more useful comments easier.

Additionally, a 'didn't work for me' option people can click lets them provide that feedback easily without polluting the comments too much.
[didn't work for me] [on which OS][stack trace]
Whilst pointing out defects is useful, it can also tend towards annoying nitpicking and turn into unwanted bikeshed arguments. It also can get in the way of more long form thoughtful discussion.
Awards are fun too(as used in ludumdare/pyweek), like "best duck main character".

Favourites, and stars are useful for keeping a list of ones you personally are interested in. They're useful for following projects. Also for "which projects do other people like".

6) Social auth logins

I've also added fields to projects and user profiles for linking up your twitter username, your bitbucket, and github urls. It's useful to know github/bitbucket links for projects. This allows downloading change information from there, and even releases. Much like how the pygame community dashboard brings information in from dozens of different social platforms, I want to allow projects to have that too.

More importantly, people who want to form teams or work on their projects with others will be able to ask for contributors (or even know where to find the project!)

Allowing people to just enter their github/bitbucket/twitter/etc user names means they don't need to link their accounts for signup. However, letting people use these will allow people to join more quickly. For those truly too lazy to enter in an email address ;)

7) Putting the Python Code of Conduct in front

Putting the Python Code of Conduct in front is another conscious decision. Which in short says to respect each other, and don't be mean. It says the whole python community, along with the pygame community expects to be able to participate in a friendly constructive manner. So it's right there on the front page.
"Leave a thoughtful comment"
The messaging, and branding also tries to suggest people to be thoughtful. Rather than have a "submit comment" button we have a "leave thoughtful comment" button. It's a little thing, but hopefully it signals to people that they should play nice.

Multi coloured branding

8) How to write good criticism?

I'd like to be able to point people to articles on how to do good criticism of both software, and of arts projects. What makes good feedback? What makes a good review?

Is the purpose of a review to nitpick? Is it to help energize people, to recognize people for their work?

Articles like On Giving Feedback I want to link to.
"When it was my work being critiqued, it made me excited to push my design and thinking forward."
I'd like to point out quality reviews as good examples. Writing reviews is an art form in itself. My time writing arts reviews really helped me when working in creative fields, as much as receiving reviews did. It really is a different thing to review a creative piece, compared to reviewing a purely functional piece.

Do you know any good articles on feedback and review we should share?

Monday, March 06, 2017

Pixel perfect collision detection in pygame with masks.

"BULLSHIT! That bullet didn't even hit me!" they cried as the space ship starts to play the destruction animation, and Player 1 life counter drops by one. Similar cries of BULLSHIT! are heard all over the world as thousands of people lose an imaginary life to imperfect collision detection every day.

Do you want random people on the internet to cry bullshit at your game? Well do ya punk?

Bounding boxes are used by many games to detect if two things collide. Either a rectangle, a circle, a box or a sphere is used as a crude way to check if two things collide. However for many games that just isn't enough. Players can see that something didn't collide, so they are going to be crying foul if you just use bounding boxes.

Pygame added fast and easy pixel perfect collision detection. So no more bullshit collisions ok?

Code to go along with this article can be found here ( ).

Why rectangles aren't good enough.

Here are some screen shots of a little balloon game I made, modeled after an old Commodore 64 game I typed in when I was eight. Here you can see a balloon, and a cave. The idea is you have to move the balloon through the cave without hitting the walls. Now if you used just bounding rectangle collisions, you can see how it would not work, and how the game would be no fun - because the rectangle (drawn in green around the balloon) would hit the sides when the balloon didn't really hit the sides.

You can download the balloon mini game code to have a look at with this article.

How is pixel perfect collision detection done? Masks.

Instead of using 8-32 bits per pixel, pygame's masks use only 1 bit per pixel. This makes it very quick to check for collisions, as you can compare 32 pixels with one integer comparison. Masks use bounding box collision first - to speed things up.
Even though bounding boxes are a crude approximation for collisions, they are faster than using bitmasks. So pygame first does a check to see if the rectangles collide - then if the rectangles do collide, only then does it check to see if the pixels collide.

How to use pixel perfect collision detection in pygame?

There are a couple of ways you can use pixel perfect collision detection with pygame.
  • Creating masks from surfaces.
  • Using the pygame.sprite classes.
You can create a mask from any surface with transparency.  So you load up your images normally, and then create the masks for them.
Or you can use the pygame.sprite classes, which handle some of the complexity for you.

Mask.from_surface with Alpha transparency.

By default pygame uses either color keys, or per pixel alpha values to see which parts of an image are converted into the mask.
Color keyed images have pixels which are either 100% transparent or fully visible, whereas per pixel alpha images have 256 levels of transparency. By default pygame treats pixels which are more than 50% opaque as set - the ones that will collide.
It's a good idea to pre-calculate the mask, so you do not need to generate it every frame.
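For example, a minimal sketch of loading an image and building its mask once, up front ("balloon.png" is just a stand-in file name):

import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))

image = pygame.image.load("balloon.png").convert_alpha()
mask = pygame.mask.from_surface(image)  # pixels more than 50% opaque are set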

Checking if one mask overlaps another mask.

It is fairly simple to see if one mask overlaps another mask.
Say we have two masks (a and b), and also a rect for where each of the masks is.

# We calculate the offset of the second mask relative to the first mask.
offset_x = b_rect[0] - a_rect[0]
offset_y = b_rect[1] - a_rect[1]
# See if the two masks at the offset are overlapping.
overlap = a.overlap(b, (offset_x, offset_y))
if overlap:
    print("the two masks overlap!")

Pixel perfect collision detection with pygame.sprite classes.

The pygame.sprite classes are a high level way to display your images.  They provide things like collision detection, layers, groups and lots of other goodies.

Note: the code that comes with this article uses sprites with masks.

If you give your sprites a .mask attribute then they can use the built in collision detection functions that come with pygame.sprite.
class Balloon(pygame.sprite.Sprite):
    def __init__(self):
        pygame.sprite.Sprite.__init__(self)  # call Sprite initializer
        self.image = pygame.image.load("balloon.png")
        self.rect = self.image.get_rect()
        self.mask = pygame.mask.from_surface(self.image)

b1 = Balloon()
b2 = Balloon()

# spritecollide expects a Group as its second argument.
if pygame.sprite.spritecollide(b1, pygame.sprite.Group(b2), False,
                               pygame.sprite.collide_mask):
    print("sprites have collided!")

Collision response - approximate collision normal.

Once two things collide, what happens next?  Maybe one of these things...
  • One of the things blows up, disappears, or does a dying animation.
  • Both things disappear.
  • Both things bounce off each other.
  • One thing bounces, the other thing stays.
If something is going to bounce, and not just disappear, then we need to figure out the direction in which the two masks collided.  This direction of collision we will call a collision normal.
Using just the masks, we can not find the exact collision normal, so we find an approximation.  Often in games we don't need to find an exact solution, just something that looks kind of right.
Using an offset in the x direction, and the y direction, we find the difference in overlapped areas between the two masks.  This gives us the vector (dx, dy), which we use as the collision normal.
If you understand vector maths, you can add this normal to the velocity of the first moving object, and subtract it from the other moving object.

def collision_normal(left_mask, right_mask, left_pos, right_pos):

    def vadd(x, y):
        return [x[0]+y[0], x[1]+y[1]]

    def vsub(x, y):
        return [x[0]-y[0], x[1]-y[1]]

    def vdot(x, y):
        return x[0]*y[0] + x[1]*y[1]

    # Offset of the right mask relative to the left mask.
    offset = list(map(int, vsub(right_pos, left_pos)))

    overlap = left_mask.overlap_area(right_mask, offset)
    if overlap == 0:
        # The masks are not touching.
        return None

    """Calculate collision normal"""
    nx = (left_mask.overlap_area(right_mask, (offset[0] + 1, offset[1])) -
          left_mask.overlap_area(right_mask, (offset[0] - 1, offset[1])))
    ny = (left_mask.overlap_area(right_mask, (offset[0], offset[1] + 1)) -
          left_mask.overlap_area(right_mask, (offset[0], offset[1] - 1)))
    if nx == 0 and ny == 0:
        """One sprite is inside another"""
        return None

    n = [nx, ny]
    return n
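As a small sketch of that approach (v1 and v2 are assumed [x, y] velocities, not code from the article's download):

def apply_collision_normal(v1, v2, normal):
    # Nudge the first object's velocity by the normal, and the second
    # object's velocity the other way. In a real game you would probably
    # scale or normalise the normal first.
    nx, ny = normal
    return [v1[0] + nx, v1[1] + ny], [v2[0] - nx, v2[1] - ny]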

Fun uses for masks.

Here's a few fun ideas that you could implement with masks, and pixel perfect collision detection.
  • A balloon game, where the bit masks are created from nicely drawn levels - which are then turned into bitmasks for pixel perfect collision detection.  No need to worry about slicing the level up, or manually specifying the collision rectangles, just draw the level and create a mask out of it. Here's a screen shot from the balloon code that comes with this article:

  • A platform game where the ground is not made out of platforms, so much as pixels. So you could have curvy ground, or single pixel things the characters could stand on.
  • Mouse cursor hit detection. Turn your mouse cursor into something, and rather than have a single pixel hit, instead have the hit be any pixel under the mouse cursor.
  • "Worms" style exploding terrain.

Friday, February 24, 2017

setup.cfg - a solution to python config file soup? A howto guide.

Sick of config file soup cluttering up your repo? Me too. However there is a way to at least clean it up for many python tools.

Some of the tools you might use and the config files they support...
  • flake8 - .flake8, setup.cfg, tox.ini, and config/flake8 on Windows
  • pytest - pytest.ini, tox.ini, setup.cfg
  • coverage - .coveragerc, setup.cfg, tox.ini
  • mypy - setup.cfg, mypy.ini
  • tox - tox.ini
"Can mypy use setup.cfg as well?"
"OK, you've convinced me." -- Guido

With that mypy now also supports setup.cfg, and we can all remove many more config files.

The rules for precedence are easy:
  1. read --config-file option - if it's incorrect, exit
  2. read [tool].ini - if correct, stop
  3. read setup.cfg


How to config with setup.cfg?

Here's a list to the configuration documentation for setup.cfg.

What does a setup.cfg look like now?

Here's an example setup.cfg for you with various tools configured. (note these are nonsensical example configs, not what I suggest you use!)

[coverage:run]
#timid = True

# [tool:pytest]
# addopts=-v --cov pygameweb pygameweb/ tests/

[mypy]
#python_version = 2.7

[flake8]
#max-line-length = 120
#max-complexity = 10
#exclude = build,dist,docs/,somepackage/migrations,*.egg-info

## Run with: pylint --rcfile=setup.cfg somepackage
#disable = C0103,C0111
#ignore = migrations
#ignore-docstrings = yes
#output-format = colorized

Monday, February 20, 2017

Is Type Tracing for Python useful? Some experiments.

Type Tracing - as a program runs you trace it and record the types of variables coming in and out of functions, and being assigned to variables.
Is Type Tracing useful for providing quality benefits, documentation benefits, porting benefits, and also speed benefits to real python programs?

Python is now a gradually typed language, meaning that you can gradually apply types and along with type inference statically check your code is correct. Once you have added types to everything, you can catch quite a lot of errors. For several years I've been using the new type checking tools that have been popping up in the python ecosystem. I've given talks to user groups about them, and also trained people to use them. I think a lot of people are using these tools without even realizing it. They see in their IDE warnings about type issues, and methods are automatically completed for them.

But I've always had some thoughts in the back of my head about recording types at runtime of a program in order to help the type inference out (and to avoid having to annotate them manually yourself).

Note, that this technique is a different, but related thing to what is done in a tracing jit compiler.
Some days ago I decided to try Type Tracing out... and I was quite surprised by the results.

I asked myself these questions.

  • Can I store the types coming in and out of python functions, and the types assigned to variables in order to be useful for other things based on tracing the running of a program? (Yes)
  • Can I "Type Trace" a complex program? (Yes, a flask+sqlalchemy app test suite runs)
  • Is porting python 2 code quicker by Type Tracing combined with static type checking, documentation generation, and test generation? (Yes, refactoring is safer with a type checker and no manually written tests)
  • Can I generate better documentation automatically with Type Tracing? (Yes, return and parameter types and example values helps understanding greatly)
  • Can I use the types for automatic property testing? (Yes, hypothesis does useful testing just knowing some types and a few examples... which we recorded with the tracer)
  • Can I use example capture for tests and docs, as well as the types? (Yes)
  • Can I generate faster compiled code automatically just using the recorded types and Cython (Yes).

Benefits from Type Tracing.

Below I try to show that the following benefits can be obtained by combining Type Tracing with other existing python tools.
  • Automate documentation generation, by providing types to the documentation tool, and by collecting some example inputs and outputs.
  • Automate some type annotation.
  • Automatically find bugs static type checking can not. Without full type inference, existing python static type checkers can not find many issues until the types are fully annotated. Type Tracing can provide those types.
  • Speed up Python2 porting process, by finding issues other tools can't. It can also speed things up by showing people types and example inputs. This can greatly help people understand large programs when documentation is limited.
  • Use for Ahead Of Time (AOT) compilation with Cython.
  • Help property testing tools to find simple bugs without manually setting properties.

Tools used to hack something together.

  • coverage (extended the coverage checker to record types as it goes) 
  • mypy (static type checker for python)
  • Hypothesis (property testing... automated test generator)
  • Cython (a compiler for python code, and code with type annotations)
  • jedi (another python static type checker)
  • Sphinx (automatic documentation generator).
  • Cpython (the original C implementation of python)
More details below on the experiments.

Type Tracing using 'coverage'.

Originally I hacked up a sys.settrace script... and started going. But there really are so many corner cases. Also, I already run the "coverage" tool over the code base I'm working on.

I started with coverage.pytracer.PyTracer, since it's python. Coverage also comes with a faster tracer written in C. So far I'm just using the python one.

The plan later would be to perhaps use CoverageData. Which uses JSON, which means storing the type will be hard sometimes (eg, when they are dynamically generated). However, I think I'm happy to start with easy types. To start simple, I'll just record object types as strings with something like `repr(type(o)) if type(o) is not type else repr(o)`. Well, I'm not sure. So far, I'm happy with hacking everything into my fork of coverage, but to move it into production there is more work to be done. Things like multiprocess, multithreading all need to be handled.
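As a toy illustration of the general idea (a minimal sys.settrace sketch, not the coverage based tracer described above), recording argument types at call time looks roughly like this:

import sys
from collections import defaultdict

observed = defaultdict(set)  # (filename, function name) -> set of argument type tuples

def type_tracer(frame, event, arg):
    if event == "call":
        code = frame.f_code
        names = code.co_varnames[:code.co_argcount]
        types = tuple(type(frame.f_locals[name]).__name__ for name in names)
        observed[(code.co_filename, code.co_name)].add(types)
    return type_tracer

def int_problem(x):
    return x / 4

sys.settrace(type_tracer)
int_problem(8)
sys.settrace(None)
print(dict(observed))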

Porting python 2 code with type tracing.

I first started porting code to python 3 in the betas... around 2007. Including some C API modules. I think I worked on one of the first single code base packages. Since then the tooling has gotten a lot better. Compatibility libraries exist (six), lots of people have figured out the dangerous points and documented them. Forward compatibility features were added into the python2.6 and 2.7, and 3.5 releases to make porting easier. However, it can still be hard.

Especially when Python 2 code bases often don't have many tests. Often zero tests. Also, there may be very little documentation, and the original developers have moved on.

But the code works, and it's been in production for a long time, and gets updates occasionally. Maybe it's not updated as often as it's needed because people are afraid of breaking things.

Steps to port to python 3 are usually these:

  1. Understand the code.
  2. Run the code in production (or on a copy of production data).
  3. With a debugger, look at what is coming in and out of functions.
  4. Write tests for everything.
  5. Write documentation.
  6. Run 2to3.
  7. Do lots of manual QA.
  8. Start refactoring.
  9. Repeat. Repeat manually writing tests, docs, and testing manually. Many times.
Remember that writing tests is usually harder than writing the code in the first place.

With type tracing helping to generate docs, types for the type checker, examples for human reading plus for the hypothesis property checker we get a lot more tools to help ensure quality.

A new way to port python2 code could be something like...
  1. Run program under Type Tracing, line/branch coverage, and example capture.
  2. Look at generated types, example inputs and outputs.
  3. Look at generated documentation.
  4. Gradually add type checking info with help of Type Tracing recorded types.
  5. Generate tests automatically with Type Tracing types, examples, and hypothesis automated property testing. Generate empty test stubs for things you still need to test.
  6. Once each module is fully typed, you can statically type check it.
  7. You can cross validate your type checked python code against your original code. Under the Type Tracer.
  8. Refactoring is easier with better docs, static type checks, tests, types for arguments and return values, and example inputs and outputs.
  9. Everything should be ported to work with the new forwards compatibility functionality in python2.7.
  10. Now with your various quality checks in place, you can start porting to python3. Note, you might not have needed to change any of the original code - only add types.
I would suggest the effort is about 1/5th of the normal time it takes to port things. Especially if you want to make sure the chance of introducing errors is very low.

Below are a couple of issues where Type Tracing can help over existing tools.

Integer divide issue.

Here I will show an integer division bug which the 2to3 conversion tool does not fix. mypy also does not detect a problem with the code.

def int_problem(x):
    return x / 4

$ python2
$ python3

$ mypy --py2
$ mypy
$ 2to3
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: Refactored
---    (original)
+++    (refactored)
@@ -3,4 +3,4 @@
 def int_problem(x):
     return x / 4

RefactoringTool: Files that need to be modified:

See how when run under python3 it gives a different result?

Can we fix it when Type Tracing adds types?  (Yes)

So, how about if we run the program under type tracing, and record the input types coming in and out? See how it adds a python3 compatible comment about taking an int, and returning an int. This is so that mypy (and other type checkers) can see what it is supposed to take in.
def int_problem(x):
    # type: (int) -> int
    return x / 4
$ mypy
error: Incompatible return value type (got "float", expected "int")
I'm happy that Yes, Type Tracing combined with mypy can detect this issue whereas mypy can not by itself.

Binary or Text file issue?

Another porting issue not caught by existing tools is trying to do the right thing when a python file is in binary mode or in text mode. If in binary, read() will return bytes, otherwise it might return text.

In theory this could be made to work, however at the time of writing there is an open issue with "dependent types" or "Factory Pattern" functions in mypy. For more information on this, and also a work around I wrote, see this issue:

In there I show that you can create your own replacement that always returns one type. eg, open_rw(fname) instead of open(fname, 'rw').

Once you know that .read() will return bytes, then you also know that it can't call .format() in python 3. The solution is to use % string formatting on bytes, which is supported from python3.5 upwards.

with open(fname, 'rb') as f:
    x = f.read()  # type: bytes
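For example (a tiny illustration of the point above, not from the issue linked):

name = b"world"
greeting = b"hello %s" % (name,)  # works on python 3.5+ (and python 2)
# greeting.format(...) would fail on python 3, since bytes have no .format() method.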

So the answer here is that mypy could likely solve this issue by itself in the future (once things are fully type annotated). But for now, it's good to see combining type tracing with mypy could help detect binary and text encoding issues much faster.

Generating Cython code with recorded types.

I wanted to see if this was possible. So I took the simple example from the cython documentation.

I used my type tracer to transform this python:
def f(x):
    return x**2-x

def integrate_f(a, b, N):
    s = 0
    dx = (b-a)/N
    for i in range(N):
        s += f(a+i*dx)
    return s * dx

Before you look below... take a guess what parameters a, b, and N are? Note, how there are no comments. Note how the variable names are single letter. Note, how there are no tests. There are no examples.

In [2]: %timeit integrate_f(10.4, 2.3, 17)
100000 loops, best of 3: 5.12 µs per loop

Into this Cython code with annotated types after running it through Type Tracing:
In [1]: %load_ext Cython

In [2]: %%cython
   ...: cdef double f(double x):
   ...:     return x**2-x
   ...: def integrate_f_c(double a, double b, int N):
   ...:     """
   ...:     :Example:
   ...:     >>> integrate_f_c(10.4, 2.3, 17)
   ...:     -342.34804152249137
   ...:     """
   ...:     cdef int i
   ...:     cdef double s, dx
   ...:     s = 0
   ...:     dx = (b-a)/N
   ...:     for i in range(N):
   ...:         s += f(a+i*dx)
   ...:     return s * dx 

In [3]: %timeit integrate_f_c(10.4, 2.3, 17)

10000000 loops, best of 3: 117 ns per loop
Normal python was 5120 nanoseconds. The cython compiled version is 117 nanoseconds.  The result is 44x faster code, and we have all the types annotated, with an example. This helps you understand it a little better than before too.

This was a great result for me. It shows that yes combining Type Tracing with Cython can give improvements over Cython just by itself. Note, that Cython is not only for speeding up simple numeric code. It's also been used to speed up string based code, database access, network access, and game code.

So far I've made a simple mapping of python types to cython types. To make the code more useful would require quite a bit more effort. However, if you use it as a tool to help you write cython code yourself, then it's very useful to speed up that process.

The best cases so far are when it knows all of the types, all of the types have direct cython mappings, and it avoids calling python functions inside the function. In other words, 'pure' functions.

Cross validation for Cython and python versions?

In a video processing project I worked on there were implementations in C, and other assembly implementations of the same functions. A very simple way of testing is to run all the implementations and compare the results. If the C implementation gives the same results as the assembly implementations, then there's a pretty good chance they are correct.

In [1]:  assert integrate_f_c(10.4, 2.3, 17) == integrate_f(10.4, 2.3, 17)

If we have a test runner, we can check if the inputs and outputs are the same between the compiled code and the non compiled code. That is, cross validate implementations against each other for correctness.

Property testing.

The most popular property testing framework is QuickCheck, from the Haskell world. However, python also has an implementation - Hypothesis. Rather than supplying examples, as is usual with unit testing, you tell it about properties which hold true.

Can we generate a hypothesis test automatically using just types collected with Type Tracing?

Below we can see some unit tests (example based testing), as well as some Hypothesis tests (property testing). They are for a function "always_add_something(x)", which always adds something to the number given in. As a property, we would say that "always_add_something(x) > x".  That property will hold to be true for every value of x given x is an int.

Note, that the program is fully typed, and passes type checking with mypy. Also note that there is 100% test coverage if I remove the divide by zero error I inserted.

from hypothesis import given
import hypothesis.strategies

from bad_logic_issue import always_add_something, always_add_something_good

def test_always_add_something():
    # type: () -> None
    assert always_add_something(5) >= 5
    assert always_add_something(200) >= 200

def test_always_add_something_good():
    # type: () -> None
    assert always_add_something_good(5) >= 5
    assert always_add_something_good(200) >= 200

@given(hypothesis.strategies.integers())
def test_always_add_something_property(x):
    # type: (int) -> None
    assert always_add_something(x) > x

# Here we test the good one.
@given(hypothesis.strategies.integers())
def test_always_add_something_good_property(x):
    # type: (int) -> None
    assert always_add_something_good(x) > x
Here are two implementations of the function. The first one is a contrived example in order to show two types of logic errors that are quite common. Even 30 year old code used by billions of people has been shown to have these errors. They're sort of hard to find with normal testing methods.

def always_add_something(x):
    # type: (int) -> int
    '''Silly function that is supposed to always add something to x.

    But it doesn't always... even though we have
     - 'complete' test coverage.
     - fully typed
    '''
    r = x  # type: int
    if x > 0 and x < 10:
        r += 20
    elif x > 15 and x < 30:
        r //= 0
    elif x > 100:
        r += 30

    return r

def always_add_something_good(x):
    # type: (int) -> int
    '''This one always does add something.'''
    return x + 1

Now, hypothesis can find the errors when you write the property that the return value needs to be greater than the input. What about if we just use the types we record with Type Tracing to give hypothesis a chance to test? Hypothesis comes with a number of test strategies which generate many variations of a type. Eg, there is an "integers" strategy.

# Will it find an error just telling hypothesis that it takes an int as input?
@given(hypothesis.strategies.integers())
def test_always_add_something(x):
    always_add_something(x)
It finds the divide by zero issue (when x is 16). However it does not find the other issue, because it still does not know that there is a problem. We haven't told it anything about the result always needing to be greater than the input.

ZeroDivisionError
-------------------------------------------------------- Hypothesis --------------------------------------------------------
Falsifying example: test_always_add_something(x=16)
The result is that yes, it could find one issue automatically, without having to write any extra test code, just from Trace Typing.

For pure functions, it would be also useful to record some examples for unit test generation.

In conclusion.

I'm happy with the experiment overall. I think it shows it can be a fairly useful technique for making python programs more understandable, faster, and more correct. It can also help speed up porting old python2 code dramatically (especially when that code has limited documentation and tests).

I think the experiment also shows that combining existing python tools (coverage, mypy, Cython, and hypothesis) can give some interesting extra abilities without too much extra effort. eg. I didn't need to write a robust tracing module, I didn't need to write a static type checker, or a python compiler. However, it would take some effort to turn these into robust general purpose tools. Currently what I have is a collection of fragile hacks, without support for many corner cases :)

For now I don't plan to work on this any more in the short term. (Unless of course someone wants to hire me to port some python2 code. Then I'll work on these tools again since it speeds things up quite a lot).

Any corrections or suggestions? Please leave a comment, or see you on twitter @renedudfield

Monday, February 13, 2017

Is PostgreSQL good enough?

tldr; you can do jobs, queues, real time change feeds, time series, object store, document store, full text search with PostgreSQL. How to, pros/cons, rough performance and complexity levels are all discussed. Many sources and relevant documentation is linked to.

Your database is first. But can PostgreSQL be second?

Web/app projects these days often have many distributed parts. It's not uncommon for groups to use the right tool for the job. The right tools are often something like the choice below.
  • Redis for queuing, and caching.
  • Elastic Search for searching, and log stash.
  • Influxdb or RRD for timeseries.
  • S3 for an object store.
  • PostgreSQL for relational data with constraints, and validation via schemas.
  • Celery for job queues.
  • Kafka for a buffer of queues or stream processing.
  • Exception logging with PostgreSQL (perhaps using Sentry)
  • KDB for low latency analytics on your column oriented data.
  • Mongo/ZODB for storing JSON documents (or mangodb as a /dev/null replacement)
  • SQLite for embedded. 
  • Neo4j for graph databases.
  • RethinkDB for your realtime data, when data changes, other parts 'react'.
  • ...
For all the different nodes this could easily cost thousands a month, require lots of ops knowledge and support, and use up lots of electricity. To set all this up from scratch could cost one to four weeks of developer time depending on if they know the various stacks already. Perhaps you'd have ten nodes to support.

Could you gain an ops advantage by using only PostgreSQL? Especially at the beginning when your system isn't all that big, and your team size is small, and your requirements not extreme? Only one system to setup, monitor, backup, install, upgrade, etc.

This article is my humble attempt to help people answer the question...

Is PostgreSQL good enough?

Can it be 'good enough' for all sorts of different use cases? Or do I need to reach into another toolbox?

Every project is different, and often the requirements can be different. So this question by itself is impossible to answer without qualifiers. Many millions of websites and apps in the world have very few users (less than thousands per month), they might need to handle bursty traffic at 100x the normal rate some times. They might need interactive, or soft realtime performance requirements for queries and reports. It's really quite difficult to answer the question conclusively for every use case, and for every set of requirements. I will give some rough numbers and point to case studies, and external benchmarks for each section.

Most websites and apps don't need to handle 10 million visitors a month, or have 99.999% availability when 95% availability will do, ingest 50 million metric rows per day, or do 400,000 jobs per second, or query over TB's of data with sub millisecond response times.

Tool choice.

I've used a LOT of different databases over time. CDB, Elastic Search, Redis, SAP (is it a db or a COBOL?), BSDDB/GDBM, SQLite... Even written some where the requirements were impossible to match with off the shelf systems and we had to make them ourselves (real time computer vision processing of GB/second in from the network). Often PostgreSQL simply couldn't do the job at hand (or mysql was installed already, and the client insisted). But sometimes PostgreSQL was just merely not the best tool for the job.

A Tool Chest
Recently I read a book about tools. Woodworking tools, not programming tools. The whole philosophy of the book is a bit much to convey here... but The Anarchist's Tool Chest is pretty much all about tool choice (it's also a very fine looking book, that smells good too). One lesson it teaches is about when selecting a plane (you know the things for stripping wood). There are dozens of different types perfect for specific situations. There's also some damn good general purpose planes, and if you just select a couple of good ones you can get quite a lot done. Maybe not the best tool for the job, but at least you will have room for them in your tool chest. On the other hand, there are also swiss army knives, and 200 in one tools off teevee adverts. I'm pretty sure PostgreSQL is some combination of a minimal tool choice and the swiss army knife tool choice in the shape of a big blue solid elephant.

PostgreSQL is an elephant sized tool chest that holds a LOT of tools.

Batteries included?

Does PostgreSQL come with all the parts for full usability? Often the parts are built in, though maybe a bit complicated - and not everything is built in. But luckily there are some good libraries which make the features more usable ("for humans").

For from scratch people, I'll link to the PostgreSQL documentation. I'll also link to already made systems which use PostgreSQL for queues, time series, graphs, column stores, and document databases, which you might be able to use for your needs. This article is slanted towards the python stack, but there are definitely alternatives in the node/ruby/perl/java universes. If not, I've listed the PostgreSQL parts and other open source implementations so you can roll your own.

By learning a small number of PostgreSQL commands, it may be possible to use 'good enough' implementations yourself. You might be surprised at what other things you can implement by combining these techniques together. 

Task, or job queues.

Recent versions of PostgreSQL support a couple of useful technologies for efficient and correct queues.

First is LISTEN/NOTIFY. You can LISTEN for events, and have clients be NOTIFY'd when they happen. So your queue workers don't have to keep polling the database all the time; they get notified when things happen.

The recent addition (in 9.5) of the SKIP LOCKED locking clause to PostgreSQL's SELECT enables efficient queues to be written when you have multiple writers and readers. It also means that a queue implementation can be correct [2].

Finally 9.6 saw plenty of VACUUM performance enhancements which help out with queues.
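To make the idea concrete, here is a minimal sketch of a worker (this is not how pq or celery do it internally, and the jobs table, column names, and connection string are all made up) that claims one job at a time with FOR UPDATE SKIP LOCKED, using psycopg2:

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection details

def claim_one_job():
    # Take one unclaimed job, skipping any rows other workers have locked.
    with conn:  # commits (or rolls back) when the block exits
        with conn.cursor() as cur:
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE done = false
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED
            """)
            row = cur.fetchone()
            if row is None:
                return None
            job_id, payload = row
            # ... do the work here, inside the transaction ...
            cur.execute("UPDATE jobs SET done = true WHERE id = %s", (job_id,))
            return job_id

A producer would INSERT a row and NOTIFY a channel after committing, so workers only call claim_one_job() when woken up instead of polling.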

Batteries included?

A very popular job and task system is celery. It can support various SQL backends, including PostgreSQL through sqlalchemy and the Django ORM. [ED: version 4.0 of celery doesn't have pg support]

A newer, and smaller, system is called pq. It sort of models itself on the redis python 'rq' queue API. However, with pq you can have a transactional queue, which is nice if you want to make sure other things are committed AND your job is in the queue. With a separate system this is a bit harder to guarantee.

Is it fast enough? pq states in its documentation that you can do 1000 jobs per second per core... but on my laptop it did around 2000. In the talk "Can elephants queue?" 10,000 messages per second are mentioned with eight clients.

More reading.

Full text search.

Full text search: searching the full text of the document, and not just the metadata.
PostgreSQL has had full text search for quite a long time as a separate extension, and now it is built in. Recently, it's gotten a few improvements which I think now make it "good enough" for many uses.

The big improvement in 9.6 is phrase search. So if I search for "red hammer" I get documents which contain both of those words together - not just things that are red mixed with things that are a hammer. It can also return documents where the word red appears, and then hammer appears five words later.

One other major thing that elastic search does is automatically create indexes on all the fields. You add a document, and then you can search it. That's all you need to do. PostgreSQL is quite a lot more manual than that. You need to tell it which fields to index, and update the index with a trigger on changes (see triggers for automatic updates).  But there are some libraries which make things much easier. One of them is sqlalchemy_searchable. However, I'm not aware of anything as simple and automatic as elastic search here.
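As a rough sketch of that manual route (the documents table and its title and body columns are made-up names), you add a tsvector column, a GIN index, and a trigger to keep the index updated, then use phraseto_tsquery (new in 9.6) for phrase search:

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection details
with conn, conn.cursor() as cur:
    # One-off setup: a tsvector column, a GIN index, and a trigger to keep it fresh.
    cur.execute("""
        ALTER TABLE documents ADD COLUMN IF NOT EXISTS tsv tsvector;
        CREATE INDEX IF NOT EXISTS documents_tsv_idx ON documents USING GIN (tsv);
        DROP TRIGGER IF EXISTS documents_tsv_update ON documents;
        CREATE TRIGGER documents_tsv_update BEFORE INSERT OR UPDATE
            ON documents FOR EACH ROW
            EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
    """)
    # Phrase search: both words, next to each other, in that order.
    cur.execute("""
        SELECT id, title FROM documents
        WHERE tsv @@ phraseto_tsquery('english', 'red hammer')
    """)
    print(cur.fetchall())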
  • What about faceted search? These days it's not so hard to do at speed. [6][7]
  • What about substring search on an index (fast LIKE)? It can be made fast with a trigram index. [8][9]
  • Stemming? Yes. [11]
  • "Did you mean" fuzzy matching support? Yes. [11]
  • Accent support? (My name is René, and that last é breaks sooooo many databases). Yes. [11]
  • Multiple languages? Yes. [11]
  • Regex search when you need it? Yes. [13]
If your main data store is PostgreSQL and you export your data into Elasticsearch (you should NOT use elastic search as the main store, since it still crashes sometimes), then that's also extra work you need to do. With elastic search you also need to manually set weighting of different fields if you want the search to work well. So in the end it's a similar amount of work.

Using the right libraries, I think it's a similar amount of work overall with PostgreSQL. Elasticsearch is still easier initially. To be fair Lucene (which elasticsearch is based on) is a much more advanced text searching system.

What about the speed? They are index searches, and return fast - as designed. At [1] they mention that the speed is ok for 1-2 million documents. They also mention 50ms search time. It's also possible to make replicas for read queries if you don't want to put the search load on your main database. There is another report of searches taking 15ms [10]. Note that elastic search often takes 3-5ms for a search on that same author's hardware. Also note that the new asyncpg PostgreSQL driver gives significant latency improvements for general queries like this (35ms vs 2ms) [14].

Hybrid searches (relational searches combined with full text search) are another thing that PostgreSQL makes pretty easy. Say you wanted to ask "give me all companies with employees who wrote research papers, stack overflow answers, or github repos containing the text 'Deep Learning', where the authors live within 50km of Berlin". PostgreSQL could do those joins fairly efficiently for you.

The other massive advantage of PostgreSQL is that you can keep the search index in sync. The search index can be updated in the same transaction. So your data is consistent, and not out of date. It can be very important for some applications to return the most recent data.

How about searching across multiple human natural languages at once? PostgreSQL allows you to efficiently join across multiple language search results. So if you type "red hammer" into a German hardware website search engine, you can actually get some results.
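A minimal sketch of that idea (the products table and its columns are made up, and in practice you would put expression indexes on these): run the same query against both an English and a German text search configuration and combine the results.

import psycopg2

conn = psycopg2.connect("dbname=shop")  # assumed connection details
with conn, conn.cursor() as cur:
    # Match "red hammer" whether the row happens to be in English or German.
    cur.execute("""
        SELECT id, name FROM products
        WHERE to_tsvector('english', name || ' ' || description)
              @@ plainto_tsquery('english', %(q)s)
           OR to_tsvector('german', name || ' ' || description)
              @@ plainto_tsquery('german', %(q)s)
    """, {"q": "red hammer"})
    print(cur.fetchall())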

Anyone wanting more in-depth information should read or watch this FTS presentation [15] from last year. It's by some of the people who have done a lot of work on the implementation, and talks about 9.6 improvements, current problems, and things we might expect to see in version 10. There is also a blog post [16] with more details about the various improvements to FTS in 9.6.

You can also look at the RUM index extension, which provides faster ranking.

More reading.

Time series.

Data points with timestamps.
Time series databases are used a lot for monitoring - either monitoring server metrics (like cpu load), or monitoring sensors and all other manner of things, such as sensor data from any IoT application you can think of.

RRDtool from the late 90s.
To do efficient queries of data over, say, a whole month or even a year, you need to aggregate the values into smaller buckets: minute, hour, day, or month sized buckets. Some data is recorded at such a high frequency that doing an aggregate (sum, total, ...) over all of the raw data would take quite a while.

Round robin databases don't even store all the raw data, but put things into a circular buffer of time buckets. This saves a LOT of disk space.

The other thing time series databases do is accept a large amount of this type of data. To efficiently take in a lot of data, you can use things like COPY IN, rather than lots of individual inserts, or use SQL arrays of data. In the future (PostgreSQL 10), you should be able to use logical replication to have multiple data collectors.

Materialized views can be handy for presenting a different view of the internal data structures, to make things easier to query.

date_trunc can be used to truncate a timestamp into the bucket size you want. For example SELECT date_trunc('hour', timestamp) as timestamp.

Array functions, and binary types can be used to store big chunks of data in a compact form for processing later. Many time series databases do not need to know the latest results, and some time lag is good enough.

A BRIN index (new in 9.5) can be very useful for time queries. Selecting between two times on a field indexed with BRIN is much quicker: "We managed to improve our best case time by a factor of 2.6 and our worst case time by a factor of 30" [7]. This works as long as the rows are entered roughly in time order [6]. If they are not, for some reason, you can reorder them on disk with the CLUSTER command -- however, time series data usually comes in sorted by time anyway.
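Putting those pieces together, here is a minimal sketch (the readings table and its columns are made up) of a BRIN index on the timestamp column plus an hourly roll-up query using date_trunc:

import psycopg2

conn = psycopg2.connect("dbname=metrics")  # assumed connection details
with conn, conn.cursor() as cur:
    # BRIN indexes are tiny, and work well when rows arrive roughly in time order.
    cur.execute("CREATE INDEX IF NOT EXISTS readings_ts_brin ON readings USING BRIN (ts);")
    # Hourly buckets for one sensor over the last week.
    cur.execute("""
        SELECT date_trunc('hour', ts) AS hour, avg(value), max(value)
        FROM readings
        WHERE sensor_id = %s AND ts > now() - interval '7 days'
        GROUP BY 1
        ORDER BY 1
    """, (42,))
    for hour, avg_value, max_value in cur.fetchall():
        print(hour, avg_value, max_value)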

Monasca can provide an API and Grafana integration, and Monasca can query PostgreSQL. There's still no direct support for PostgreSQL in Grafana itself, however work has been in progress for quite some time; see the pull request in grafana.

Another project which uses time series in PostgreSQL is Tgres. It's compatible with statsd and graphite text for input, and provides enough of the Graphite HTTP API to be usable with Grafana. The author also blogs [1] a lot about optimal approaches to use for time series databases.

See this talk by Steven Simpson at the fosdem conference about infrastructure monitoring with PostgreSQL. In it he talks about using PostgreSQL to monitor and log a 100 node system.

In an older 'grisha' blog post [5], he states "I was able to sustain a load of ~6K datapoints per second across 6K series" on a 2010 laptop.

Can we get the data into a dataframe structure for analysis easily? Sure, if you are using sqlalchemy and pandas dataframes, you can load dataframes like this...  
import pandas as pd  # assuming `query` is an SQLAlchemy Query object
df = pd.read_sql(query.statement, query.session.bind)
This lets you unleash some very powerful statistics, and machine learning tools on your data. (there's also a to_sql).

Some more reading.

Object store for binary data. 

Never store images in your database!
I'm sure you've heard it many times before. But what if your images are your most important data? Surely they deserve something better than a filesystem? What if they need to be accessed from more than one web application server? The solution to this problem is often to store things in some cloud based storage like S3.

BYTEA is the type to use for binary data in PostgreSQL if the size is less than 1GB. For example, in a table like this:
CREATE TABLE files (
    id serial primary key,
    filename text not null,
    data bytea not null
);
Note, however, that streaming the file is not really supported with BYTEA by all PostgreSQL drivers. It needs to be entirely in memory.
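A minimal sketch of saving and loading a file with psycopg2 and the files table above (connection details assumed):

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection details

def save_file(path):
    with open(path, 'rb') as f:
        data = f.read()  # BYTEA travels in one piece, so it is all in memory
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO files (filename, data) VALUES (%s, %s) RETURNING id",
            (path, psycopg2.Binary(data)))
        return cur.fetchone()[0]

def load_file(file_id, out_path):
    with conn, conn.cursor() as cur:
        cur.execute("SELECT data FROM files WHERE id = %s", (file_id,))
        with open(out_path, 'wb') as f:
            f.write(cur.fetchone()[0])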

However, many images are only 200KB up to 10MB in size, which should be fine even if hundreds of images are added per day. A three year old laptop benchmark for you: saving 2500 1MB iPhone sized images with python and psycopg2 takes about 1 minute and 45 seconds, using just a single core (that's 2.5GB of data). It can be made 3x faster by using COPY IN/TO BINARY [1], but it is already more than fast enough for many uses.

If you need really large objects, then PostgreSQL has something called "Large Objects". But these aren't supported by some backup tools without extra configuration.

Batteries included? Both the python SQL libraries (psycopg2, and sqlalchemy) have builtin support for BYTEA.

But how do you easily copy files out of the database and into it? I made an image save and get gist here, to save and get files with a 45 line python script. It's even easier when you use an ORM, since the data is just an attribute you can write straight to a file (something like open('bla.png', 'wb').write(image.data)).

A fairly important thing to consider when putting gigabytes of binary data into PostgreSQL is that it will affect the backup/restore speed of your other data. This isn't such a problem if you have a hot spare replica, have point in time recovery (with WAL-E or pgbarman), use logical replication, or decide to restore selected tables.

How about speed? I found it faster to put binary data into PostgreSQL compared to S3, especially on low CPU clients (IoT), where you have to compute full checksums of the data on the client side before sending it to S3. This also depends on which geographical S3 location you are using, and your network connection to it.

S3 also provides other advantages and features (like built in replication, and being a managed service). But for storing a little bit of binary data, I think PostgreSQL is good enough. Of course, if you want a highly durable, globally distributed object store with very little setup, then things like S3 are the first choice.

More reading.

Realtime, pubsub, change feeds, Reactive.

Change feeds are feeds you can listen to for changes. Pubsub (the publish-subscribe pattern) can be done with LISTEN / NOTIFY and a TRIGGER.

Implement You've Got Mail functionality.
This is quite interesting if you are implementing 'soft real time' features on your website or apps. If something happens to your data, then your application can 'immediately' know about it. Websockets is the web technology which makes this perform well, although HTTP2 also allows server push, and various other systems were in use long before both of these. Say you were making a chat messaging website, and you wanted to play a "You've got mail!" sound. Your application can LISTEN to PostgreSQL, and when some data changes a TRIGGER can send a NOTIFY event which PostgreSQL passes to your application; your application can then push the event to the web browser.
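A minimal sketch of that flow (the messages table, the channel name, and the connection details are all made up): a trigger sends a NOTIFY on every insert, and a Python process sits in LISTEN waiting to be woken up.

import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=chat")  # assumed connection details
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

with conn.cursor() as cur:
    # One-off setup: a trigger function that notifies on every new message.
    cur.execute("""
        CREATE OR REPLACE FUNCTION notify_new_message() RETURNS trigger AS $$
        BEGIN
            PERFORM pg_notify('new_message', NEW.id::text);
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        DROP TRIGGER IF EXISTS messages_notify ON messages;
        CREATE TRIGGER messages_notify AFTER INSERT ON messages
            FOR EACH ROW EXECUTE PROCEDURE notify_new_message();
    """)
    cur.execute("LISTEN new_message;")

# Wait for events instead of polling; push them on to websocket clients from here.
while True:
    if select.select([conn], [], [], 5) != ([], [], []):
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            print("You've got mail! message id:", notify.payload)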

PostgreSQL can not give you hard real time guarantees, unfortunately. So custom high end video processing and storage systems, or specialized high speed financial products, are not domains PostgreSQL is suited to.

How well does it perform? In the Queue section, I mentioned thousands of events per core on an old laptop.

The main sources of latency issues are the query planner/optimizer, VACUUM, and ANALYZE.

The query planner is sort of amazing, but also sort of annoying. It automatically tries to figure out the best way to query your data for you. However, it doesn't automatically create an index where it thinks one would be good. Depending on environmental factors - like available CPU and IO, how much data is in various tables, and other statistics it gathers - it can change the way it searches for data. This is LOTS better than having to hand-tune your queries, and then update them every time the schema, host, or amount of data changes.

But sometimes it gets things wrong, and that isn't acceptable when you have performance requirements. William Stein (from the Sage Math project) wrote about some queries mysteriously being slow sometimes at [7]. This was after porting his web app to use PostgreSQL instead of rethinkdb (TLDR: the port was possible and the result faster). The solution is usually to monitor the slow queries, and try to force the query planner to follow a path that you know is fast, or to add/remove/tweak the index the query may or may not be using. Brady Holt wrote a good article on "Performance Tuning Queries in PostgreSQL".

Later on I cover the topic of column databases, and 'real time' queries over that type of data popular in financial and analytic products (pg doesn't have anything built in yet, but extensions exist).

VACUUM ANALYZE is a process that cleans things up with your data. It's a garbage collector (VACUUM) combined with a statistician (ANALYZE). It seems every release of PostgreSQL improves the performance for various corner cases. It used to have to be run manually, and now automatic VACUUM is a thing. Many more things can be done concurrently, and it can avoid having to read all the data in many more situations. However, sometimes, like with all garbage collectors it makes pauses. On the plus side, it can make your data smaller and inform itself about how to make faster queries. If you need to, you can turn off the autovacuum, and do things more manually. Also, you can just do the ANALYZE part to gather statistics, which can run much faster than VACUUM.

To get better latency with python and PostgreSQL, there is asyncpg by magicstack, which uses an asynchronous network model (python 3.5+) and the binary PostgreSQL protocol. This can give 2ms query times and is often faster than even golang and nodejs. It also lets you read in a million rows per second from PostgreSQL to python per core [8]. Memory allocations are reduced, as is context switching - both things that cause latency.

For these reasons, I think it's "good enough" for many soft real time uses, where the occasional time budget failure isn't the end of the world. If you load test your queries on real data (and for more data than you have), then you can be fairly sure it will work ok most of the time. Selecting the appropriate client side driver can also give you significant latency improvements.

More reading.

Log storage and processing

Being able to have your logs in a central place for queries, and statistics is quite helpful. But so is grepping through logs. Doing relational or even full text queries on them is even better.

rsyslog allows you to easily send your logs to a PostgreSQL database [1]. You set it up so that it stores the logs in files, but sends them to your database as well. This means that if the database goes down for a while, the logs are still there. The rsyslog documentation has a section on high speed logging by using buffering on the rsyslog side [4].

systemd is the more modern logging system, and it allows logging to remote locations with systemd-journal-remote. It sends JSON lines over HTTPS. You can take the data in with systemd (using it as a buffer) and then pipe it into PostgreSQL with COPY at high rates. The other option is to use the systemd support for sending logs to traditional syslogs like rsyslog, which can send it into a PostgreSQL.

Often you want to grep your logs. SELECT regex matches can be used for grep/grok like functionality. It can also be used to parse your logs into a table format you can more easily query.

A TRIGGER can be used to parse the data every time a log entry is inserted. Or you can use MATERIALIZED VIEWs if you don't need to refresh the information as often.
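As a rough sketch (the raw_logs table, its columns, and the "host process: message" line format are all made up), a regexp can split log lines into columns behind a materialized view:

import psycopg2

conn = psycopg2.connect("dbname=logs")  # assumed connection details
with conn, conn.cursor() as cur:
    # Parse the raw lines once into columns; lines that don't match are dropped.
    cur.execute(r"""
        CREATE MATERIALIZED VIEW IF NOT EXISTS parsed_logs AS
        SELECT received_at, m[1] AS host, m[2] AS process, m[3] AS message
        FROM (
            SELECT received_at,
                   regexp_matches(line, '^(\S+) (\S+): (.*)$') AS m
            FROM raw_logs
        ) AS parsed;
    """)
    cur.execute("REFRESH MATERIALIZED VIEW parsed_logs;")
    # grep-like query: how many errors per host?
    cur.execute("""
        SELECT host, count(*) FROM parsed_logs
        WHERE message ~ 'ERROR'
        GROUP BY host ORDER BY 2 DESC
    """)
    print(cur.fetchall())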

Is it fast enough? See the talk by Steven Simpson at the fosdem conference (mentioned in the time series section) about infrastructure monitoring with PostgreSQL, where he talks about using PostgreSQL to monitor and log a 100 node system. PostgreSQL on a single old laptop can quite happily ingest at a rate in the hundreds of thousands of messages per second range. Citusdata is an out-of-core solution which builds on PostgreSQL (and contributes back to it, yay!). It is being used to process billions of events, and is used by some of the largest companies on the internet (eg. Cloudflare, with 5% of internet traffic, uses it for logging). So PostgreSQL can scale up too (with out-of-core extensions).

Batteries included? In the time series section of this article, I mentioned that you can use grafana with PostgreSQL (with some effort). You can use this for dashboards and alerting (amongst other things). However, I don't know of any of the really good systems (Sentry, Datadog, elkstack) which have first class PostgreSQL support out of the box.

One advantage of having your logs in there is that you can write custom queries quite easily. Want to know how many requests per second from App server 1 there were, and link it up to your slow query log? That's just a normal SQL query, and you don't need to have someone grep through the logs... normal SQL tools can be used. When you combine this functionality with existing SQL analytics tools, this is quite nice.

I think it's good enough for many small uses. If you've got more than 100 nodes, or are doing a lot of events, it might not be the best solution (unless you have quite a powerful PostgreSQL cluster). It does take a bit more work, and it's not the road most traveled. However it does let you use all the SQL analytics tools with one of the best metrics and alerting systems.

More reading.

Queue for collecting data

When you have traffic bursts, it's good to persist the data quickly, so that you can queue up processing for later. Perhaps you normally get only 100 visitors per day, but then some news article comes out or your website is mentioned on the radio (or maybe spammers strike) -- this is bursty traffic.

Storing data for processing later is something that systems like Kafka excel at.
Using the COPY command, rather than lots of separate inserts, can give you a very nice speedup for buffering data. Processing the data on the way in, or having constraints and indexes, all slow ingestion down. So instead you can just put the raw data in a plain table, and then process it later like you would with a queue.

A lot of the notes for Log storage and Queuing apply here too. I guess you're starting to see a pattern? We've been able to use a few building blocks to implement efficient patterns in PostgreSQL which might have required specialized databases in the past.

The fastest way to get data into PostgreSQL from python? See this answer [1] where 'COPY {table} FROM STDIN WITH BINARY' is shown to be the fastest way.
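For a simpler text-format version of the same idea, psycopg2's copy_from streams rows in with COPY. A minimal sketch with a made-up events table:

import io
import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection details

# Buffer a burst of events (in memory or in a file), then COPY them in at once.
rows = [(1, "page_view"), (2, "signup"), (3, "page_view")]
buf = io.StringIO("".join("%d\t%s\n" % row for row in rows))

with conn, conn.cursor() as cur:
    cur.copy_from(buf, 'events', columns=('user_id', 'event_type'))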

More reading.

High availability, elasticity.

“Will the database always be there for you? Will it grow with you?”
To get things going quickly there are a number of places which offer PostgreSQL as a service [3][4][5][6][7][8], so you can get them to set up replication, monitoring, scaling, backups, and software updates for you.

The Recovery Point Objective (RPO), and Recovery Time Objective (RTO) are different for every project. Not all projects require extreme high availability. For some, it is fine to have the recovery happen hours or even a week later. Other projects can not be down for more than a few minutes or seconds at a time. I would argue that for many non-critical websites a hot standby and offsite backup will be 'good enough'.

I would highly recommend this talk by Gunnar Bluth - "An overview of PostgreSQL's backup, archiving, and replication". However, you might want to preprocess the sound with your favourite sound editor (eg. Audacity) to remove the feedback noise. The slides, thankfully, come without the ear destroying feedback sounds.

By using a hot standby replica you get the ability to quickly fail over from your main database, so you can be back up within minutes or seconds. By using pgbarman or WAL-E, you get point in time recovery and offsite backups of the database. To make managing the replicas easier, a tool like repmgr can come in handy.

Having really extreme high availability with PostgreSQL is currently kind of hard, and requires out of core solutions. It should be easier in version 10.0 however.

Patroni is an interesting system which helps you deploy a high availability cluster on AWS (via Spilo), and work is in progress to make it run on Kubernetes clusters. Spilo is currently being used in production and can do various management tasks, like auto scaling, backups, and node replacement on failure. It works with a minimum of three nodes.

As you can see, there are multiple systems and multiple vendors that help you scale PostgreSQL. On the low end, you can have backups of your database to S3 for cents per month, and a hot standby replica for $5/month. You can also scale a single node all the way up to a machine with 24TB of storage, 32 cores, and 244GB of memory. That's not in the same range as Cassandra installations with thousands of nodes, but it's still quite an impressive range.

More reading.

Column store, graph databases, other databases, ... finally The End?

This article is already way too long... so I'll go quickly over these two topics.

Graph databases like Neo4j allow you to do complex graph queries: edges, nodes, and hierarchies. How do you do that in PostgreSQL? Denormalise the data, and use a path-like attribute with LIKE. So to find things in a graph, say all the children, you can pre-compute the path inside a string, rather than doing complex recursive queries and joins using foreign keys.
SELECT * FROM nodes WHERE path LIKE '/parenta/child2/child3%';
Then you don't need super complex queries to get the graph structure from parent_id, child_ids and such. (Remember before how you can put a trigram index for fast LIKEs?) You can also use other pattern matching queries on this path, to do things like find all the parents up to 3 levels high that have a child.

Tagging data with a fast LIKE becomes very easy as well. Just store the tags in a comma separated field and put a trigram index on it.
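A minimal sketch of that trick (the items table and tags column are made up): the pg_trgm extension lets LIKE '%...%' queries use an index.

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection details
with conn, conn.cursor() as cur:
    # pg_trgm provides trigram indexes, which speed up LIKE/ILIKE substring matches.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
    cur.execute("""
        CREATE INDEX IF NOT EXISTS items_tags_trgm
            ON items USING GIN (tags gin_trgm_ops);
    """)
    # Find everything tagged 'hammer'; the same idea works on the graph path column above.
    cur.execute("SELECT id, tags FROM items WHERE tags LIKE %s", ('%hammer%',))
    print(cur.fetchall())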

Column stores are where the data is stored in a column layout, instead of in rows. They are often used for real time analytic workloads. One of the oldest and best of these is Kdb+. Google made one, Druid is another popular one, and there are also plenty of custom ones used in graphics.

But doesn't PostgreSQL store everything in row based format? Yes it does. However, there is an open source extension called cstore_fdw by Citus Data which is a column-oriented store for PostgreSQL.

So how fast is it? There is a great series of articles by Mark Litwintschik, where he benchmarks a billion taxi ride data set with PostgreSQL, kdb+, and various other systems. Without cstore_fdw or parallel workers, PostgreSQL took 3.5 hours to do a query. With 4 parallel workers, it was reduced to 1 hour and 1 minute. With cstore_fdw it took 2 minutes and 32 seconds. What a speed up!

The End.

I'm sorry that was so long. But it could have been way longer. It's not my fault...

PostgreSQL carries around such a giant Tool Chest.

Hopefully all these words will be helpful next time you want to use PostgreSQL for something outside of relational data. Also, I hope you can see that it is possible to replace 10 database systems with just one, and that by doing so you can gain a significant ops advantage.

Any corrections or suggestions? Please leave a comment, or see you on twitter @renedudfield
There was discussion on hn and python reddit.