Monday, February 20, 2017

Is Type Tracing for Python useful? Some experiments.

Type Tracing - as a program runs you trace it and record the types of variables coming in and out of functions, and being assigned to variables.
Is Type Tracing useful for providing quality benefits, documentation benefits, porting benefits, and also speed benefits to real python programs?

Python is now a gradually typed language, meaning that you can gradually apply types and along with type inference statically check your code is correct. Once you have added types to everything, you can catch quite a lot of errors. For several years I've been using the new type checking tools that have been popping up in the python ecosystem. I've given talks to user groups about them, and also trained people to use them. I think a lot of people are using these tools without even realizing it. They see in their IDE warnings about type issues, and methods are automatically completed for them.

But I've always had some thoughts in the back of my head about recording types at runtime of a program in order to help the type inference out (and to avoid having to annotate them manually yourself).

Note, that this technique is a different, but related thing to what is done in a tracing jit compiler.
Some days ago I decided to try Type Tracing out... and I was quite surprised by the results.

I asked myself these questions.

  • Can I store the types coming in and out of python functions, and the types assigned to variables in order to be useful for other things based on tracing the running of a program? (Yes)
  • Can I "Type Trace" a complex program? (Yes, a flask+sqlalchemy app test suite runs)
  • Is porting python 2 code quicker by Type Tracing combined with static type checking, documentation generation, and test generation? (Yes, refactoring is safer with a type checker and no manually written tests)
  • Can I generate better documentation automatically with Type Tracing? (Yes, return and parameter types and example values helps understanding greatly)
  • Can I use the types for automatic property testing? (Yes, hypothesis does useful testing just knowing some types and a few examples... which we recorded with the tracer)
  • Can I use example capture for tests and docs, as well as the types? (Yes)
  • Can I generate faster compiled code automatically just using the recorded types and Cython (Yes).

Benefits from Type Tracing.

Below I try to show that the following benefits can be obtained by combining Type Tracing with other existing python tools.
  • Automate documentation generation, by providing types to the documentation tool, and by collecting some example inputs and outputs.
  • Automate some type annotation.
  • Automatically find bugs static type checking can not. Without full type inference, existing python static type checkers can not find many issues until the types are fully annotated. Type Tracing can provide those types.
  • Speed up Python2 porting process, by finding issues other tools can't. It can also speed things up by showing people types and example inputs. This can greatly help people understand large programs when documentation is limited.
  • Use for Ahead Of Time (AOT) compilation with Cython.
  • Help property testing tools to find simple bugs without manually setting properties.

Tools used to hack something together.

  • coverage (extended the coverage checker to record types as it goes) 
  • mypy (static type checker for python)
  • Hypothesis (property testing... automated test generator)
  • Cython (a compiler for python code, and code with type annotations)
  • jedi (another python static type checker)
  • Sphinx (automatic documentation generator).
  • Cpython (the original C implementation of python)
More details below on the experiments.

Type Tracing using 'coverage'.

Originally I hacked up a set_trace script... and started going. But there really are so many corner cases. Also, I already run the "coverage" tool over the code base I'm working on.

I started with coverage.pytracer.PyTracer, since it's python. Coverage also comes with a faster tracer written in C. So far I'm just using the python one.

The plan later would be to perhaps use CoverageData. Which uses JSON, which means storing the type will be hard sometimes (eg, when they are dynamically generated). However, I think I'm happy to start with easy types. To start simple, I'll just record object types as strings with something like `repr(type(o)) if type(o) is not type else repr(o)`. Well, I'm not sure. So far, I'm happy with hacking everything into my fork of coverage, but to move it into production there is more work to be done. Things like multiprocess, multithreading all need to be handled.

Porting python 2 code with type tracing.

I first started porting code to python 3 in the betas... around 2007. Including some C API modules. I think I worked on one of the first single code base packages. Since then the tooling has gotten a lot better. Compatibility libraries exist (six), lots of people have figured out the dangerous points and documented them. Forward compatibility features were added into the python2.6 and 2.7, and 3.5 releases to make porting easier. However, it can still be hard.

Especially when Python 2 code bases often don't have many tests. Often zero tests. Also, there may be very little documentation, and the original developers have moved on.

But the code works, and it's been in production for a long time, and gets updates occasionally. Maybe it's not updated as often as it's needed because people are afraid of breaking things.

Steps to port to python 3 are usually these:

  1. Understand the code.
  2. Run the code in production (or on a copy of production data).
  3. With a debugger, look at what is coming in and out of functions.
  4. Write tests for everything.
  5. Write documentation.
  6. Run 2to3.
  7. Do lots of manual QA.
  8. Start refactoring.
  9. Repeat. Repeat manually writing tests, docs, and testing manually. Many times.
Remember that writing tests is usually harder than writing the code in the first place.

With type tracing helping to generate docs, types for the type checker, examples for human reading plus for the hypothesis property checker we get a lot more tools to help ensure quality.

A new way to port python2 code could be something like...
  1. Run program under Type Tracing, line/branch coverage, and example capture.
  2. Look at generated types, example inputs and outputs.
  3. Look at generated documentation.
  4. Gradually add type checking info with help of Type Tracing recorded types.
  5. Generate tests automatically with Type Tracing types, examples, and hypothesis automated property testing. Generate empty test stubs for things you still need to test.
  6. Once each module is fully typed, you can statically type check it.
  7. You can cross validate your type checked python code against your original code. Under the Type Tracer.
  8. Refactoring is easier with better docs, static type checks, tests, types for arguments and return values, and example inputs and outputs.
  9. Everything should be ported to work with the new forwards compatibility functionality in python2.7.
  10. Now with your various quality checks in place, you can start porting to python3. Note, you might not have needed to change any of the original code - only add types.
I would suggest the effort is about 1/5th of the normal time it takes to port things. Especially if you want to make sure the chance of introducing errors is very low.

Below are a couple of issues where Type Tracing can help over existing tools.

Integer divide issue.

Here I will show that the 2to3 conversion tool makes a bug with. Also, mypy does not detect a problem with the code.

def int_problem(x):
    return x / 4

$ python2
$ python3

$ mypy --py2
$ mypy
$ 2to3
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: Refactored
---    (original)
+++    (refactored)
@@ -3,4 +3,4 @@
 def int_problem(x):
     return x / 4

RefactoringTool: Files that need to be modified:

See how when run under python3 it gives a different result?

Can we fix it when Type Tracing adds types?  (Yes)

So, how about if we run the program under type tracing, and record the input types coming in and out? See how it adds a python3 compatible comment about taking an int, and returning an int. This is so that mypy (and other type checkers) can see what it is supposed to take in.
def int_problem(x):
    # type: (int) -> int
    return x / 4
$ mypy error: Incompatible return value type (got "float", expected "int")
I'm happy that Yes, Type Tracing combined with mypy can detect this issue whereas mypy can not by itself.

Binary or Text file issue?

Another porting issue not caught by existing tools is trying to do the right thing when a python file is in binary mode or in text mode. If in binary, read() will return bytes, otherwise it might return text.

In theory this could be made to work, however at the time of writing, there is an open issue with "dependent types" or "Factory Pattern" functions in mypy. More information on this, and also a work around I wrote see this issue:

In there I show that you can create your own replacement that always returns one type. eg, open_rw(fname) instead of open(fname, 'rw').

Once you know that .read() will return bytes, then you also know that it can't call .format() in python 3. The solution is to use % string formatting on bytes, which is supported from python3.5 upwards.

x = # type: bytes

So the answer here is that mypy could likely solve this issue by itself in the future (once things are fully type annotated). But for now, it's good to see combining type tracing with mypy could help detect binary and text encoding issues much faster.

Generating Cython code with recorded types.

I wanted to see if this was possible. So I took the simple example from the cython documentation.

I used my type tracer to transform this python:
def f(x):
    return x**2-x

def integrate_f(a, b, N):
    s = 0
    dx = (b-a)/N
    for i in range(N):
        s += f(a+i*dx)
    return s * dx

Before you look below... take a guess what parameters a, b, and N are? Note, how there are no comments. Note how the variable names are single letter. Note, how there are no tests. There are no examples.

In [2]: %timeit integrate_f(10.4, 2.3, 17)
100000 loops, best of 3: 5.12 µs per loop

Into this Cython code with annotated types after running it through Type Tracing:
In [1]: %load_ext Cython

In [2]: %%cython
   ...: cdef double f(double x):
   ...:     return x**2-x
   ...: def integrate_f_c(double a, double b, int N):
   ...:     """
   ...:     :Example:
   ...:     >>> integrate_f_c(10.4, 2.3, 17)
   ...:     -342.34804152249137
   ...:     """
   ...:     cdef int i
   ...:     cdef double s, dx
   ...:     s = 0
   ...:     dx = (b-a)/N
   ...:     for i in range(N):
   ...:         s += f(a+i*dx)
   ...:     return s * dx 

In [3]: %timeit integrate_f_c(10.4, 2.3, 17)

10000000 loops, best of 3: 117 ns per loop
Normal python was 5200 nanoseconds. The cython compiled version is 117 nanoseconds.  The result is 44x faster code, and we have all the types annotated, with an example. This helps you understand it a little better than before too.

This was a great result for me. It shows that yes combining Type Tracing with Cython can give improvements over Cython just by itself. Note, that Cython is not only for speeding up simple numeric code. It's also been used to speed up string based code, database access, network access, and game code.

So far I've made a simple mapping of python types to cython types. To make the code more useful would require quite a bit more effort. However, if you use it as a tool to help you write cython code yourself, then it's very useful to speed up that process.

The best cases so far are when it knows all of the types, all of the types have direct cython mappings, and it avoids calling python functions inside the function. In other words, 'pure' functions.

Cross validation for Cython and python versions?

In a video processing project I worked on there were implementations in C, and other assembly implementations of the same functions. A very simple way of testing is to run all the implementations and compare the results. If the C implementation gives the same results as the assembly implementations, then there's a pretty good chance they are correct.

In [1]:  assert integrate_f_c(10.4, 2.3, 17) == integrate_f(10.4, 2.3, 17)

If we have a test runner, we can check if the inputs and outputs are the same between the compiled code and the non compiled code. That is, cross validate implementations against each other for correctness.

Property testing.

The most popular property testing framework Quickcheck from the Haskell world. However, python also has an implementation - Hypothesis. Rather than supply examples, as is usual with unit testing you tell it about properties which hold true.

Can we generate a hypothesis test automatically using just types collected with Type Tracing?

Below we can see some unit tests (example based testing), as well as some Hypothesis tests (property testing). They are for a function "always_add_something(x)", which always adds something to the number given in. As a property, we would say that "always_add_something(x) > x".  That property will hold to be true for every value of x given x is an int.

Note, that the program is fully typed, and passes type checking with mypy. Also note that there is 100% test coverage if I remove the divide by zero error I inserted.

from hypothesis import given
import hypothesis.strategies

from bad_logic_issue import always_add_something, always_add_something_good

def test_always_add_something():# type: () -> None
    #type: () -> None
    assert always_add_something(5) >= 5
    assert always_add_something(200) >= 200

def test_always_add_something_good():
    #type: () -> None
    assert always_add_something_good(5) >= 5
    assert always_add_something_good(200) >= 200

def test_always_add_something(x):
    assert always_add_something(x) > x

# Here we test the good one.
def test_always_add_something(x):
    assert always_add_something_good(x) > x
Here are two implementations of the function. The first one is a contrived example in order to show two types of logic errors that are quite common. Even 30 year old code used by billions of people has been shown to have these errors. They're sort of hard to find with normal testing methods.

def always_add_something(x):
    # type: (int) -> int
    '''Silly function that is supposed to always add something to x.

    But it doesn't always... even though we have
     - 'complete' test coverage.
     - fully typed
    r = x #type: int
    if x > 0 and x < 10:
        r += 20
    elif x > 15 and x < 30:
        r //= 0
    elif x > 100:
        r += 30

    return r

def always_add_something_good(x):
    # type: (int) -> int
    '''This one always does add something.
    return x + 1

Now, hypothesis can find the errors when you write the property that the return value needs to be greater than the input. What about if we just use the types we record with Type Tracing to give hypothesis a chance to test? Hypothesis comes with a number of test strategies which generate many variations of a type. Eg, there is an "integers" strategy.

# Will it find an error just telling hypothesis that it takes an int as input?
def test_always_add_something(x):

It finds the divide by zero issue (when x is 16). However it does not find the other issue, because it still does not know that there is a problem. We haven't told it anything about the result always needing to be greater than the input. ZeroDivisionError
-------------------------------------------------------- Hypothesis --------------------------------------------------------
Falsifying example: test_always_add_something(x=16)
The result is that yes, it could find one issue automatically, without having to write any extra test code, just from Trace Typing.

For pure functions, it would be also useful to record some examples for unit test generation.

In conclusion.

I'm happy with the experiment overall. I think it shows it can be a fairly useful technique for making python programs more understandable, faster, and more correct. It can also help speed up porting old python2 code dramatically (especially when that code has limited documentation and tests).

I think the experiment also shows that combining existing python tools (coverage, mypy, Cython, and hypothesis) can give some interesting extra abilities without not too much extra effort. eg. I didn't need to write a robust tracing module, I didn't need to write a static type checker, or a python compiler. However, it would take some effort to turn these into robust general purpose tools. Currently what I have is a collection of fragile hacks, without support for many corner cases :)

For now I don't plan to work on this any more in the short term. (Unless of course someone wants to hire me to port some python2 code. Then I'll work on these tools again since it speeds things up quite a lot).

Any corrections or suggestions? Please leave a comment, or see you on twitter @renedudfield

Monday, February 13, 2017

Is PostgreSQL good enough?

tldr; you can do jobs, queues, real time change feeds, time series, object store, document store, full text search with PostgreSQL. How to, pros/cons, rough performance and complexity levels are all discussed. Many sources and relevant documentation is linked to.

Your database is first. But can PostgreSQL be second?

Web/app projects these days often have many distributed parts. It's not uncommon for groups to use the right tool for the job. The right tools are often something like the choice below.
  • Redis for queuing, and caching.
  • Elastic Search for searching, and log stash.
  • Influxdb or RRD for timeseries.
  • S3 for an object store.
  • PostgreSQL for relational data with constraints, and validation via schemas.
  • Celery for job queues.
  • Kafka for a buffer of queues or stream processing.
  • Exception logging with PostgreSQL (perhaps using Sentry)
  • KDB for low latency analytics on your column oriented data.
  • Mongo/ZODB for storing documents JSON (or mangodb for /dev/null replacement) 
  • SQLite for embedded. 
  • Neo4j for graph databases.
  • RethinkDB for your realtime data, when data changes, other parts 'react'.
  • ...
For all the different nodes this could easily cost thousands a month, require lots of ops knowledge and support, and use up lots of electricity. To set all this up from scratch could cost one to four weeks of developer time depending on if they know the various stacks already. Perhaps you'd have ten nodes to support.

Could you gain an ops advantage by using only PostgreSQL? Especially at the beginning when your system isn't all that big, and your team size is small, and your requirements not extreme? Only one system to setup, monitor, backup, install, upgrade, etc.

This article is my humble attempt to help people answer the question...

Is PostgreSQL good enough?

Can it be 'good enough' for all sorts of different use cases? Or do I need to reach into another toolbox?

Every project is different, and often the requirements can be different. So this question by itself is impossible to answer without qualifiers. Many millions of websites and apps in the world have very few users (less than thousands per month), they might need to handle bursty traffic at 100x the normal rate some times. They might need interactive, or soft realtime performance requirements for queries and reports. It's really quite difficult to answer the question conclusively for every use case, and for every set of requirements. I will give some rough numbers and point to case studies, and external benchmarks for each section.

Most websites and apps don't need to handle 10 million visitors a month, or have 99.999% availability when 95% availability will do, ingest 50 million metric rows per day, or do 400,000 jobs per second, or query over TB's of data with sub millisecond response times.

Tool choice.

I've used a LOT of different databases over time. CDB, Elastic Search, Redis, SAP (is it a db or a COBOL?), BSDDB/GDBM, SQLite... Even written some where the requirements were impossible to match with off the shelf systems and we had to make them ourselves (real time computer vision processing of GB/second in from the network). Often PostgreSQL simply couldn't do the job at hand (or mysql was installed already, and the client insisted). But sometimes PostgreSQL was just merely not the best tool for the job.

A Tool Chest
Recently I read a book about tools. Woodworking tools, not programming tools. The whole philosophy of the book is a bit much to convey here... but The Anarchist's Tool Chest is pretty much all about tool choice (it's also a very fine looking book, that smells good too). One lesson it teaches is about when selecting a plane (you know the things for stripping wood). There are dozens of different types perfect for specific situations. There's also some damn good general purpose planes, and if you just select a couple of good ones you can get quite a lot done. Maybe not the best tool for the job, but at least you will have room for them in your tool chest. On the other hand, there are also swiss army knives, and 200 in one tools off teevee adverts. I'm pretty sure PostgreSQL is some combination of a minimal tool choice and the swiss army knife tool choice in the shape of a big blue solid elephant.

PostgreSQL is an elephant sized tool chest that holds a LOT of tools.

Batteries included?

Does PostgreSQL come with all the parts for full usability? Often the parts are built in, but maybe a bit complicated, but not everything is built in. But luckily there are some good libraries which make the features more usable ("for humans").

For from scratch people, I'll link to the PostgreSQL documentation. I'll also link to already made systems which already use PostgreSQL for (queues, time series, graphs, column stores, document data bases), which you might be able to use for your needs. This article will slanted towards the python stack, but there are definitely alternatives in the node/ruby/perl/java universes. If not, I've listed the PostgreSQL parts and other open source implementations so you can roll your own.

By learning a small number of PostgreSQL commands, it may be possible to use 'good enough' implementations yourself. You might be surprised at what other things you can implement by combining these techniques together. 

Task, or job queues.

Recent versions of PostgeSQL support a couple of useful technologies for efficient and correct queues.

First is the LISTEN/NOTIFY. You can LISTEN for events, and have clients be NOTIFY'd when they happen. So your queue workers don't have to keep polling the database all the time. They can get NOTIFIED when things happen.

The recent addition in 9.5 of the SKIP LOCKED locking clause to PostgreSQL SELECT, enables efficient queues to be written when you have multiple writers and readers. It also means that a queue implementation can be correct [2].

Finally 9.6 saw plenty of VACUUM performance enhancements which help out with queues.

Batteries included?

A very popular job and task system is celery. It can support various SQL backends, including PostgreSQL through sqlalchemy and the Django ORM. [ED: version 4.0 of celery doesn't have pg support]

A newer, and smaller system is called pq. It sort of models itself off the redis python 'rq' queue API. However, with pq you can have a transactional queue. Which is nice if you want to make sure other things are committed AND your job is in the queue. With a separate system this is a bit harder to guarantee.

Is it fast enough? pq states in its documentation that you can do 1000 jobs per second per core... but on my laptop it did around 2000. In the talk "Can elephants queue?" 10,000 messages per second are mentioned with eight clients.

More reading.

Full text search.

Full text search Searching the full text of the document, and not just the metadata.
PostgreSQL has had full text search for quite a long time as a separate extension, and now it is built in. Recently, it's gotten a few improvements which I think now make it "good enough" for many uses.

The big improvement in 9.6 is phrase search. So if I search for "red hammer" I get things which have both of them - not things that are red, and things that are a hammer. It can also return documents where the first word is red, and then five words later hammer appears.

One other major thing that elastic search does is automatically create indexes on all the fields. You add a document, and then you can search it. That's all you need to do. PostgreSQL is quite a lot more manual than that. You need to tell it which fields to index, and update the index with a trigger on changes (see triggers for automatic updates).  But there are some libraries which make things much easier. One of them is sqlalchemy_searchable. However, I'm not aware of anything as simple and automatic as elastic search here.
  • What about faceted search? These days it's not so hard to do at speed. [6][7]
  • What about substring search on an index (fast LIKE)? It can be made fast with a trigram index. [8][9]
  • Stemming? Yes. [11
  • "Did you mean" fuzzy matching support? Yes. [11
  • Accent support? (My name is René, and that last é breaks sooooo many databases). Yes. [11]
  • Multiple languages? Yes. [11]
  • Regex search when you need it? Yes. [13]
If your main data store is PostgreSQL and you export your data into Elasticsearch (you should NOT use elastic search as the main store, since it still crashes sometimes), then that's also extra work you need to do. With elastic search you also need to manually set weighting of different fields if you want the search to work well. So in the end it's a similar amount of work.

Using the right libraries, I think it's a similar amount of work overall with PostgreSQL. Elasticsearch is still easier initially. To be fair Lucene (which elasticsearch is based on) is a much more advanced text searching system.

What about the speed? They are index searches, and return fast - as designed. At [1] they mention that the speed is ok for 1-2 million documents. They also mention 50ms search time. It's also possible to make replicas for read queries if you don't want to put the search load on your main database. There is another report for searches taking 15ms [10]. Note that elastic search often takes 3-5ms for a search on that same authors hardware. Also note, that the new asyncpg PostgreSQL driver gives significant latency improvements for general queries like this (35ms vs 2ms) [14].

Hybrid searches (relational searches combined with full text search) is another thing that PostgreSQL makes pretty easy. Say you wanted to ask "Give me all companies who have employees who wrote research papers, stack overflow answers, github repos written with the text 'Deep Learning' where the authors live with within 50km of Berlin. PostgreSQL could do those joins fairly efficiently for you.

The other massive advantage of PostgreSQL is that you can keep the search index in sync. The search index can be updated in the same transaction. So your data is consistent, and not out of date. It can be very important for some applications to return the most recent data.

How about searching across multiple human natural languages at once? PostgreSQL allows you to efficiently join across multiple language search results. So if you type "red hammer" into a German hardware website search engine, you can actually get some results.

Anyone wanting more in-depth information should read or watch this FTS presentation [15] from last year. It's by some of the people who has done a lot of work on the implementation, and talks about 9.6 improvements, current problems, and things we might expect to see in version 10. There is also a blog post [16] with more details about various improvements in 9.6 to FTS.

You can see the RUM index extension (which has faster ranking) at

More reading.

Time series.

Data points with timestamps.
Time series databases are used a lot for monitoring. Either for monitoring server metrics (like cpu load) or for monitoring sensors and all other manner of things. Perhaps sensor data, or any other IoT application you can think of.

RRDtool from the late 90s.
 To do efficient queries of data over say a whole month or even a year, you need to aggregate the values into smaller buckets. Either minute, hour, day, or month sized buckets. Some data is recorded at such a high frequency, that doing an aggregate (sum, total, ...) of all that data would take quite a while.

Round robin databases don't even store all the raw data, but put things into a circular buffer of time buckets. This saves a LOT of disk space.

The other thing time series databases do is accept a large amount of this type of data. To efficiently take in a lot of data, you can use things like COPY IN, rather than lots of individual inserts, or use SQL arrays of data. In the future (PostgreSQL 10), you should be able to use logical replication to have multiple data collectors.

Materialized views can be handy to have a different view of the internal data structures. To make things easier to query.

date_trunc can be used to truncate a timestamp into the bucket size you want. For example SELECT date_trunc('hour', timestamp) as timestamp.

Array functions, and binary types can be used to store big chunks of data in a compact form for processing later. Many time series databases do not need to know the latest results, and some time lag is good enough.

A BRIN index (new in 9.5) can be very useful for time queries. Selecting between two times on a field indexed with BRIN is much quicker.  "We managed to improve our best case time by a factor of 2.6 and our worst case time by a factor of 30" [7]. As long as the rows are entered roughly in time order [6]. If they are not for some reason you can reorder them on disk with the CLUSTER command -- however, often time series data comes in sorted by time.

Monasca can provide graphana and API, and Monasca queries PostgreSQL. There's still no direct support in grapha for PostgreSQL, however work has been in progress for quite some time. See the pull request in grafana.

Another project which uses time series in PostgreSQL is Tgres. It's compatible with statsd, graphite text for input, and provides enough of the Graphite HTTP API to be usable with Grafana. The author also blogs[1] a lot about different optimal approaches to use for time series databases.

See this talk by Steven Simpson at the fosdem conference about infrastructure monitoring with PostgreSQL. In it he talks about using PostgreSQL to monitor and log a 100 node system.

In an older 'grisha' blog post [5], he states "I was able to sustain a load of ~6K datapoints per second across 6K series" on a 2010 laptop.

Can we get the data into a dataframe structure for analysis easily? Sure, if you are using sqlalchemy and pandas dataframes, you can load dataframes like this...  
df = pd.read_sql(query.statement, query.session.bind)
This lets you unleash some very powerful statistics, and machine learning tools on your data. (there's also a to_sql).

Some more reading.

Object store for binary data. 

Never store images in your database!
I'm sure you've heard it many times before. But what if your images are your most important data? Surely they deserve something better than a filesystem? What if they need to be accessed from more than one web application server? The solution to this problem is often to store things in some cloud based storage like S3.

BYTEA is the type to use for binary data in PostgreSQL if the size is less than 1GB.
    id serial primary key,
    filename text not null,
    data bytea not null
Note, however, that streaming the file is not really supported with BYTEA by all PostgreSQL drivers. It needs to be entirely in memory.

However, many images are only 200KB or up to 10MB in size. Which should be fine even if you get hundreds of images added per day. A three year old laptop benchmark for you... Saving 2500 1MB iPhone sized images with python and psycopg2 takes about 1 minute and 45 seconds, just using a single core. (That's 2.5GB of data). It can be made 3x faster by using COPY IN/TO BINARY [1], however that is more than fast enough for many uses.

If you need really large objects, then PostgreSQL has something called "Large Objects". But these aren't supported by some backup tools without extra configuration.

Batteries included? Both the python SQL libraries (psycopg2, and sqlalchemy) have builtin support for BYTEA.

But how do you easily copy files out of the database and into it? I made a image save and get gist here to save and get files with a 45 line python script. It's even easier when you use an ORM, since the data is just an attribute (open('bla.png').write(

A fairly important thing to consider with putting gigabytes of binary data into your PostgreSQL is that it will affect the backup/restore speed of your other data. This isn't such a problem if you have a hot spare replica, have point in time recovery(with WALL-e, pgbarman), use logical replication, or decide to restore selective tables.

How about speed? I found it faster to put binary data into PostgreSQL compared to S3. Especially on low CPU clients (IoT), where you have to do full checksums of the data before sending it on the client side to S3. This also depends on the geographical location of S3 you are using, and your network connections to it.

S3 also provides other advantages and features (like built in replication, and it's a managed service). But for storing a little bit of binary data, I think PostgreSQL is good enough. Of course if you want a highly durable globally distributed object store with very little setup then things like S3 are first.

More reading.

Realtime, pubsub, change feeds, Reactive.

Change feeds are a feed you can listen to for changes.  The pubsub (or Publish–subscribe pattern), can be done with LISTEN / NOTIFY and TRIGGER.

Implement You've Got Mail functionality.
This is quite interesting if you are implementing 'soft real time' features on your website or apps. If something happens to your data, then your application can 'immediately' know about it.  Websockets is the name of the web technology which makes this perform well, however HTTP2 also allows server push, and various other systems have been in use for a long time before both of these. Say you were making a chat messaging website, and you wanted to make a "You've got mail!" sound. Your Application can LISTEN to PostgreSQL, and when some data is changed a TRIGGER can send a NOTIFY event which PostgreSQL passes to your application, your application can then push the event to the web browser.

PostgreSQL can not give you hard real time guarantees unfortunately. So custom high end video processing and storage systems, or specialized custom high speed financial products are not domains PostgreSQL is suited.

How well does it perform? In the Queue section, I mentioned thousands of events per core on an old laptop.

Issues for latency are the query planner and optimizer, and VACUUM, and ANALYZE.

The query planner is sort of amazing, but also sort of annoying. It can automatically try and figure out the best way to query data for you. However, it doesn't automatically create an index where it might think one would be good. Depending on environmental factors, like how much CPU, IO, data in various tables and other statistics it gathers, it can change the way it searches for data. This is LOTS better than having to write your queries by hand, and then updating them every time the schema, host, or amount of data changes.

But sometimes it gets things wrong, and that isn't acceptable when you have performance requirements. William Stein (from the Sage Math project) wrote about some queries mysteriously some times being slow at [7]. This was after porting his web app to use PostgreSQL instead of rethinkdb (TLDR; the port was possible and the result faster). The solution is usually to monitor those slow queries, and try to force the query planner to follow a path that you know is fast. Or to add/remove or tweak the index the query may or may not be using. Brady Holt wrote a good article on "Performance Tuning Queries in PostgreSQL".

Later on I cover the topic of column databases, and 'real time' queries over that type of data popular in financial and analytic products (pg doesn't have anything built in yet, but extensions exist).

VACUUM ANALYZE is a process that cleans things up with your data. It's a garbage collector (VACUUM) combined with a statistician (ANALYZE). It seems every release of PostgreSQL improves the performance for various corner cases. It used to have to be run manually, and now automatic VACUUM is a thing. Many more things can be done concurrently, and it can avoid having to read all the data in many more situations. However, sometimes, like with all garbage collectors it makes pauses. On the plus side, it can make your data smaller and inform itself about how to make faster queries. If you need to, you can turn off the autovacuum, and do things more manually. Also, you can just do the ANALYZE part to gather statistics, which can run much faster than VACUUM.

To get better latency with python and PostgreSQL, there is asyncpg by magicstack. Which uses an asynchronous network model (python 3.5+), and the binary PostgreSQL protocol. This can have 2ms query times and is often faster than even golang, and nodejs. It also lets you read in a million rows per second from PostgreSQL to python per core [8]. Memory allocations are reduced, as is context switching - both things that cause latency.

For these reasons, I think it's "good enough" for many soft real time uses, where the occasional time budget failure isn't the end of the world. If you load test your queries on real data (and for more data than you have), then you can be fairly sure it will work ok most of the time. Selecting the appropriate client side driver can also give you significant latency improvements.

More reading.

Log storage and processing

Being able to have your logs in a central place for queries, and statistics is quite helpful. But so is grepping through logs. Doing relational or even full text queries on them is even better.

rsyslog allows you to easily send your logs to a PostgeSQL database [1]. You set it up so that it stores the logs in files, but sends them to your database as well. This means if the database goes down for a while, the logs are still there. The rsyslog documentation has a section on high speed logging by using buffering on the rsyslog side [4].

systemd is the more modern logging system, and it allows logging to remote locations with systemd-journal-remote. It sends JSON lines over HTTPS. You can take the data in with systemd (using it as a buffer) and then pipe it into PostgreSQL with COPY at high rates. The other option is to use the systemd support for sending logs to traditional syslogs like rsyslog, which can send it into a PostgreSQL.

Often you want to grep your logs. SELECT regex matches can be used for grep/grok like functionality. It can also be used to parse your logs into a table format you can more easily query.

TRIGGER can be used to parse the data every time a log entry is inserted. Or you can use MATERIALIZED VIEWs if you don't need to refresh the information as often.

Is it fast enough? See this talk by Steven Simpson at the fosdem conference about infrastructure monitoring with PostgreSQL. In it he talks about using PostgreSQL to monitor and log a 100 node system. PostgreSQL on a single old laptop can quite happy ingest at a rate in the hundreds of thousands of messages per second range. Citusdata is an out of core solution which builds on PostgreSQL(and contributes to it ya!). It is being used to process billions of events, and is used by some of the largest companies on the internet (eg. Cloudflare with 5% of internet traffic uses it for logging). So PostgreSQL can scale up too(with out of core extensions).

Batteries included? In the timeseries database section of this article, I mentioned that you can use grafana with PostgreSQL (with some effort). You can use this for dashboards, and alerting (amongst other things). However, I don't know of any really good systems (Sentry, Datadog, elkstack) which have first class PostgreSQL support out of the box.

One advantage of having your logs in there is that you can write custom queries quite easily. Want to know how many requests per second from App server 1 there were, and link it up to your slow query log? That's just a normal SQL query, and you don't need to have someone grep through the logs... normal SQL tools can be used. When you combine this functionality with existing SQL analytics tools, this is quite nice.

I think it's good enough for many small uses. If you've got more than 100 nodes, or are doing a lot of events, it might not be the best solution (unless you have quite a powerful PostgreSQL cluster). It does take a bit more work, and it's not the road most traveled. However it does let you use all the SQL analytics tools with one of the best metrics and alerting systems.

More reading.

Queue for collecting data

When you have traffic bursts, it's good to persist the data quickly, so that you can queue up processing for later. Perhaps you normally get only 100 visitors per day, but then some news article comes out or your website is mentioned on the radio (or maybe spammers strike) -- this is bursty traffic.

Storing data, for processing later is things that systems like Kafka excel at.
 Using the COPY command, rather than lots of separate inserts can give you a very nice speedup for buffering data. If you do some processing on the data, or have constraints and indexes, all these things slow it down. So instead you can just put it in a normal table, and then process the data like you would with a queue.

A lot of the notes for Log storage, and Queuing apply here. I guess you're starting to see a pattern? We've been able to use a few building blocks to implement efficient patterns that allow us to use PostgreSQL which might have required specialized databases in the past.

The fastest way to get data into PostgreSQL from python? See this answer [1] where 'COPY {table} FROM STDIN WITH BINARY' is shown to be the fastest way.

More reading.

High availability, elasticity.

“Will the database always be there for you? Will it grow with you?”
To get things going quickly there are a number of places which offer PostgreSQL as a service [3][4][5][6][7][8]. So you can get them to setup replication, monitoring, scaling, backups, and software updates for you.

The Recovery Point Objective (RPO), and Recovery Time Objective (RTO) are different for every project. Not all projects require extreme high availability. For some, it is fine to have the recovery happen hours or even a week later. Other projects can not be down for more than a few minutes or seconds at a time. I would argue that for many non-critical websites a hot standby and offsite backup will be 'good enough'.

I would highly recommend this talk by Gunnar Bluth - "An overview of PostgreSQL's backup, archiving, and replication". However you might want to preprocess the sound with your favourite sound editor (eg. Audacity) to remove the feedback noise. The slides are there however with no ear destroying feedback sounds.

By using a hot standby secondary replication you get the ability to quickly fail over from your main database. So you can be back up within minutes or seconds. By using pgbarman or wall-e, you get point in time recovery offsite backup of the database. To make managing the replicas easier, a tool like repmgr can come in handy.

Having really extreme high availability with PostgreSQL is currently kind of hard, and requires out of core solutions. It should be easier in version 10.0 however.

Patroni is an interesting system which helps you deploy a high availability cluster on AWS (with Spilo which is used in production), and work is in progress so that it works on Kubernetes clusters. Spilo is currently being used in production and can do various management tasks, like auto scaling, backups, node replacement on failure. It can work with a minimum of three nodes.

As you can see there are multiple systems, and multiple vendors that help you scale PostgreSQL. On the low end, you can have backups of your database to S3 for cents per month, and a hotstandby replica for $5/month. You can also scale a single node all the way up to a machine with 24TB of storage, 32 cores and 244GB of memory. That's not in the same range as casandra installations with thousands of nodes, but it's still quite an impressive range.

More reading.

Column store, graph databases, other databases, ... finally The End?

This article is already way too long... so I'll go quickly over these two topics.

Graph databases like Neo4j allow you to do complex graph queries. Edges, nodes, and hierarchies. How to do that in PostgreSQL? Denormalise the data, and use a path like attribute and LIKE. So to find things in a graph, say all the children, you can pre-compute the path inside a string, rather than do complex recursive queries and joins using foreign keys.
SELECT * FROM nodes WHERE path LIKE '/parenta/child2/child3%';
Then you don't need super complex queries to get the graph structure from parent_id, child_ids and such. (Remember before how you can put a trigram index for fast LIKEs?) You can also use other pattern matching queries on this path, to do things like find all the parents up to 3 levels high that have a child.

Tagging data with a fast LIKE becomes very easy as well. Just store the tags in a comma separated field and use an index on it.

Column stores are where the data is stored in a column layout, instead of in rows. Often used for real time analytic work loads. One the oldest and best of these is Kdb+. Google made one, Druid is another popular one, and there are also plenty of custom ones used in graphics.

But doesn't PostgreSQL store everything in row based format? Yes it does. However, there is an open source extension called cstore_fdw by Citus Data which is a column-oriented store for PostgreSQL.

So how fast is it? There is a great series of articles by Mark Litwintschik, where he benchmarks a billion taxi ride data set with PostgreSQL and with kdb+ and various other systems. Without cstore_fdw, or parallel workers PostgreSQL took 3.5 hours to do a query. With 4 parallel workers, it was reduced to 1 hour and 1 minute. With cstore_fdw it took 2 minutes and 32 seconds. What a speed up!

The End.

I'm sorry that was so long. But it could have been way longer. It's not my fault...

PostgreSQL carries around such a giant Tool Chest.

Hopefully all these words may be helpful next time you want to use PostgreSQL for something outside of relational data. Also, I hope you can see that it can be possible to replace 10 database systems with just one, and that by doing so you can a gain significant ops advantage.

Any corrections or suggestions? Please leave a comment, or see you on twitter @renedudfield
There was discussion on hn and python reddit.

Tuesday, February 07, 2017

Gradual Packaging, with python - Part two

Gradual Packaging - allow packaging simple python code with very little effort. Later on (if needed) you can create a more polished package. Follow conventions, support the simple cases. If you need to go out of the simple cases, allow a path towards better packaging.

Evolution of your python code.

  1. in your ipython session messing around.
  2. paste it into -> (everything starts with a bad name in a single .py file)
  3. -> (then you add a test. Or did you do this first?)
  4. -> (now you rename the file to something better)
  5. afolderoffiles/ -> (now there is a folder of files)
  6. add docs/ and .travisci CI files
  7. configure coverage testing
  8. configure an annoying mypy static type checking option
  9. add flakes testing config tweak
  10. support some weird debian derived distro
  11. appveyor CI added next to travisci
  12. pytest config (change because... ARG reasons).
  13. add a requirements.txt, and modify your setup file to have them too.
  14. remove testing stuff out of requirements.txt into
  15. ... (config tweaks happen over the weeks and months)...
  16. ...
  17. Giant ball of 30 config file soup. -> (We have arrived at the modern repo!)

"Cool hackathon! What did you do?  - I packaged my python code, added test support, setup a nice git repo, a pypi page, a readthedocs page, configured travisci, search the internet for hours on how to upload a python package. Nice one - look at my finished app."
Get something out the door really quickly, and gradually improve the packaging. Or maybe your code isn't all that special and it's ok using the default packaging choices.

So, how do we support this workflow?

I started making a tool for doing the zero config releases. It's not done yet, but it is almost able to release itself. This follows on from a series of blog posts and experiments on:
My mission is to improve python packaging for newbies, those in the digital arts community. People using pygame. Also for 90% of python programmers who never release a python package, and yet work with python every day. This is based on feedback from different people in the python community who teach python programming, my own experience teaching it and mentoring teams. The scientific python community is another group which finds it hard to create python packages.

The humble experiment so far.
 (still not sure if it's a good idea... but we will see)


File layouts (based on simple and common layouts often used):

The basic steps at runtime are these.

  1. Gather facts, like author name and email. (much like ansible if you know it)
  2. Create, setup.cfg, files in a temp folder
  3. build sdist in that temp folder
  4. Upload files to pypi (eventually other release targets, including pyweek)
  5. tag a new version in git (if they are using git)

Tell the user what is happening.

The tool should also teach people about python packaging. It should allow you to see the generated file, setup.cfg, and It should point to the Python Packaging User Guide. It should print out the browser for creating a pypi account, and tell people what config files they need to fill in. It shouldn't require git, or a github account, but support that workflow if they do.

Tell them why adding each bit of config is useful.

A single .py file is the smallest package of python code.

The simplest package of python code is a .py file. It can be copied into a project and it can work. Or it can be copied into the site-packages folder, and the module will work. People upload python code to share all the time - on websites all over the internet.

Pip supports installing from git repos, from hg repos, from alternative different package indexes, from requirements.txt files, and is adding support for .toml config files. It supports installing from folders on the file system. It supports a .egg file format. It supports a .whl file.

But it doesn't support the simplest, most elegant, oldest of python packages - the single .py file.

Where does the packaging metadata live?

Technically python doesn't need any metadata to install a simple .py file. I mentioned in the first two parts of this series various places where data can be gathered. ~/.pypirc files, .gitrc files, .hgrc files. It can find versions as git tags. It can find __author__, __license__, __version__ variables inside files. description, and longdescription are found at the top of the .py file in the docstring.

It can use what packaging data that has been added already. If you add a requirements.txt, then we should use that. If you add a README.rst, then use that.

Gradually add packaging information as you go, as you need it. Not when your packaging tool thinks you need it.

Why not template generators like sampleproject and cookiecutter?

These tools have their place. They are good if you want to tweak all the config in there, and you know how all the tools work. However, based on feedback from people, it's all too complex still. They want to a couple of .py files and that's it. Renaming code easily is important for some types of python code - especially when you don't even know where your experiment is going. Naming things is hard!
They came to python for the elegance, not the packaging config files.
But they still want to share code easily.

Where next with the pyrelease experiment?

First I want to get the tool into shape so it can at least release itself. I'm going to Gradually Package the humble pyrelease - by keeping it to one file from the beginning :)

It should support single file modules, packages, and also /data/ folders. The simple layouts that people use. As well it supports making a script automatically if it finds a main(). As well it finds dependencies by parsing the python code (I'll add requirements.txt later). So if you import pygame, click, flask etc... it adds them to install_requires in the
  • I want to add logging of facts. eg. 'Found author: "Rene" in ~/.gitrc". Could not find author in ~/.hgrc
  • Suggest additions for missing things. eg. How to create a pypi account, how to make ~.pypirc, git.
  • Have to make uploading to pypi easier, especially when things go wrong.
  • thinking of removing the setuptools dependency... only thing I use it for is find_packages so far. I've been writing another more modern implementation of that anyway. Remove distutils dep too? [0]
  • Pynsist support (or py2exe, pyinstaller, whatever)
  • tests, and tests against test . pypi . python . org
  • "add setup files into this repo" for when the tool fails to be good enough.
  • notice telling people to go to packaging. python .org if they want more
  • decide on convention for screenshots (probably screenshots/ folder)
  • bitbucket, and better hg support.
  • pyweek upload support.
  • try releasing a few more games with it.
  • watch some other people try and use it.
  • keep blogging, and getting feedback.

​A nice thing so far, is that I was easily able to rename the project three times. Just by renaming files and folders. It was first called '' then '' then '', and finally ''. Normally you'd need to modify 10 config files when things happen.

Anyway... I'll try some more days on it and see how it turns out.

[0] Thomas Kluyver - an explanation of some philosophy on 'flit' the tool for minimal config distribution that doesn't use distutils or setuptools.