Using PostgreSQL as a cache?

In the article on his blog Peter asks "How much faster is Redis at storing a blob of JSON compared to PostgreSQL?". Answer: 14x slower.

Seems about right. Usually Redis is about 4x faster for a simple query like that compared to using PostgreSQL as a cache in my experience. It's why so many people use Redis as a cache. But I'd suggest PostgreSQL is good enough to act as a cache for many people.

Django is pretty slow at fetching from PostgreSQL compared to other python options, so this could explain part of the 14x VS 4x difference.

Note that Django should be adding an index because of ForeignKey. However it's possible it isn't being used, or the table may need to analyze stats again. Also note that Django does not support prepared statements built in, and when you don't use prepared statements you do not use the PostgreSQL query cache. Prepared statements can give a 50% speedup often, because they don't have to do the query parsing and do a new query plan each time (and the query cache can actually be used). Additionally, you need to turn the Django config CONN_MAX_AGE from the default of 0, otherwise it does not even reuse PostgreSQL database connections.

PG read optimization tips for readers who may be interested.
  1. Run VACUUM ANALYZE, or have auto vacuuming on.
  2. Check the query plan with EXPLAIN to confirm it is using an index (.explain(verbose=True) on the query in Django). Paste your explain result into if you can not understand it. If there are RED (horribly wrong) results shown, then you need to tweak something.
  3. Tuning PostgreSQL config. Default homebrew PostgreSQL on Mac isn't really configured for speed for example. Easiest is to use a tool like pgtune and plump in your workload numbers Make your changes and try the EXPLAIN again (or run your benchmarks). Make sure to VACUUM ANALYZE between changes for best effect. This is a very application, OS, and hardware dependent task. If you don't know much about this, first try pgtune.
  4. Consider table layout, column statistics optimizations, and rewriting the table in the order of the index with CLUSTER. REINDEX may also be needed for tables used as caches where they are changed very often.
  5. 'Index only scan' can be used where all the data is in the index, so the row doesn't need to be looked at. Not applicable in this case probably, since indexing on the JSON data probably wouldn't be a good idea. If your cache hit ratio is very low(mostly you are doing network lookups) then it could still be a performance increase.
  6. PREPARE statements. In Django there are external packages available for this.
  7. If you are already using a postgresql connection pool like Pgpool-II, then you can use a query cache there pretty easily. See "in memory query cache".

I wouldn't be surprised if these could take it down to 4x-10x with the Django ORM compared to Redis.

Of course you should probably just cache the views at the CDN/web proxy level, or even at the Django view or template level. So you probably won't even hit the Django app most times. [Edit: and this is exactly what Peter does as mentioned in an update.]


peterbe said…
"you should probably just cache the views at the CDN/web proxy level"

Check out the update!

My original intent, that started the blog post, was: I know Redis is faster for these kinds of lookups. I don't want to over-engineer that to match PG match Redis. But I don't trust Redis's persistence. How about I copy some of these to Redis so the web view that depends on this stuff gets faster.

Then, I realized, just cache the final complete outcome by putting a big `Cache-Control` on it and let Nginx handle it from that point on.

Popular posts from this blog

post modern C tooling - draft 5

Is PostgreSQL good enough?

Experiments with new low latency PyPy garbage collector in a thread.