Friday, July 20, 2007

My issues with python ORMs

If a python ORM you know of addresses these issues, please let me know.

Python ORMs break with multiple processes.


Multiple processes are not assumed. Python ORMs all seem to use heavy local caching, which fails when the database is modified by another process. This is unfortunate as for me I like to use different tools for different jobs. Or there might be different people I work with that write tools in different processes. Or even the common case that each web request is run in a different process - or on a different machine.

It is not commonly known that just because something outside of a python app changes a database that the python app will break. Most applications that use databases do not break if the database changes from outside of the application.

Using memcache or something like it seems to be a solution to some of this problem. Also optionally allowing the ORM to not cache certain queries - or even stopping all caching. Caching using python dicts is bad anyway, because of pythons memory wastage.

Constructors are limited to creation.


Insert seems to be assumed to be the most common operation. So python ORMs seem to only allow you to do inserts with the constructors.

Having constructors which get a row based on a primary key would be nice. Since that is one of the most common things I do.
eg. 'p = Person(1)' would get the person with primary key 1.

A row is not the only use of mapping.


Even being able to do no select, insert or update in the constructor is useful.
eg. 'p = Person()' would create a person instance which is so far empty. I use this to shortcut things I want to do with that table - not on a row of that table.
So I can do things like this:
eg. 'rows = p.get_all("active=3")' It's just shorter for me to type this stuff, easier to remember how to do things, and is easier to read(I think).
eg2. p.save({'id':1, "name":"george"})

By using methods on a class which says 'I am acting on this table' you can shorten a lot of code. The two examples above show how much shorter, and simpler select and save are. The methods of the instance don't necessarily act on the row that instance might represent - but on the table that instance represents.

Having to pass each attribute by name.


I like to do code like this:
table.save(vars)

Not this:
t = Table(a=2, b = "asdfsadf", g="123123")

Where vars is a dict that has all of the things I want to save. If vars is a dict from a webpage where you use POST variables, then you should be able to recognize time savings. Also if you update the fields in a table you don't need to update the code here.

That way it can either do an insert or update depending if it already exists or not(determined by a primary key often).

The niceness of this is that I don't need to recode the save part depending on which variables the table takes. Also if there are extra variables passed in they are ignored. I don't need to write each individual attribute for each table.

A workaround would be to override the constructors to do this. But this behavior is not built in by default - and most python ORMs use the python constructor to pass in arguments.

Using python constructors you get a TypeError if you pass it unexpected variables. So for me that is a major problem for using the python constructor for this, and not having a separate save, insert or update methods. The python class creation semantics are not exactly what you want at all times.

Issues can be worked around.


These last issues are caused because I want a python relational mapper, not really an object relational mapper. Python doesn't force you to use objects for everything - neither should a relational mapper for python. Well, maybe it should - if the authors want that.

I work around these issues myself, so I can build my favourite API on top of existing python ORMs. But I'm wondering if I missed some solutions to these issues that are already in the python ORMs? Or why people don't see these as problems?

9 comments:

Doug Napoleone said...

Issue #1:

What ORM are you talking about which has this problem? Django ORM and SQLAlchemy do not suffer from this. For PyCon we used django ORM with FastCGI (5-20 processes) with no issues. I know of NO system which suffers from this. Are you talking about systems which are not transaction aware and do not deal with collisions?

I can understand most of your other issues, but the ones which do not already just work in all the ORM's I know of, strike me as just BAD programing practices. If I caught anyone ignoring passed in information on construction or using a constructor as a static getter, and they tried to explain to me why it was a 'good thing' I would fire them on the spot. In fact I have done almost exactly that.

I would insist on having construction always create a new instance. That is the OO way, and why it is called a 'constructor'. I also insist on having something on the class for the static getter's, in any language. Some systems use a lazy evaluation object system to achieve this. Like in Django: Person.object.all(), Person.objects.get(id=3), Person.objects.exclude(age_gt=30). Most ORM's these days key the 'get' methods off the class in one way or another like this.

As for setting with a dictionary, this is how django already works, and for SQLAlchemy what is wrong with Person.save(**vars)? You don't like the two '*' characters? Also SQLAlchemy, Django, and Storm have methods for ignoring extra info on a set_values for instance (though not on construction). I understand the reason for excluding information in supersets, just not on construction or immediate save. A simple decorator can be created for adding this 'ignore extra' feature to any of the existing ORM constructors (though as I have stated, I see this as a bad thing).

Graham Dumpleton said...

FWIW, one example of a Python web framework where there has been concerns in the past over whether the database layer used works properly in a multiprocess server is TurboGears. I have asked a couple of times through the TurboGears list as to whether there is still truth behind the concerns or not and authoritative answers have been few and far between. The only real comment was to the effect that in TG 2.0 that they would hope that any issues with the database layer if there are any will go away when they move to the database layer used by Pylons.

illume said...

If each processes caches selects, then how are they supposed to know when something changes outside of them self? They need to either do no caching, or use caching that is somehow notified of database changes from outside of them self.

Sounds like I'd fire you on the spot. ;) No not really - I'd never hire you in the first place, or ever work for you, or ever sit in the same room as you. In fact, I'd avoid the country that you live in, and if possible I would move to earth2. No, not really.

Despite the risk of getting fired for trying to explain something to you... here goes.

Construction of a class, and insertion of a row are not intimately tied. Just because an ORM forces you to do this, doesn't mean it's right.

Construction creating a new python instance and creating new data in the db, and Construction getting a new python instance from data in the db can be viewed in a similar way. I mean you are creating an instance in python which maps to something in the database. If that data is already in the database or inserted on Construction are different things - but for many uses the difference does not matter.

Selects are most often the most popular use case, not inserts. So creating instances with selects on Construction rather than inserts makes more sense to me.

yeah, I guess you can make Constructors ignore extra data by always using the **kwargs.

Ignoring extra data allows you to pass in data you don't care about. The validation can use only data it cares about. If enough data is not there, it can complain. For practical purposes, you often have more form information passed through that your save method does not care about.

This data you don't care about includes the x,y position that someone clicked the submit button at - or whole other sections of data. As I don't like to validate data in multiple places, it makes sense to filter what data is valid inside the save method. To do it in one place you have to ignore data it doesn't care about. Otherwise you need a step before saving which asks your model to filter the data first. Or you need to manually specify what data to pass in to the save method.

Doug Napoleone said...

Sorry, that is a bit over the top of me, but as a case in point on a C++ project someone decided that a constructor would instead do a lookup of information and pull from that global pool. This caused far reaching bugs which were extremely difficult to diagnose and debug. It put the project almost a month behind schedule. They were fired. I was not the one who made the decision to fire them, but I did have to do the firing and re-architect the broken system. I hated the entire experience.

I agree that if you allow for creating row instances as python objects on construction, if allowed, would necessitate the 'extra options' feature. I think the difference here is between an Object Relational Mapping, and a Object Relational Model. In the Model, each OO operation (creation, deletion, data change, inheritance). Inheritance == One2One mapping when done right. I really do not care that there is SQL as the backend. Maybe it's the ZODB which is an object store 'database'; there are no rows, just objects and python objects at that. In this example, constructing a new instance returning a pre-existing instance goes directly against everything about the OO model in practice and in implementation.

I do not want SQL in my ORM. In fact I don't want to know about what is storing the data at all. This to me is the major failing of all modern ORM's. In order to use them to their full potential, you must still know all about the back end.

Caching selects? Django does not cache select results. Yes the data is transformed into python objects which are not tied to the database cell, and do not know when that cell changes. This is no different from almost all remote database interfaces (except the extended oracle network system and systems which do not transform the data but do the work 100% on the server and thus are not mappings or models).

This isn't really just a problem of processes, but of DB connections, which can be made from multiple machines. For this modern DB's have row level locks and transactions. There are transactions, and you can detect when a collision has occured and it supports row level locking. These are the same problems that almost all remote database interfaces have. This is a very difficult problem, and I have yet to see it 'properly' solved in any language on any system. As an engineer I am not sure how to do it without very expensive concurrent RPC back to the DB server for every data read on each individual cell, or a proactive post event system. I did work on such an event system once, and the added overhead made it next to useless for large active datasets. Preemptive and reactionary transactions were just better.

mike bayer said...

sqlalchemy provides the "python relational mapper" you describe. the ORM portion of SA is entirely optional and you can just work with SQL constructs that receive and deliver lists and dictionary-types of structures...theyre a great basis from which to build your own CRUD-like tools.

Doug Napoleone said...

A friend just sent me a rather long e-mail explaining exactly how wrong I am about multiple processes, and caching/mapping of select results. I am wrong on multiple levels.

There is a project which does not use SQL statements at all but directly transacts with the backend. By talking directly (via RPC) to the raw relational interface, you do not need to tranlate your python object into SQL statements that are then translated back into the RPC calls anyway. This gives a huge performance boost and solves the 'transactional' model problems. There was a lightning talk given on the system at PyCon, but I do not have any links.

mike bayer said...

mysql-ndb ?

http://dev.mysql.com/doc/ndbapi/en/index.html

Monty Taylor did a lightning talk about it, and he was working on a SQLAlchemy dialect to reference SQLAlchemy expressions directly to NDB without any SQL strings generated; we added some hooks for this to be possible. though i dont know its status, except for this: http://lists.mysql.com/ndb-connectors/27

DataBlaze said...

You said:

"It is not commonly known that just because something outside of a python app changes a database that the python app will break."

I do not understand what you mean here. What is it that breaks? Under what circumstances? What kind of breakage, stale results, exceptions, crash?

I had many situations where I modified the database outside the python instance and have not experienced any kind of errors.

illume said...

Yes, old results.