Tuesday, December 22, 2009

structuring modules/packages and the cdb database for websites and python packages

Integrated modules are nice, but so are modular packages.

How can we have both? That is, keep everything for a module (or sub-package) in one directory, but also have a nice integrated system built on top of that?

Often for one package I will have a file layout like so:

/setup.py
/[package_name]
/[package_name]/[package_module].py
/[package_name]/tests/[package_module]_test.py
/[package_name]/examples/[package_module].py
/[package_name]/docs/[package_module].py

With this layout the tests for every module sit together in one directory, as do the docs and the examples. This is nice if you want to have all of the tests together, all of the docs together, and all of the examples together.

However, all of the modules are then mixed in together, meaning it is harder to separate them and keep one module in its own directory. Having everything for one thing together is nicer for developers I think.

Using namespace packages through setuptools/distribute is one way. There is a draft pep around to put namespace packages into python as well. However, this feels way too heavyweight for me, especially since they add multiple paths to the interpreter. Python then has to do stat and open calls for every one of those paths, which slows down the startup and import speed of python even when not using those packages.

Another way is to use a sub-package for each module. This turns each module into a sub-package.

However, some people put all of their code in an __init__.py file, which is kind of hard to find, and not very descriptive. Also, if you are editing a dozen __init__.py files for a project, it is really quite annoying. It is much better to have the code in a 'modulename.py' file, and then import that in the __init__.py file. This is still slower than using modules directly though, since there are still extra stat calls, and extra file opens and reads.

Using the __init__.py file is still annoying in that you need the extra file, and have to remember what this magic file does.

This short snippet puts the modules of every sub-package onto the import path:
    import os, sys
    # Add each sub-package directory of 'somepackage' to the front of sys.path.
    for f in os.listdir('somepackage'):
        path = os.path.join('somepackage', f)
        if os.path.isdir(path):
            sys.path.insert(0, path)
This is magic itself, and has a pretty bad code smell.

Another issue is playing nicely with things like cx_freeze, py2app, py2exe and the many other tools for managing python packages and modules... which don't know about namespace packages, or my custom magic.

I could write an import hook, or try out importlib in python3.1, but those would probably have similar issues to other non-standard ways.

So in the end, I think I'll stick with a fairly standard method of doing it... using a __init__.py file which imports the proper module into its namespace.

This lets me do import somepackage.somemodule and have it work. I can also "cd somepackage/somemodule/; python somemodule.py" and have it work.
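
Here is a minimal sketch of that arrangement, written for a current python 3. It builds a throwaway copy of the layout in a temp directory, so the file contents (the hello function in particular) are made up for illustration. The sub-package's __init__.py does nothing except pull the real module's names into its own namespace.

```python
import importlib
import os
import sys
import tempfile


def build_demo_package(root):
    """Create somepackage/somemodule/ with an __init__.py that re-exports somemodule.py."""
    sub = os.path.join(root, "somepackage", "somemodule")
    os.makedirs(sub)
    # The top level package needs no code of its own.
    with open(os.path.join(root, "somepackage", "__init__.py"), "w") as f:
        f.write("")
    # The real code lives in a descriptively named file (hypothetical content).
    with open(os.path.join(sub, "somemodule.py"), "w") as f:
        f.write("def hello():\n    return 'hello from somemodule'\n")
    # The __init__.py only imports the proper module into the package namespace.
    with open(os.path.join(sub, "__init__.py"), "w") as f:
        f.write("from .somemodule import *\n")
    return sub


root = tempfile.mkdtemp()
build_demo_package(root)
sys.path.insert(0, root)
importlib.invalidate_caches()

import somepackage.somemodule

print(somepackage.somemodule.hello())
```

Because somemodule.py itself contains no package-relative imports, it can still be run directly from inside its own directory.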


/setup.py
/[package_name]
/[package_name]/[package_module]/
/[package_name]/[package_module]/__init__.py
/[package_name]/[package_module]/[package_module].py
/[package_name]/[package_module]/[package_module]_test.py
/[package_name]/[package_module]/[package_module]_example.py
/[package_name]/[package_module]/[package_module]_docs.txt
This is how I'll structure pywebsite, with a separate directory for each sub-module. This gives modularity, makes it scalable to many modules, and will simplify contributors' lives. Since it uses standard packaging it should be compatible with most tools. Another bonus is that from within the source download I can just do:
import pywebsite
without having to call the setup.py script or install it. If a module changes from a .py file to a .so or a .dll or the other way around I won't have any issues either.

It is now a little harder for sub-packages within a package to refer to other sub-packages, but it can be done with the explicit relative imports available in newer versions of python (2.5 and up).
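
As a rough sketch of one sub-package referring to a sibling, the snippet below builds a throwaway package in a temp directory. The names relpkg, alpha, beta, greet and greet_twice are all invented for the example. The beta module reaches its sibling with an explicit relative import.

```python
import importlib
import os
import sys
import tempfile


def write(path, text):
    """Write text to path, creating parent directories as needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)


root = tempfile.mkdtemp()
pkg = os.path.join(root, "relpkg")
write(os.path.join(pkg, "__init__.py"), "")
# Sub-package 'alpha': the code lives in alpha.py, re-exported by __init__.py.
write(os.path.join(pkg, "alpha", "__init__.py"), "from .alpha import *\n")
write(os.path.join(pkg, "alpha", "alpha.py"), "def greet():\n    return 'alpha'\n")
# Sub-package 'beta' refers to its sibling: '..' is relpkg, so '..alpha' is relpkg.alpha.
write(os.path.join(pkg, "beta", "__init__.py"), "from .beta import *\n")
write(
    os.path.join(pkg, "beta", "beta.py"),
    "from ..alpha import greet\n\n"
    "def greet_twice():\n"
    "    return greet() + greet()\n",
)

sys.path.insert(0, root)
importlib.invalidate_caches()

import relpkg.beta

result = relpkg.beta.greet_twice()
print(result)
```

One trade-off to be aware of: a module that uses relative imports can no longer be run directly as a script from its own directory, since python then has no parent package to resolve the dots against.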

Sub-packages are more heavyweight than just using modules in your package... but they do seem cleaner and more extensible. They are also not as heavyweight as namespace packages.

Any other issues with this approach?

namespace packages


The method outlined above still does not allow me to split up a package into multiple separate files. This is ok though, since to start with I want to keep most of the modules in the one place... in the one repository, with the one bug tracker. However, designing for extensibility from the start is useful, so let's consider how it can be done.

This is where namespace packages are useful.

There is this draft pep: pep-0382, and also the setuptools/distribute namespace packages (which is what everyone is using to do them). The setuptools/distribute documentation has a section on namespace packages.

It should be possible to use some sort of hack, so that once you import a package, it searches for other packages with the same namespace.

You can use a package's __path__ attribute to tell it to look in other paths when importing modules.

>>> import pywebsite
>>> pywebsite.__path__
['pywebsite']
>>> pywebsite.__path__.append('lala')
>>> import pywebsite.bla
>>> pywebsite.bla.__file__
'lala/bla.py'

Using __path__ is outlined in the Packages in Multiple Directories part of the python tutorial.
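
One standard library helper that builds on __path__ is pkgutil.extend_path. It scans sys.path for other directories named after the package and appends them to __path__. The sketch below is only an experiment, not something I have tried with the freezing tools: it fakes two separately installed directories that both contain an 'otherpackage' package. The module names core and doit and their contents are made up.

```python
import importlib
import os
import sys
import tempfile

# Both copies of the package carry this __init__.py (an assumption of the sketch).
NAMESPACE_INIT = (
    "from pkgutil import extend_path\n"
    "__path__ = extend_path(__path__, __name__)\n"
)


def make_portion(module_name, source):
    """Create <tmp>/otherpackage/ holding one module, and return <tmp>."""
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, "otherpackage")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write(NAMESPACE_INIT)
    with open(os.path.join(pkg_dir, module_name + ".py"), "w") as f:
        f.write(source)
    return root


# Two separate install locations, as if maintained by different people.
main_root = make_portion("core", "NAME = 'core'\n")
doit_root = make_portion("doit", "NAME = 'doit'\n")
sys.path[:0] = [main_root, doit_root]
importlib.invalidate_caches()

import otherpackage.core
import otherpackage.doit

print(otherpackage.__path__)
print(otherpackage.doit.__file__)
```

Only the first __init__.py found on sys.path is executed. The other directory just contributes its modules, and every sys.path entry still gets scanned at import time, so the startup cost concern does not go away.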

As an example, say I had an 'otherpackage' package, and then someone else wanted to maintain part of that package, or we wanted to separate part of it out. Let's call this namespace package 'otherpackage_doit'. It could install itself as a directory and package called 'otherpackage_doit'. Then import otherpackage_doit would work fine. However import otherpackage.doit would not work. You can't just call the package 'otherpackage.doit' either, since python will first look in the otherpackage package for the doit package, making import otherpackage.doit fail.

From a user's discovery perspective, I would expect otherpackage.doit to be in otherpackage/doit, so that's where I'd look first. Installing into that directory would probably be best then. However that is not a very good method. After that, I'd probably do "print(otherpackage.doit.__file__)". Or I might do a "locate otherpackage.doit" command.

Really I just wish python3.2 could be changed so that 'otherpackage.doit' package is automatically a namespace package - without having to mess around with weird magic .pth files or declaring things in setup files like setuptools does.

So how can we retrofit (hack) existing pythons to do this for us? We need to get the python import machinery to search for otherpackage.* packages outside of the otherpackage directory. I'm sure it's possible with python somehow... Inserting 'otherpackage.doit' into sys.path does not work. You can't even have a package name with a '.' in it.

So I'll give up on namespace packages for now until a suitable option presents itself... or I have more time for research. Separate packages will have to live in the same file, but can still be separate with source control tools - like bzr, svn, github etc.

Still not fast enough... cdb databases for websites and python packages


However, the standard python package method is not the fastest. Supporting cgi operation in a web library is a good idea, because many webhosting platforms still only support python through cgi. So loading heaps of files for every cgi request is not an option. It is possible to get acceptable performance out of cgi and python... it is just that many of the large frameworks have poorly optimized loading, and rely on long running processes to avoid the slow load times. Using django via cgi on an embedded 130MHz arm with a limit of 10MiB is not going to work very well (or at all).

So how to make it faster for embedded/cgi apps?

Firstly, an executable can be made using tools like py2exe. This can pack all of your data inside the executable.

One common method people try is to use the zip format. This works fairly well but is not optimal. Zip files are nice as they are supported by OS level tools and file managers, so this will be one option to use. The downside is that it makes the files harder to edit. I see .zip files as an optimisation that hinders usability. Especially bad are .egg files (which are just .zip files), as they make it harder to debug or change programs. So, like .pyc files, I think the zip file should be generated as needed, while the full source tree stays there to change, which is very useful. If someone changes the source, the zip file should be regenerated.
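
A rough sketch of the "regenerate as needed" idea follows, using only the standard library. The ensure_zip and newest_mtime helpers and the zipdemo package are names invented for this example. The staleness check is a plain mtime comparison, which is an assumption about what "changed" should mean.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def newest_mtime(src_dir):
    """Return the most recent modification time of any file under src_dir."""
    return max(
        os.path.getmtime(os.path.join(d, name))
        for d, _, names in os.walk(src_dir)
        for name in names
    )


def ensure_zip(src_dir, zip_path):
    """Rebuild zip_path from src_dir if it is missing or stale. Return True if rebuilt."""
    if os.path.exists(zip_path) and os.path.getmtime(zip_path) >= newest_mtime(src_dir):
        return False
    with zipfile.ZipFile(zip_path, "w") as zf:
        for d, _, names in os.walk(src_dir):
            for name in names:
                full = os.path.join(d, name)
                # Store paths relative to src_dir so the zip root acts like a sys.path entry.
                zf.write(full, os.path.relpath(full, src_dir))
    return True


# Demo: a one-file package in a temp source tree (hypothetical content).
src = tempfile.mkdtemp()
os.makedirs(os.path.join(src, "zipdemo"))
with open(os.path.join(src, "zipdemo", "__init__.py"), "w") as f:
    f.write("VALUE = 42\n")

# Keep the zip outside the source tree so it never ends up inside itself.
zip_path = os.path.join(tempfile.mkdtemp(), "zipdemo.zip")
rebuilt_first = ensure_zip(src, zip_path)   # no zip yet, so it is built
rebuilt_second = ensure_zip(src, zip_path)  # source unchanged, so it is kept

sys.path.insert(0, zip_path)
importlib.invalidate_caches()
import zipdemo

print(zipdemo.__file__)
```

The source tree stays editable, and the zip is just a cache in the same spirit as a .pyc file.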

Another option is a constant database (cdb). cdb is a very simple constant database format used in things like djbdns and qmail, among others. cdb happens to be one of the fastest constant databases, if not the fastest [benchmarks pdf]. cdb is perfect for python packages that are not meant to change, since lookups are so quick.

cdb is also pretty good for serving data from websites. Since much of a website's data is often mostly static (constant), a cdb key/value database is a nice optimisation over files on the file system. There are fewer syscalls, and fewer latency issues.
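
To give a feel for how simple the format is, here is a small pure python sketch of a cdb writer and reader, following my understanding of the published format: a 2048 byte header of 256 table pointers, then the records, then 256 hash tables probed linearly. It has not been checked against files made by the real cdb tools, and the function names and the sample keys are made up for the example.

```python
import os
import struct
import tempfile


def cdb_hash(key):
    """The cdb hash: start at 5381, then h = (h * 33) ^ byte, kept to 32 bits."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


def write_cdb(path, items):
    """Write a dict of bytes -> bytes as a constant database."""
    buckets = [[] for _ in range(256)]
    with open(path, "wb") as f:
        f.write(b"\0" * 2048)  # placeholder header, filled in at the end
        pos = 2048
        for key, value in items.items():
            f.write(struct.pack("<II", len(key), len(value)) + key + value)
            h = cdb_hash(key)
            buckets[h & 255].append((h, pos))
            pos += 8 + len(key) + len(value)
        header = b""
        for bucket in buckets:
            nslots = 2 * len(bucket)  # tables are kept half empty
            header += struct.pack("<II", pos, nslots)
            slots = [(0, 0)] * nslots
            for h, rpos in bucket:
                i = (h >> 8) % nslots
                while slots[i][1] != 0:  # linear probing
                    i = (i + 1) % nslots
                slots[i] = (h, rpos)
            for slot in slots:
                f.write(struct.pack("<II", *slot))
            pos += 8 * nslots
        f.seek(0)
        f.write(header)


def read_cdb(path, key):
    """Return the value stored for key, or None if the key is absent."""
    h = cdb_hash(key)
    with open(path, "rb") as f:
        f.seek(8 * (h & 255))
        table_pos, nslots = struct.unpack("<II", f.read(8))
        if nslots == 0:
            return None
        start = (h >> 8) % nslots
        for n in range(nslots):
            f.seek(table_pos + 8 * ((start + n) % nslots))
            slot_hash, rpos = struct.unpack("<II", f.read(8))
            if rpos == 0:  # empty slot ends the probe sequence
                return None
            if slot_hash == h:
                f.seek(rpos)
                klen, dlen = struct.unpack("<II", f.read(8))
                if f.read(klen) == key:
                    return f.read(dlen)
    return None


# Demo: static site content keyed by url path (hypothetical data).
cdb_path = os.path.join(tempfile.mkdtemp(), "site.cdb")
write_cdb(cdb_path, {b"/index.html": b"<h1>hello</h1>", b"/style.css": b"h1 {}"})
print(read_cdb(cdb_path, b"/index.html"))
```

A successful lookup is a handful of seeks and reads on one already open file, compared with a stat and open per file on the file system. The same idea could hold module source for an import hook.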

Zip files can also be used as .jar files by some browsers (firefox), to reduce latency on websites too. See the jar url scheme for details on how to put all your static files into a zip file (jar file).

There are some of my python module_experiments in:
bzr co http://rene.f0o.com/~rene/stuff/module_experiments
http://rene.f0o.com/~rene/stuff/module_experiments.tar.gz

So I have now refactored pywebsite to use a sub-package for each module, so that all the tests, docs and examples for that module are within its own sub-package. Using a zip/cdb file for imports will be left for later, as will namespace packages.
