Thursday, February 04, 2010

python - unifying c types from different packages.

Python already has a number of objects to represent c types. However, there is a need to improve interoperability between systems using these c types. Below I explain the need, and discuss existing efforts to address it. I then discuss ways to transparently translate between the various type systems without each system needing to know about the others.

In the ctypes library - you can represent an unsigned 32bit integer with ctypes.c_uint32.

The array and struct modules have their own type codes. For example, 'L' represents a C unsigned long with a minimum of 4 bytes: typically 4 bytes on 32bit systems and 8 on 64bit Unix systems.
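A quick way to check what the 'L' type code means on the machine at hand is to ask the array and struct modules for its size. This is a minimal sketch using only the standard library. The native sizes it prints depend on the platform, so none are stated here.

```python
import array
import struct

# Size of 'L' (C unsigned long) as the array module sees it on this platform.
array_size = array.array('L').itemsize

# struct uses the platform's native size by default ...
native_size = struct.calcsize('L')

# ... but a '=' (or '<', '>') prefix selects the standard size of 4 bytes.
standard_size = struct.calcsize('=L')

print(array_size, native_size, standard_size)
```

So the same code can mean different sizes depending on which module reads it, and with which prefix.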

numpy, cython, pyopengl and other python extensions have their own types representing c types too. Most extensions which link up to languages which use static typing represent basic c types to python in some way.

Not only libraries, but various compilers and translation tools also use c types. For example tinypyC++, cython, swig, etc. Also type inference is done from things like shedskin, and rpython - but they represent types internally with their own type objects.

Standardising on one set of c type objects or string codes would give some compatibility advantages. However, existing packages will find it hard to change for backwards compatibility reasons. A mapping between the various types should provide plenty of the advantages. For example, translating from a ctypes type to a numpy type should be fairly simple.
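To give a feel for how small such a translation can be, here is a rough sketch that maps an unsigned ctypes integer type to an array module type code of the same size. The array module stands in for numpy only to keep the example free of third-party dependencies, and ctypes_to_typecode is a made-up helper, not part of any library.

```python
import array
import ctypes

# Unsigned integer type codes understood by the array module.
UNSIGNED_CODES = 'BHILQ'

def ctypes_to_typecode(ctype):
    """Return an array type code with the same size as an unsigned ctypes type.

    Hypothetical helper: it only handles unsigned integer types.
    """
    wanted = ctypes.sizeof(ctype)
    for code in UNSIGNED_CODES:
        if array.array(code).itemsize == wanted:
            return code
    raise TypeError("no array type code for %r" % (ctype,))

code = ctypes_to_typecode(ctypes.c_uint32)
values = array.array(code, [1, 2, 3, 4])
print(code, values.itemsize)
```

Which code comes back depends on the platform, which is exactly why a shared mapping is more useful than everyone hard coding their own.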

Here you can see that numpy already integrates with ctypes, the array module type codes and PyOpenGL:

>>> import numpy, cython, ctypes, OpenGL.GL

>>> numpy.array([1,2,3,4.2], ctypes.c_uint32)
array([1, 2, 3, 4], dtype=uint32)

>>> numpy.array([1,2,3,4.2], numpy.uint32)
array([1, 2, 3, 4], dtype=uint32)

>>> numpy.array([1,2,3,4.2], 'L')
array([1, 2, 3, 4], dtype=uint32)

>>> numpy.array([1,2,3,4.2], OpenGL.GL.GLuint)
array([1, 2, 3, 4], dtype=uint32)

>>> numpy.array([1,2,3,4.2], cython.uint)
------------------------------------------------------------
Traceback (most recent call last):
File "", line 1, in
TypeError: data type not understood

Pretty cool hey? Numpy already knows about many of the type variables available in the python ecosystem. With the notable exception of cython.

I think there is a need to try and standardise the use of c type variables - so that more code can interoperate without each system needing to know about every other system's type objects. Alternatively, a translation layer can be put in place.

For example, an adaptor could look something like this (type_registry is hypothetical):
# this registers two types which are the same.
>>> type_registry.register_types(numpy.uint32,
... cython.uint)

# here numpy.array does not know about cython directly,
# but it can look up the type we just registered to find the numpy equivalent.
>>> numpy.array([1,2,3,4.2], cython.uint)
array([1, 2, 3, 4], dtype=uint32)

# if numpy does not know about the adaptor registry then we can still
# use the registry, if only in an uglier, non transparent way,
# by calling a translate function directly:
>>> numpy.array([1,2,3,4.2],
... type_registry.translate(cython.uint))
array([1, 2, 3, 4], dtype=uint32)
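The registry itself does not need to be complicated. Below is a minimal sketch of what a type_registry object could look like, assuming a plain dict is enough. TypeRegistry, register_types, translate and my_uint are all invented for illustration, and ctypes.c_uint32 is used as the canonical type only because it ships with python.

```python
import ctypes

class TypeRegistry:
    """Hypothetical registry that groups type objects meaning the same C type."""

    def __init__(self):
        self._canonical = {}

    def register_types(self, canonical, *aliases):
        """Record that each alias is the same C type as `canonical`."""
        self._canonical[canonical] = canonical
        for alias in aliases:
            self._canonical[alias] = canonical

    def translate(self, type_obj):
        """Return the canonical type for `type_obj`, or raise TypeError."""
        try:
            return self._canonical[type_obj]
        except KeyError:
            raise TypeError("data type not understood: %r" % (type_obj,))

class my_uint:
    """Stand-in for another package's own unsigned int type object."""

type_registry = TypeRegistry()
type_registry.register_types(ctypes.c_uint32, my_uint, 'uint32')

translated = type_registry.translate(my_uint)
print(translated is ctypes.c_uint32)
```

Neither package has to import the other here. Only the code doing the registering needs to know about both.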

Instead of an adaptor, a magic variable could be used which would contain the 'standard c type variable' from python. For example - cython.uint.__ctype__ == ctypes.c_uint32. Then numpy could look for a __ctype__ variable and use that - without having to be extended for every system that is made. One problem with a magic variable over registered types is that some python objects can not have those magic variables assigned. For example, try adding a __ctype__ variable to an int instance - it won't work.
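Here is a small sketch of the magic variable idea, along with the int limitation. The __ctype__ name comes from the paragraph above, while my_uint and to_ctype are made up for the example.

```python
import ctypes

class my_uint:
    """Stand-in for a package's own type object carrying the magic variable."""
    __ctype__ = ctypes.c_uint32

def to_ctype(obj):
    """Return obj's standard C type by looking for a __ctype__ attribute."""
    try:
        return obj.__ctype__
    except AttributeError:
        raise TypeError("data type not understood: %r" % (obj,))

found = to_ctype(my_uint)

# Instances of built-in types such as int refuse new attributes,
# which is the limitation described above.
try:
    (5).__ctype__ = ctypes.c_int
    int_accepts_attribute = True
except AttributeError:
    int_accepts_attribute = False

print(found is ctypes.c_uint32, int_accepts_attribute)
```

A consumer only needs the one attribute lookup, but anything that can not grow attributes is left out.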

Either the adaptor, or the magic variable would let cython - and other systems use their own type objects and still have a way to translate the types to the standard python c type variables (when/if they are chosen).

A simple mapping (with a dict) from a package's own types to the standard c type objects/type codes is another method that could be used. This would allow a package to fairly easily hook into the ecosystem. For example cython could have a __c_type_mappings__ magic variable at the top level of its package. Then another package looking to translate a type could look to the package for this __c_type_mappings__ variable. The advantage of this is that many times variables can be injected into a package but not into extension types in the package. On the other hand this feels icky.
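A rough sketch of the package level mapping follows. A throwaway module called fake_pkg plays the part of the third-party package, and translate is an invented helper. It assumes the package can be found through the type's __module__ attribute, which may not hold for every extension type.

```python
import ctypes
import sys
import types

# Build a throwaway module to stand in for a third-party package.
fake_pkg = types.ModuleType("fake_pkg")

class uint:
    """Stand-in for the package's own unsigned int type object."""

uint.__module__ = "fake_pkg"
fake_pkg.uint = uint
# The mapping is injected at the top level of the package,
# so the type objects themselves do not need to be modified.
fake_pkg.__c_type_mappings__ = {uint: ctypes.c_uint32}
sys.modules["fake_pkg"] = fake_pkg

def translate(type_obj):
    """Find type_obj's package and consult its __c_type_mappings__ dict."""
    top_level = type_obj.__module__.split(".")[0]
    package = sys.modules.get(top_level)
    mappings = getattr(package, "__c_type_mappings__", {})
    try:
        return mappings[type_obj]
    except KeyError:
        raise TypeError("data type not understood: %r" % (type_obj,))

result = translate(fake_pkg.uint)
print(result is ctypes.c_uint32)
```

Note that the mapping is added from outside the package, which is the injection advantage mentioned above.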

The c types from the ctypes package seem to be a fairly good choice for this. The PyOpenGL 3.x series uses the ctypes types as its types. eg, OpenGL.GL.GLuint == ctypes.c_uint32. Except ctypes is a fairly big dependency just for a few types.

The buffer pep 3118, being introduced into python/numpy to make buffer sharing between libraries easier, covers a similar use case. However it involves sharing instances of differently typed buffers - and has quite clever semantics for many use cases. The formats from that pep could also probably be used to share type information.

The buffer protocol pep specifies extra format strings over the ones specified in the array module, so as to be able to describe a more complete set of types and memory layouts. So rather than using the ctypes types, it probably makes sense to use the new buffer protocol format codes specified in pep 3118, as they are just strings without any further dependencies on the rest of the ctypes machinery (eg libffi etc). Of course, if you are using ctypes already - then depending on it is not a problem.

Of course ctypes.c_uint32 is more descriptive than 'L' to many people, so the format codes (eg 'L') should just be used for specification. People should still use their own type objects - but provide a translation to the format codes specified in pep 3118, the new buffer protocol.
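Objects that support the buffer protocol already expose their format code, so part of this translation comes for free. The sketch below reads the code through memoryview for a ctypes value and for an array. format_code is a made-up helper name, and the exact ctypes code depends on the platform, so it is not stated here.

```python
import array
import ctypes
import struct

def format_code(obj):
    """Return the PEP 3118 format string an object exports through its buffer."""
    return memoryview(obj).format

ctypes_format = format_code(ctypes.c_uint32(7))
array_format = format_code(array.array('L', [7]))

# ctypes adds a byte-order prefix such as '<'; strip it to compare bare codes.
bare_ctypes_code = ctypes_format.lstrip('<>=@!')

print(ctypes_format, array_format, struct.calcsize(ctypes_format))
```

This only works on instances though. A type object on its own still needs one of the translation methods described earlier.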

The codes specified in pep 3118 will probably need to be expanded as more c types need to be described. For example bit fields and bit depths of types are not described in the pep. Many systems specify the bit depth of the type - numpy, ctypes, opengl, etc. For example they use 'uint32' rather than 'unsigned int'. Also bit fields are becoming more common in C so they should be added to the type code formats in some way too.
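For comparison, ctypes can already express both of these things. The sketch below shows the two naming styles side by side, plus a struct with bit fields. The Flags struct and its field names are invented for illustration.

```python
import ctypes

# ctypes names both styles: c_uint follows the C compiler's 'unsigned int',
# while c_uint32 promises exactly 32 bits.
bits_in_c_uint = ctypes.sizeof(ctypes.c_uint) * 8
bits_in_c_uint32 = ctypes.sizeof(ctypes.c_uint32) * 8

class Flags(ctypes.Structure):
    """Illustrative struct with two bit fields declared on an unsigned int."""
    _fields_ = [
        ("ready", ctypes.c_uint, 1),  # 1-bit field
        ("mode", ctypes.c_uint, 3),   # 3-bit field
    ]

flags = Flags(ready=1, mode=5)
print(bits_in_c_uint, bits_in_c_uint32, flags.ready, flags.mode)
```

The third item in each _fields_ tuple is the bit width, which is the kind of information a format code would need to carry too.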

In conclusion there is a need for interoperability of c types from various python extensions, libraries and python compilers. The pep 3118 format codes, and ctypes types are good candidates to work with for standard c type objects/codes. Adaptor registries, simple mappings, and/or magic variable names could be used to enhance interoperability.

