[Python-ideas] __data__

spir denis.spir at free.fr
Tue Feb 10 13:39:21 CET 2009


I stumbled on an issue recently that led me to reconsider what we actually do and mean when subtyping, especially when deriving from built-in types. Basically, I need to add some information and behaviour to "base data" that can be of any type. (See PS for details.)

When subtyping int, I mean for instances of MyInt to basically "act as" an int. This does not really mean inheriting int's attributes. It means instead that integer-specific operations apply to a very special attribute rather than to the object as a whole. For instance, if n is an instance of MyInt with a value of 2, then:

* "print n" will print "2" instead of "<__main__.MyInt object at 0xb7d56ecc>"
* "1 + n" will compute "1 + 2" instead of trying to add 1 to the object itself.
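
As a minimal sketch of the two behaviours listed above: a direct subclass of int already routes printing and arithmetic to the integer value. The `label` attribute is a hypothetical extra added only for illustration, and the code is written so that it also runs on current Python versions.

```python
class MyInt(int):
    """An int subclass carrying one extra, purely illustrative attribute."""

    def __new__(cls, value, label=""):
        # int is immutable, so the value has to be fixed in __new__
        self = int.__new__(cls, value)
        self.label = label  # hypothetical additional information
        return self


n = MyInt(2, label="count")
print(str(n))   # the integer value is shown, not an object address
print(1 + n)    # integer addition is applied to the value
print(n.label)  # the added information is still available
```

Note that the result of `1 + n` is a plain int: the extra information does not survive the operation.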

Relevant method calls on the object are passed to an implicit data attribute. Concretely, this has the nice side-effect of sparing us from writing "n.data" (or "n.value") all the time. But in fact, from the user's point of view, an instance of a derived type such as MyInt basically *is* an int with some additional or changed features. So that implicit indirection is not only nice: it really reflects the intention.

In other words, an implicit indirection operates. This indirection is precisely what we need to write down explicitly when, instead of deriving from int, we simulate the built-in type's behaviour:

class MyInt(object):
	def __init__(self, value):
		self.value = value
	def __add__(self, other):
		return self.value + other
		# or more precisely: int.__add__(self.value, other)
	def __str__(self):
		return str(self.value)

Whatever their actual implementation, built-in types will target operations toward the real, basic, data item that represents the value.
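
One caveat with the simulation above: for "1 + n", Python first tries int's own addition, which does not know the wrapper, and then looks for `__radd__` on the right-hand operand. A sketch that covers both operand orders, assuming the same `value` attribute (the class name `MyIntWrapper` is hypothetical):

```python
class MyIntWrapper(object):
    """Simulates an int by forwarding operations to self.value."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # handles n + other
        return self.value + other

    def __radd__(self, other):
        # handles other + n, tried after the left operand gives up
        return other + self.value

    def __str__(self):
        return str(self.value)


n = MyIntWrapper(2)
print(n + 1)
print(1 + n)
print(str(n))
```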

Ordinary inheritance, by contrast, does not cause objects to "act as" some base, silently redirecting relevant operations to an implicit data field -- unless this is explicitly written. There is no special data at all, and no specific behaviour targeted to that data. Moreover, as shown in the simulation case above, making a custom type does not imply any inheritance; rather, it simply requires properly targeting the relevant methods. There is *no base type*: there is *base data* instead -- which has its own type.

Deriving so that instances "act as" base type instances ultimately requires only telling the interpreter which attribute holds the base data. Then, when a method is called on an instance, if this method is defined neither for its type nor for this very instance, the interpreter can (try to) apply it to the base data instead.
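
Part of this fallback can be approximated today with `__getattr__`, which is called only when normal lookup fails. The following is a rough sketch, not the proposed interpreter behaviour: the class name `Delegating` is hypothetical, and `__data__` is used here as an ordinary class attribute naming the data field.

```python
class Delegating(object):
    """Falls back to the attribute named by __data__ when lookup fails."""

    __data__ = "value"  # name of the attribute holding the base data

    def __init__(self, value):
        self.value = value

    def __getattr__(self, name):
        # Reached only when lookup on the instance and its type has failed.
        if name == self.__data__ or name.startswith("__"):
            raise AttributeError(name)
        return getattr(getattr(self, self.__data__), name)


s = Delegating("abc")
print(s.upper())  # str.upper is applied to the base data

try:
    s + "d"
except TypeError:
    # Operators are looked up on the type and never reach __getattr__.
    print("operators are not forwarded by __getattr__")
```

The limitation shown at the end is why named methods can be forwarded this way while operators and built-ins such as str() or len() still need explicit special methods on the class.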

If deriving can actually work that way, then there is no reason to declare a parent class at definition time, thus fixing forever what kind of data is allowed. The fictional code snippets below give the interpreter all the information it needs:

class MyInt(object):	# no specific base
	def __init__(self, value):
		int.__init__(self, value)

class MyInt(object):
	__data__ = "value"
	def __init__(self, value):
		self.value = value

class MyInt(object):
	def __init__(self, value):
		self.__data__ = value

The first one has the drawback of allowing several base.__init__ calls (a desirable feature?), obviously leading to method name clashes. Also, if the type is not predictable, we need to use an even more complicated syntax such as
which by the way shows how redundant such a call is (value has to be needlessly specified twice).
The second form simply tells which field is to be considered as holding the base data, so that the interpreter can properly process the indirection. The third version, even more simply, stores the base data in a magic field name. In both cases, the absence of a __data__ attribute means there is no base data, so inheritance (from object or any base class or classes) and method calls operate normally.

Probably some will consider that the present way of achieving the same goal (combining inheritance with possible call to base.__init__ if needed) is simpler, clearer, etc... I will not argue on that point. Still, consider the following:

class MyInt(int):
	def __init__(self, value):
		data = int(value)
		int.__init__(self, data)

class MyInt(int):
	def __init__(self, value):
		data = int(value)
		super(MyInt, self).__init__(data)

class MyInt(object):
	__data__ = "data"
	def __init__(self, value):
		self.data = int(value)

class MyInt(object):
	def __init__(self, value):
		self.__data__ = int(value)

I find the third and fourth versions self-documenting. The fourth one is simpler and more straightforward, but the third one makes the existence of __data__ more obvious (so we know at first sight that this type is able to operate silent indirection).

Moreover -- and this is the reason why I first had to study this point more closely -- the present syntax requires the base type to be unique *and* known at design time, which need not be the case. I do not know, however, whether there are many use cases for flexible base types. Still, versions 3 and 4 allow it without any overhead: __data__ (or data) can be of any type, and can even change at runtime: what's the point?
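
To make the flexible-base-data idea concrete, here is a sketch that emulates it by hand in today's Python. The class name `Proxy` and the choice of forwarded operations are assumptions for illustration; `__data__` is just an ordinary instance attribute here, with no special meaning to the interpreter.

```python
class Proxy(object):
    """Hypothetical wrapper forwarding a few operations to __data__."""

    def __init__(self, value):
        self.__data__ = value  # base data of any type

    def __str__(self):
        return str(self.__data__)

    def __add__(self, other):
        return self.__data__ + other

    def __radd__(self, other):
        return other + self.__data__

    def __len__(self):
        return len(self.__data__)


r = Proxy(2)
print(r + 1)       # int addition on the base data
r.__data__ = "ab"  # the base data changes type at runtime
print(r + "c")     # now str concatenation
print(len(r))      # meaningful only once the data has a length
```

Every forwarded operation has to be spelled out, which is exactly the boilerplate the proposal would make implicit.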

la vita e estranya

PS: use case.

I started to write a kind of parser generator based on a grammar inspired by PEG (http://en.wikipedia.org/wiki/Parsing_expression_grammar), like pyparsing. The way I see it implies several requirements for the parse result type:

-0- Basically, valid results of successful matches can be "empty" (e.g. from an optional or lookahead expression), "single" (simple string), or "sequential" (when generated from repetition or sequence patterns).
-1- Each (sub)result knows its own "nature" (implemented as a reference to the pattern that generated it), so that when patterns contain choices the client code does not need to partially reparse the result only to know what the result actually is -- which the parser has already determined! Moreover, the client code needs to include only minimal knowledge about the grammar, as long as the parser passes on all the information it has collected.
-2- Global and partial results can be output in several practical formats including a "treeView" that +/- represents the parse tree.
-3- Results can be transformed to prepare further processing -- which may well be only final output. There are several standard methods needed for this (glue a result sequence into a single string, extract nested results, keep only the parse tree's leaves); results can also be converted (e.g. an int representation into an int) or changed in whatever manner using ad hoc functions: eventually result data can well be of custom types. (This is actually a very interesting feature: for instance when parsing wikitext, a "styled_text" result may instantiate a StyledText object as an abstract representation of it, that can e.g. write itself to a DB, another wiki lang, x/html...)

The conceptual nature of a result requires the object to implicitly redirect the appropriate methods to the real result data it holds. For instance, the client code should be able to write an addition on a result if adding this result makes sense, and this operation should be implicitly applied to the underlying data. The first and second points require a custom type with additional information and behaviour. The third point requires flexibility on the base type.
