[Tutor] subtyping builtin type

spir denis.spir at gmail.com
Wed Jan 1 14:49:17 CET 2014


On 01/01/2014 01:26 AM, Steven D'Aprano wrote:
> On Tue, Dec 31, 2013 at 03:35:55PM +0100, spir wrote:
>> Hello,
>>
>> I don't remember exactly how to do that. As an example:
>>
>> class Source (str):
>>      __slots__ = ['i', 'n']
>>      def __init__ (self, string):
>>          self.i = 0                  # current matching index in source
>>          self.n = len(string)        # number of ucodes (Unicode code points)
>>          #~ str.__init__(self, string)
>
> The easiest way to do that is:
>
> class Source(str):
>      def __init__(self, *args, **kwargs):
>          self.i = 0
>          self.n = len(self)

Thank you Steven for your help.

Well, I don't really get everything you say below about possible alternatives, 
so I'll give a few more details. The only point of Source is to have a string 
that stores the current index, somewhat like a file (being read) on the 
filesystem. I take the opportunity to add a few features, but would do without 
Source altogether if it were not for 'i'.
The reason is: it is for a parsing library, or hand-made parsers. Every matching 
func, representing a pattern (or "rule"), advances in the source whenever the 
match is ok, right? Thus in addition to returning the form (of what was 
matched), they must return the new match index:
	return (form, i)
Symmetrically, every match func using another (meaning nearly all) receives this 
pair. (Less annoyingly, every match func also takes i as input, in addition to 
the src str.) (There are also a handful of other annoying points, consequences 
of those ones.)

If I have a string that stores its index, all of this mess is gone. It makes for 
clean and simple interfaces everywhere. Also (one of the consequences) I can 
directly provide match funcs to the user, instead of having to wrap them inside 
a func whose only utility is to hide the additional index (in both input & output).
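
A minimal sketch of the two interfaces described above, to make the difference 
concrete. The names match_digits_plain and match_digits are hypothetical, and 
Source here is just the simple subclass Steven suggested, reduced to 'i'; this 
is not the actual code of the library.

```python
# Hypothetical illustration only: function names and the digit pattern are assumptions.

def match_digits_plain(src, i):
    """Index-passing style: takes (src, i), returns (form, new_i) or None."""
    j = i
    while j < len(src) and src[j].isdigit():
        j += 1
    if j == i:
        return None                 # no match
    return (src[i:j], j)            # caller must thread the new index through


class Source(str):
    """A str that also stores the current matching index."""
    def __init__(self, *args, **kwargs):
        self.i = 0                  # current matching index in source


def match_digits(src):
    """Stateful style: advances src.i on success and returns only the form."""
    j = src.i
    while j < len(src) and src[j].isdigit():
        j += 1
    if j == src.i:
        return None                 # no match, src.i left unchanged
    form = src[src.i:j]             # slicing a Source gives a plain str
    src.i = j
    return form
```

With the second style, a match func can be handed to the user as is, e.g. 
match_digits(Source("123abc")), with no wrapper to hide the index.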

> As a (premature) memory optimization, you can use __slots__ to reduce
> the amount of memory per instance.

You're right! (I did it in fact for 'Form' subtypes, representing match results 
which are constantly instantiated, possibly millions of times in a single parse; 
but on the way I did it to Source as well, which is stupid ;-)
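
For illustration, a sketch of what __slots__ does on a small result class. The 
Form below and its fields 'kind' and 'snippet' are assumptions made up for the 
example, not the real Form types.

```python
class Form:
    """Hypothetical match-result class; field names are illustrative only."""
    __slots__ = ('kind', 'snippet')     # no per-instance __dict__ is created

    def __init__(self, kind, snippet):
        self.kind = kind
        self.snippet = snippet


form = Form('digits', '123')
has_dict = hasattr(form, '__dict__')    # slotted instances carry no __dict__

try:
    form.extra = 1                      # attributes outside __slots__ are rejected
    rejected = False
except AttributeError:
    rejected = True
```

The saving only matters for classes with very many live instances; a parse has 
few Source objects, so slots bring little there.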

> But this (probably) is the wrong way
> to solve this problem. Your design makes Source a kind of string:
>
> issubclass(Source, str)
> => True
>
> I expect that it should not be. (Obviously I'm making some assumptions
> about the design here.)

Actually, it doesn't matter whether issubclass or isinstance is true. But it 
must be a subtype to use string methods (including magic ones like slicing), as 
you say below.

>  To decide whether you should use subclassing
> here, ask yourself a few questions:
>
> * Does it make sense to call string methods on Source objects? In
>    Python 3.3, there are over 40 public string methods. If *just one*
>    of them makes no sense for a Source object, then Source should not
>    be a subclass of str.
>    e.g. source.isnumeric(), source.isidentifier()

Do you really mean "If *just one* of them makes no sense for a Source object, 
then Source should not be a subclass of str." ? Or should I understand "If *only 
one* of them does make sense for a Source object, then Source should not be a 
subclass of str." ?
Also, why? Or rather, why not make it a subtype if I only use one method?

Actually, a handful of them are intensely used (indexing, slicing, the series of 
is* [eg isalnum], a few more as the project moves on). This is more than enough 
for me to make it a subtype.
Also, it fits semantically (conceptually): a src is a str that just happens to 
store a current index.

> * Do you expect to pass Source objects to arbitrary functions which
>    expect strings, and have the result be meaningful?

No, apart from string methods themselves. It's all internal to the lib.

> * Does it make sense for Source methods to return plain strings?
>    source.upper() returns a str, not a Source object.

Doesn't matter (it's parsing). The resulting Forms, when they hold snippets, 
hold plain strings, not Sources, thus all is fine.
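
A quick sketch of that behaviour, assuming the simple subclass shape from 
Steven's first suggestion: inherited string methods and slicing work on a 
Source, and what they return is a plain str.

```python
class Source(str):
    def __init__(self, *args, **kwargs):
        self.i = 0                      # current matching index in source


src = Source("abc def")

snippet = src[0:3]                      # slicing is inherited from str
upper = src.upper()                     # so is every public str method

snippet_is_plain = type(snippet) is str   # result is a plain str, not a Source
upper_is_plain = type(upper) is str
char_ok = src[src.i].isalnum()            # is* tests work on indexed characters
```

So snippets stored in result Forms never carry an index with them.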

> * Is a Source thing a kind of string? If so, what's the difference
>    between a Source and a str? Why not just use a str?

see above

>    If all you want is to decorate a string with a couple of extra
>    pieces of information, then a limitation of Python is that you
>    can only do so by subclassing.

That's it. But I don't know of any other solution in other langs, apart from 
composition, which in my view is clearly inferior:
* it does not fit semantics (conception)
* it's annoying syntactically (constant attribute access)

> * Or does a Source thing *include* a string as a component part of
>    it? If that is the case -- and I think it is -- then composition
>    is the right approach.

No, a source is conceptually like a string, not a kind of composite object with 
a string among other fields. (Again, think of a file.)

> The difference between has-a and is-a relationships are critical. I
> expect that the right relationship should be:
>
>      a Source object has a string
>
> rather than "is a string". That makes composition a better design than
> inheritance. Here's a lightweight mutable solution, where all three
> attributes are public and free to be varied after initialisation:

No, see above.

> class Source:
>      def __init__(self, string, i=0, n=None):
>          if n is None:
>              n = len(string)
>          self.i = i
>          self.n = n
>          self.string = string

Wrong solution for my case.

> An immutable solution is nearly as easy:
>
> from collections import namedtuple
>
> class Source(namedtuple("Source", "string i n")):
>      def __new__(cls, string, i=0, n=None):
>          if n is None:
>              n = len(string)
>          return super(Source, cls).__new__(cls, string, i, n)

An immutable version is fine. But what does this version bring me? A Source's 
code-string is immutable already, while 'i' does change.

> Here's a version which makes the string attribute immutable, and the i
> and n attributes mutable:
>
> class Source:
>      def __init__(self, string, i=0, n=None):
>          if n is None:
>              n = len(string)
>          self.i = i
>          self.n = n
>          self._string = string
>      @property
>      def string(self):
>          return self._string

Again, what is better here than a plain subtyping of type 'str'? (And I dislike 
the principle of properties; I want to know whether it's a func call or plain 
attr access, on the user side. Bertrand Meyer's "uniform access principle" for 
Eiffel is what I dislike most in this lang ;-) [which has otherwise much to offer].)

Seems I have more to learn ;-) great!

Side-note: after reflection, I guess I'll get rid of 'n'. 'n' is used each time 
I need, in match funcs, to check for end-of-source (meaning, in every low-level, 
lexical pattern, the ones that actually "eat" portions of source). I defined 'n' 
to have it at hand, but now I wonder whether it's not in fact less efficient 
than just writing len(src) instead of src.n, everywhere. (Since indeed Python 
strings hold their length: it's certainly not an actual func call! Python lies ;-)
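
A minimal sketch of the end-of-source check without 'n', again assuming the 
simple subclass shape; at_end is a hypothetical helper name. In CPython a str 
stores its length, so len(src) reads that stored value rather than counting 
characters, although it is still a builtin call at the bytecode level.

```python
class Source(str):
    def __init__(self, *args, **kwargs):
        self.i = 0                      # current matching index in source

    def at_end(self):
        """True once the index has reached the end of the source (hypothetical helper)."""
        return self.i >= len(self)      # len() on a str is O(1) in CPython


src = Source("ab")
start_at_end = src.at_end()             # index 0 of a 2-char source
src.i = len(src)
final_at_end = src.at_end()             # index now sits past the last character
```

Whether src.n or len(src) is faster in practice would need measuring, for 
instance with the timeit module.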

Denis

