[Python-Dev] Alternative implementation of string interning

Oren Tirosh oren-py-d@hishome.net
Wed, 3 Jul 2002 00:52:11 -0400


On Tue, Jul 02, 2002 at 03:31:03PM -0400, Tim Peters wrote:
> I would have guessed you had a more vivid imagination <wink>.  It's
> precisely because the id has been guaranteed that a program may not care to
> save a reference to an interned string.  For example,
> 
> """
> _ids = map(id, map(intern, "if then elif else".split()))
> TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)
> id2token = dict(zip(_ids, range(4)))
> del _ids
> 
> def tokenvector(s):
>     return [id2token.get(id(intern(word)), TOKEN_NAME)
>             for word in s.split()]
> 
> print tokenvector("if this is the example, then what's the question?")
> """
> 
> This works reliably today to classify tokens.  I'm not certain I'd care if
> it broke, but we have to consider that it hasn't been difficult to write
> code that would break.

Ironically, this code is actually slower than using the strings themselves as 
keys (interned or not).  But I get the point.
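
For comparison, here is a minimal sketch of the same classifier keyed on the
strings themselves. It is written for present-day Python 3 rather than the
Python 2 of the quoted code, and the name word2token is made up for this
illustration. It skips the intern() and id() calls on every word and does a
single dict lookup instead, which is presumably where the difference in
speed comes from; no timings are claimed here.

```python
# Sketch only: classify tokens with the words themselves as dict keys.
# word2token is a hypothetical name, not something from the quoted code.
TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)

# Map each keyword string directly to its token code (0..3).
word2token = dict(zip("if then elif else".split(), range(4)))

def tokenvector(s):
    # One dict lookup per word; anything that is not a keyword is a NAME.
    return [word2token.get(word, TOKEN_NAME) for word in s.split()]

print(tokenvector("if this is the example, then what's the question?"))
# prints [0, 4, 4, 4, 4, 1, 4, 4, 4]
```

Unlike the id-based version, this does not depend on interned strings
staying alive or on their ids staying stable.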

> > Now for something a bit more radical:
> >
> > Why not make interned strings a type?  <type 'istr'> could be an
> > un-subclassable subclass of string.  intern would just be an
> > alias for this type.  No two istr instances are equal unless they are
> > identical.  I guess PyString_CheckExact would need to be changed to
> > accept either String or InternedString.
> 
> What would the point be?  That is, instead of "why not?", why?  As to "why
> not?", there's something about elevating what's basically an optimization
> hack to a type that makes me squirm.

Change the name from 'istr' to 'symbol' and add a mild case of language envy
and you'll see why ;-)
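
To make the idea concrete, here is a rough pure-Python sketch of what such a
type might look like. It is written for present-day Python 3, and everything
in it (the class body, the _table registry) is an assumption made for
illustration, not an existing or proposed CPython implementation. A real
version would live in C and would have to settle the lifetime question that
this sketch sidesteps by holding strong references forever.

```python
# Sketch only: a 'symbol' type as an un-subclassable subclass of str whose
# instances are unique per value. All names here are hypothetical.
class symbol(str):
    __slots__ = ()
    _table = {}  # value -> the one symbol instance (strong references)

    def __new__(cls, value):
        value = str(value)
        try:
            # Reuse the existing instance so equal symbols are identical.
            return cls._table[value]
        except KeyError:
            self = super().__new__(cls, value)
            cls._table[value] = self
            return self

    def __init_subclass__(cls, **kwargs):
        # Refuse subclassing, as suggested for 'istr'.
        raise TypeError("symbol cannot be subclassed")


a = symbol("spam")
b = symbol("".join(["sp", "am"]))  # built at run time, same value
print(a is b)       # True: one instance per value
print(a == "spam")  # True: still compares equal to a plain str
```

Because there is only one instance per value, two symbols are equal exactly
when they are identical, so code that handles only symbols could compare
with "is". Comparison against plain strings still goes by value in this
sketch.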

	Oren