[Python-Dev] Alternative implementation of string interning
Oren Tirosh
oren-py-d@hishome.net
Wed, 3 Jul 2002 00:52:11 -0400
On Tue, Jul 02, 2002 at 03:31:03PM -0400, Tim Peters wrote:
> I would have guessed you had a more vivid imagination <wink>. It's
> precisely because the id has been guaranteed that a program may not care to
> save a reference to an interned string. For example,
>
> """
> _ids = map(id, map(intern, "if then elif else".split()))
> TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)
> id2token = dict(zip(_ids, range(4)))
> del _ids
>
> def tokenvector(s):
>     return [id2token.get(id(intern(word)), TOKEN_NAME)
>             for word in s.split()]
>
> print tokenvector("if this is the example, then what's the question?")
> """
>
> This works reliably today to classify tokens. I'm not certain I'd care if
> it broke, but we have to consider that it hasn't been difficult to write
> code that would break.
Ironically, this code is actually slower than using the strings themselves as
keys (interned or not). But I get the point.
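
For comparison, here is a minimal sketch of the same classifier keyed on the
strings themselves. The name word2token is made up for this illustration; the
rest follows the quoted example. It skips both the intern() and the id() call
per word, which is presumably where the extra time goes, though I have not
included timings here.

```python
# Hypothetical variant of the quoted tokenizer: the dict is keyed on the
# keyword strings directly instead of on id(intern(word)).
TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)
word2token = dict(zip("if then elif else".split(), range(4)))

def tokenvector(s):
    # One dict lookup per word; anything that is not a keyword is a NAME.
    return [word2token.get(word, TOKEN_NAME) for word in s.split()]

print(tokenvector("if this is the example, then what's the question?"))
```

This version also does not depend on the id of an interned string staying
stable, so it would keep working if that guarantee went away.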
> > Now for something a bit more radical:
> >
> > Why not make interned strings a type? <type 'istr'> could be an
> > un-subclassable subclass of string. intern would just be an
> > alias for this type. No two istr instances are equal unless they are
> > identical. I guess PyString_CheckExact would need to be changed to
> > accept either String or InternedString.
>
> What would the point be? That is, instead of "why not?", why? As to "why
> not?", there's something about elevating what's basically an optimization
> hack to a type that makes me squirm.
Change the name from 'istr' to 'symbol' and add a mild case of language envy
and you'll see why ;-)
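
To make the idea concrete, here is a rough pure-Python sketch of what a
symbol-like type could look like. Everything in it (the Symbol class, its
_table registry, the name attribute) is invented for illustration. Unlike the
istr proposal it is not a subclass of string; it only shows the "equal only
if identical" behaviour.

```python
class Symbol(object):
    """Hypothetical symbol type: at most one instance per distinct name."""
    __slots__ = ("name",)
    _table = {}  # name -> the unique Symbol created for that name

    def __new__(cls, name):
        try:
            # Reuse the existing instance so identity doubles as equality.
            return cls._table[name]
        except KeyError:
            self = object.__new__(cls)
            self.name = name
            cls._table[name] = self
            return self

    def __repr__(self):
        return "Symbol(%r)" % (self.name,)

# __eq__ and __hash__ are deliberately left at the identity-based defaults.
a = Symbol("spam")
b = Symbol("spam")
print(a is b)
print(a == "spam")
```

Note that the registry in this sketch holds strong references, so a symbol
is never freed once created.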
Oren