[Python-Dev] Re: Alternative implementation of string interning
Oren Tirosh
oren-py-d@hishome.net
Wed, 3 Jul 2002 15:21:34 -0400
On Wed, Jul 03, 2002 at 02:17:12PM -0400, Jeremy Hylton wrote:
> >>>>> "OT" == Oren Tirosh <oren-py-d@hishome.net> writes:
>
> OT> This thought experiment is part of a strange fantasy I have that
> OT> Python might one day use only interned strings to represent
> OT> names. There are relatively few places where a string may be
> OT> converted to a name (getattr, hasattr, etc) and these could be
> OT> interned at the interface if interned strings are not
> OT> immortal. I expect that nothing will ever come out of this, but
> OT> it's fun to think about it anyway...
>
> two responses:
>
> What do you mean by "represent names"? Code objects already use
> interned strings for names. Did you have something else in mind?
Not something else - just more of the same. Interned names in co_names
tuples are a good start but there are tons of places where literal C-strings
are used such as in descriptors. These names are converted to temporary
Python strings on demand. My humble goal is for any name that has a
predefined meaning in Python to appear exactly once in the executable and
that instance will be in the form of a static preinitialized Python string
object, not a C string literal.
Here's how it might work: to use the name 'foo' you just refer to the C
name PYSYMfoo. During build a helper program scans all C sources for names
starting with PYSYM and automatically generates a .c file where each of
these names appears once as a pre-initialized string object and an .h file
included by Python.h. On startup all these string objects are interned, of
course.
So any name used from C is resolved by the linker to point to the interned
single instance. Any name appearing unquoted in Python code is interned
when when it's compiled or loaded from the .pyc file. There are some cases
where a string becomes a name such as the arguments to functions like
getattr and hasattr. These would need to be interned before reaching the
100% interned core of the language. I guess this could be done by a new
PyArgs_ParseTuple format char. This obviously requires interned strings to
be non-immortal.
For example:
if (strcmp(sname, "__class__") == 0)
becomes
if (if sname == PYSYM__class__)
This is a pretty trivial example but I have other ideas for optimizations
and cleanups that this would enable. These might lead to significant
improvements in code size and performance.
Well, that's my fantasy. There are still some "minor" problems like totally
breaking the C API.
Oren