[Python-Dev] Re: Alternative implementation of string interning

Oren Tirosh oren-py-d@hishome.net
Wed, 3 Jul 2002 15:21:34 -0400


On Wed, Jul 03, 2002 at 02:17:12PM -0400, Jeremy Hylton wrote:
> >>>>> "OT" == Oren Tirosh <oren-py-d@hishome.net> writes:
> 
>   OT> This thought experiment is part of a strange fantasy I have that
>   OT> Python might one day use only interned strings to represent
>   OT> names. There are relatively few places where a string may be
>   OT> converted to a name (getattr, hasattr, etc) and these could be
>   OT> interned at the interface if interned strings are not
>   OT> immortal. I expect that nothing will ever come out of this, but
>   OT> it's fun to think about it anyway...
> 
> two responses:
> 
> What do you mean by "represent names"?  Code objects already use
> interned strings for names.  Did you have something else in mind?

Not something else - just more of the same.  Interned names in co_names 
tuples are a good start but there are tons of places where literal C-strings 
are used such as in descriptors. These names are converted to temporary
Python strings on demand.  My humble goal is for any name that has a 
predefined meaning in Python to appear exactly once in the executable and 
that instance will be in the form of a static preinitialized Python string 
object, not a C string literal.

Here's how it might work: to use the name 'foo' you just refer to the C
name PYSYMfoo.  During build a helper program scans all C sources for names 
starting with PYSYM and automatically generates a .c file where each of 
these names appears once as a pre-initialized string object and an .h file 
included by Python.h. On startup all these string objects are interned, of
course.

So any name used from C is resolved by the linker to point to the interned
single instance.  Any name appearing unquoted in Python code is interned 
when when it's compiled or loaded from the .pyc file.  There are some cases
where a string becomes a name such as the arguments to functions like 
getattr and hasattr. These would need to be interned before reaching the 
100% interned core of the language. I guess this could be done by a new
PyArgs_ParseTuple format char.  This obviously requires interned strings to
be non-immortal.

For example:

	if (strcmp(sname, "__class__") == 0) 
becomes
	if (if sname == PYSYM__class__)

This is a pretty trivial example but I have other ideas for optimizations
and cleanups that this would enable. These might lead to significant
improvements in code size and performance.

Well, that's my fantasy. There are still some "minor" problems like totally 
breaking the C API.

	Oren