[Patches] [ python-Patches-576101 ] Alternative implementation of interning

noreply@sourceforge.net noreply@sourceforge.net
Wed, 14 Aug 2002 18:26:50 -0700


Patches item #576101, was opened at 2002-07-01 15:23
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=576101&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Oren Tirosh (orenti)
Assigned to: Guido van Rossum (gvanrossum)
Summary: Alternative implementation of interning

Initial Comment:
An interned string has a flag set indicating that it is 
interned instead of a pointer to the interned string. This 
pointer was almost always either NULL or pointing to the 
same object. The other cases were rare and ineffective 
as an optimization.  This saves an average of 3 bytes 
per string.

Interned strings are no longer immortal.  They are 
automatically destroyed when there are no more 
references to them except the global dictionary of 
interned strings.

New function (actually a macro) PyString_CheckInterned 
to check whether a string is interned.  There are no 
more references to ob_sinterned anywhere outside 
stringobject.c.


----------------------------------------------------------------------

>Comment By: Guido van Rossum (gvanrossum)
Date: 2002-08-14 21:26

Message:
Logged In: YES 
user_id=6380

> - why does this try to "fix" the problem of
> dangling interned strings? AFAICT: if there is a
> reference to an interned string at the time
> _Py_ReleaseInternedStrings is called, that
> reference is silently dropped, and a later
> DECREF will result in memory corruption. IOW: it
> should merely set the state of all strings to
> normal, and clear the dict.

Note that the *only* time when
_Py_ReleaseInternedStrings() can ever be called is
at program exit, just before you run a memory leak
detector.  There's no way Python can be
resurrected after _Py_ReleaseInternedStrings() has
run.

> - Replacing PyString_InternInPlace with
> PyString_Intern seems dangerous. AFAICT, the
> fragment
> 
> 	PyString_InternInPlace(&name);
> 	Py_DECREF(name);
> 	return PyString_AS_STRING(name);
> 
> from getclassname would break: Intern() would
> return the only reference to the interned string
> (assuming this is the first usage), and
> getclassname drops this reference, returning a
> pointer to deallocated memory. I'm not sure
> though why getclassname interns the result in
> the first place.

getclassname() is doing something very unsavory
here!  I expect that its API will have to be
changed to copy the name into a buffer provided by
the caller.

We'll have to scrutinize all calls for tricks like
this.

> Selectively replacing them might be a good idea,
> though. For intern(), I think an optional
> argument strongref needs to be provided (the
> interned dict essentially weak-references the
> strings). Perhaps the default even needs to be
> weakref.

So do you think there's a need for immortal
strings?  What is that need?


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-08-14 18:24

Message:
Logged In: YES 
user_id=21627

Some mutually unrelated comments:

- the GC_UnTrack call for interned is not need: GC won't be
able to explain the reference that stringobject.c holds.

- why does this try to "fix" the problem of dangling
interned strings? AFAICT: if there is a reference to an
interned string at the time _Py_ReleaseInternedStrings is
called, that reference is silently dropped, and a later
DECREF will result in memory corruption. IOW: it should
merely set the state of all strings to normal, and clear the
dict.

- Replacing PyString_InternInPlace with PyString_Intern
seems dangerous. AFAICT, the fragment

	PyString_InternInPlace(&name);
	Py_DECREF(name);
	return PyString_AS_STRING(name);

from getclassname would break: Intern() would return the
only reference to the interned string (assuming this is the
first usage), and getclassname drops this reference,
returning a pointer to deallocated memory. I'm not sure
though why getclassname interns the result in the first place.

Selectively replacing them might be a good idea, though. For
intern(), I think an optional argument strongref needs to be
provided (the interned dict essentially weak-references the
strings). Perhaps the default even needs to be weakref.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-08-14 15:32

Message:
Logged In: YES 
user_id=6380

Question for all other reviewers. Why not replace all calls
to PyString_InternInplace() [which creates immortal strings]
with PyString_Intern(), making all (core) uses of interning
yield mortal strings?

E.g. the call in PyObject_SetAttr() will immortalize all
strings that are ever used as a key on a setattr operation;
in a long-lived server like Zope this is a concern, since
setattr keys are often user-provided data: an endless stream
of user-provided data will grow the interned dict indefinitely.

And having the builtin intern() always return an immortal
string also limits the usability of intern().

Most of the uses I could find of PyString_InternFromString()
hold on to a global ref to the object, making it immortal
anyway; but why should that function itself force the string
to be immortal?  (Especially since the exceptions are things
like PyObject_GetAttrString() and PyObject_SetItemString(),
which have the same concerns as PyObject_SetItem().


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-08-14 14:43

Message:
Logged In: YES 
user_id=6380

Here's an update of the patch for current CVS
(stringobject.h failed due to changes for
PyAPI_DATA/PyAPI_FUNC).

Could you add documentation to Doc/api/concrete.tex for
PyString_Intern() and explains how PyString_InternInPlace()
differs? (AFAICT it makes the interned string immortal -- I
suppose this is a B/W compat feature?)

The variables PYTHON_API_VERSION and PYTHON_API_STRING in
modsupport.h need to be updated -- many extensions use the
PyString_AS_STRING() macro which relies on the string object
format. If an extension compiled with the old code is linked
with the new interpreter, it will miss the first three bytes
of string objects -- or even store into memory it doesn't
own! (I've already added this to the patch I am uploading.)

(The test_gc failures were unrelated; Tim has fixed this
already in CVS.)

I'm tempted to say that except for the API doc issue this is
complete. Thanks!

----------------------------------------------------------------------

Comment By: Oren Tirosh (orenti)
Date: 2002-08-10 06:01

Message:
Logged In: YES 
user_id=562624

General cleanup. Better handling of immortal interned
strings for backward compatibility.

It passes regrtest but causes test_gc to leak 20 objects. 13
from test_finalizer_newclass and 7 from test_del_newclass,
but only if test_saveall is used. I've tried earlier
versions of this patch (which were ok at the time) and they
now create this leak too.

----------------------------------------------------------------------

Comment By: Oren Tirosh (orenti)
Date: 2002-07-06 12:08

Message:
Logged In: YES 
user_id=562624

Oops, forgot to actually attach the patch. Here it is.


----------------------------------------------------------------------

Comment By: Oren Tirosh (orenti)
Date: 2002-07-06 10:35

Message:
Logged In: YES 
user_id=562624

This implementation supports both mortal and immortal interned 
strings.

PyString_InternInPlace creates an immortal interned string for 
backward compatibility with code that relies on this behavior.

PyString_Intern creates a mortal interned string that is 
deallocated when its refcnt reaches 0.  Note that if the string 
value has been previously interned as immortal this will not 
make it mortal.

Most places in the interpreter were changed to PyString_Intern 
except those that may be required for compatibility.

This version of the patch, like the previous one, disables 
indirect interning. Is there any evidence that it is still an 
important optimization for some packages?

Make sure you rebuild everything after applying this patch 
because it modifies the size of string object headers.



----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-07-02 00:21

Message:
Logged In: YES 
user_id=80475

I like the way you consolidated all of the knowledge about 
interning into one place.

Consider adding an example to the docs of an effective use 
of interning for optimization.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=576101&group_id=5470