
Interning is done using a flag instead of a pointer (3 bytes less). The ob_sinterned pointer was most of the time either NULL or pointing to the same object. Cases where it pointed to another object were rare and the code that was checking for this case was not effective.
Interned strings are no longer immortal. They die when their refcnt reaches 0 just like any other object. The reference from the interned dict will not keep them alive longer than necessary.
Can anyone explain why they were implemented with a pointer in the first place? Barry?
Oren

[Oren Tirosh, on http://python.org/sf/576101]
... Interned strings are no longer immortal. They die when their refcnt reaches 0 just like any other object.
This may be a problem. Code now can rely on that id(some_interned_string) stays the same across the life of a run.
... Can anyone explain why they were implemented with a pointer in the first place? Barry?
It will have to be Guido. He made a plausible case to me once about why the indirection is there, but it may be an optimization that's no longer important. At the time interned strings were introduced, extension modules had mountains of code of the form:
    /* at module init time, in one or more modules */
    static PyObject *spam_str = PyString_FromString("spam");

    /* in various module routines */
    PyObject_SetAttr(someobject, spam_str, user_supplied_value);
and PyObject_SetAttr() was changed to make spam_str what you called an "indirectly interned" string by magic. This was (or at least Guido thought it was <wink>) an important optimization at the time.
Extension modules written after interned strings were introduced can exploit interning directly, a la
    /* at module init time, in one or more modules */
    static PyObject *spam_str = PyString_InternFromString("spam");
and the core was reworked to do that too (note that this optimization wasn't directed at the core -- it could well be that core code never creates an indirectly interned string). I don't know how many extension modules still implicitly rely on indirect interning for a speed boost. Zope doesn't, and that's all that really matters <wink>.

On 1 Jul 2002 at 17:12, Tim Peters wrote:
... I don't know how many extension modules still implicitly rely on indirect interning for a speed boost.
I bet most extension authors have been completely ignorant of it, which makes the answer "most of them" <wink>.
-- Gordon http://www.mcmillan-inc.com/

[Gordon, on extension modules implicitly relying on indirect interning]
I bet most extension authors have been completely ignorant of it, which makes the answer "most of them" <wink>.
Could be! I don't know how much of a speed boost they get, though. While the magical interning is done for PyObject_SetAttr(), it's not done for the has-to-be-more-frequently-called PyObject_GetAttr(), as people call that with all sorts of garbage strings. For some reason interning is done for PyObject_GetAttrString(), although the caller of that can't profit from indirect interning (it takes a char*, not a PyObject*).
Like I said, maybe this all makes sense to Guido <0.9 wink>.
at-least-we're-not-fighting-over-what-the-comments-mean-ly y'rs - tim

On Mon, Jul 01, 2002 at 05:12:31PM -0400, Tim Peters wrote:
[Oren Tirosh, on http://python.org/sf/576101]
... Interned strings are no longer immortal. They die when their refcnt reaches 0 just like any other object.
This may be a problem. Code now can rely on that id(some_interned_string) stays the same across the life of a run.
This requires code that stores the id of an object without keeping a reference to the actual object. It also requires that no other piece of Python or C code keep a reference to that object and yet for its identity to be somehow still significant. I find that extremely hard to imagine.
Can anyone explain why they were implemented with a pointer in the first place? Barry?
...
and PyObject_SetAttr() was changed to make spam_str what you called an "indirectly interned" string by magic. This was (or at least Guido thought it was <wink>) an important optimization at the time.
I see. As far as I can tell, it isn't any more.
Now for something a bit more radical:
Why not make interned strings a type? <type 'istr'> could be an un-subclassable subclass of string. intern would just be an alias for this type. No two istr instances are equal unless they are identical. I guess PyString_CheckExact would need to be changed to accept either String or InternedString.
Oren

[Tim]
This may be a problem. Code now can rely on that id(some_interned_string) stays the same across the life of a run.
[Oren Tirosh]
This requires code that stores the id of an object without keeping a reference to the actual object. It also requires that no other piece of Python or C code keep a reference to that object and yet for its identity to be somehow still significant. I find that extremely hard to imagine.
I would have guessed you had a more vivid imagination <wink>. It's precisely because the id has been guaranteed that a program may not care to save a reference to an interned string. For example,
"""
_ids = map(id, map(intern, "if then elif else".split()))
TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)
id2token = dict(zip(_ids, range(4)))
del _ids

def tokenvector(s):
    return [id2token.get(id(intern(word)), TOKEN_NAME) for word in s.split()]

print tokenvector("if this is the example, then what's the question?")
"""
This works reliably today to classify tokens. I'm not certain I'd care if it broke, but we have to consider that it hasn't been difficult to write code that would break.
This was (or at least Guido thought it was <wink>) an important optimization at the time.
I see. As far as I can tell, it isn't any more.
Which extension modules have you investigated? The claim is too vague to carry weight. Zope's C code uses the interned-string C API directly, so it doesn't matter to Zope code. That's all I've looked at. Making a case that the optimization is no longer important requires investigating code.
Now for something a bit more radical:
Why not make interned strings a type? <type 'istr'> could be an un-subclassable subclass of string. intern would just be an alias for this type. No two istr instances are equal unless they are identical. I guess PyString_CheckExact would need to be changed to accept either String or InternedString.
What would the point be? That is, instead of "why not?", why? As to "why not?", there's something about elevating what's basically an optimization hack to a type that makes me squirm.

On Tue, Jul 02, 2002 at 03:31:03PM -0400, Tim Peters wrote:
I would have guessed you had a more vivid imagination <wink>. It's precisely because the id has been guaranteed that a program may not care to save a reference to an interned string. For example,
"""
_ids = map(id, map(intern, "if then elif else".split()))
TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)
id2token = dict(zip(_ids, range(4)))
del _ids

def tokenvector(s):
    return [id2token.get(id(intern(word)), TOKEN_NAME) for word in s.split()]

print tokenvector("if this is the example, then what's the question?")
"""
This works reliably today to classify tokens. I'm not certain I'd care if it broke, but we have to consider that it hasn't been difficult to write code that would break.
Ironically, this code is actually slower than using the strings themselves as keys (interned or not). But I get the point.
Now for something a bit more radical:
Why not make interned strings a type? <type 'istr'> could be an un-subclassable subclass of string. intern would just be an alias for this type. No two istr instances are equal unless they are identical. I guess PyString_CheckExact would need to be changed to accept either String or InternedString.
What would the point be? That is, instead of "why not?", why? As to "why not?", there's something about elevating what's basically an optimization hack to a type that makes me squirm.
Change the name from 'istr' to 'symbol' and add a mild case of language envy and you'll see why ;-)
Oren
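
Oren's remark that the strings themselves work as keys can be illustrated with a minimal sketch. It assumes modern Python 3 rather than the Python 2 of the thread, and the name `word2token` is invented here; it is a rewrite for illustration, not code anyone posted.

```python
# Sketch only: a Python 3 rewrite of the token classifier that keys on the
# strings themselves, so it never depends on id() staying stable.
TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)

# Map each keyword string directly to its token code.
word2token = dict(zip("if then elif else".split(), range(4)))


def tokenvector(s):
    """Classify each whitespace-separated word; unknown words are names."""
    return [word2token.get(word, TOKEN_NAME) for word in s.split()]


print(tokenvector("if this is the example, then what's the question?"))
```

Because the dict holds the keyword strings, nothing here depends on whether interned strings are immortal.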

Oren Tirosh oren-py-d@hishome.net:
Tim Peters:
What would the point be? That is, instead of "why not?", why? As to "why not?", there's something about elevating what's basically an optimization hack to a type that makes me squirm.
Change the name from 'istr' to 'symbol' and add a mild case of language envy and you'll see why ;-)
But in Lisp, symbols and strings really are completely separate types. That's not the case in Python, and you still haven't really given a reason why they should be.
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+

Oren Tirosh wrote:
Why not make interned strings a type? <type 'istr'> could be an un-subclassable subclass of string. intern would just be an alias for this type. No two istr instances are equal unless they are identical. I guess PyString_CheckExact would need to be changed to accept either String or InternedString.
The possibility of people starting to write code that depended on whether strings were 'string' or 'istr', and all the breakage and incompatibility that would result, seems much too ugly to contemplate. Pass an 'istr' into a routine that expects strings, and it would appear to be a string right up until someone tried to == it, whereupon all hell would break loose.
The acid test for subtyping is substitutability: type 'istr' would not fulfill the contract of 'string', and neither would 'string' fulfill the contract of 'istr'. Therefore, if you really wanted to do this, your new type (let's call it 'symbol') would have to be completely independent from both strings *and* interned strings. There's no subclass relationship.
-- ?!ng

On Tue, Jul 02, 2002 at 11:07:14PM -0700, Ka-Ping Yee wrote:
Oren Tirosh wrote:
Why not make interned strings a type? <type 'istr'> could be an un-subclassable subclass of string. intern would just be an alias for this type. No two istr instances are equal unless they are identical. I guess PyString_CheckExact would need to be changed to accept either String or InternedString.
The possibility of people starting to write code that depended on whether strings were 'string' or 'istr', and all the breakage and incompatibility that would result, seems much too ugly to contemplate. Pass an 'istr' into a routine that expects strings, and it would appear to be a string right up until someone tried to == it, whereupon all hell would break loose.
I don't understand your assumptions. What kind of hell? Are you assuming that == would be equivalent to 'is' for istrs? The == operator should work exactly the same, just possibly a little faster when comparing two istrs.
The acid test for subtyping is substitutability: type 'istr' would not fulfill the contract of 'string', and neither would 'string' fulfill the contract of 'istr'.
Can you be more specific? As I see it an istr would be completely compatible with str, with the exception of being non-subclassable.
It has the additional property that
(type(s) is istr and type(t) is istr and s == t) implies (s is t).
But that doesn't break anything.
Oren
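
The property Oren states (equal interned strings are the same object) already holds for strings passed through the existing interning call. A small sketch, assuming Python 3, where the built-in `intern` of the thread is spelled `sys.intern`; the helper name is made up for this example.

```python
import sys


def same_object_after_interning(a, b):
    """Return True if interning a and b yields one and the same object."""
    return sys.intern(a) is sys.intern(b)


# Build two equal strings at run time from separate join() calls.
parts = ["sp", "am"]
s = "".join(parts)
t = "".join(parts)

print(s == t, same_object_after_interning(s, t))
```

The == operator is untouched by any of this: it still compares by value, which is the behaviour Oren says an istr would keep.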

On Wed, 3 Jul 2002, Oren Tirosh wrote:
On Tue, Jul 02, 2002 at 11:07:14PM -0700, Ka-Ping Yee wrote:
Oren Tirosh wrote:
No two istr instances are equal unless they are identical. I guess PyString_CheckExact would need to be changed to accept either String or InternedString.
[...]
Pass an 'istr' into a routine that expects strings, and it would appear to be a string right up until someone tried to == it, whereupon all hell would break loose.
I don't understand your assumptions.
I just went on what you wrote: "No two istr instances are equal unless they are identical." I read that to mean that == would be implemented with pointer comparison, which would break contracts the way i described. I see now that is not what you meant.
It appears that what you are proposing is what interned string comparison already does (since == checks for pointer equality first). So, the only observable effect of the change would be to break all code that tests for type(s) == str.
-- ?!ng

On Wed, Jul 03, 2002 at 02:33:24AM -0700, Ka-Ping Yee wrote:
I just went on what you wrote: "No two istr instances are equal unless they are identical." I read that to mean that == would be implemented with pointer comparison, which would break contracts the way i described. I see now that is not what you meant.
If all Dutchmen like Monty Python it doesn't mean that anyone who likes Monty Python is a Dutchman.
It appears that what you are proposing is what interned string comparison already does (since == checks for pointer equality first).
But INequality checking may still require strcmp. Inverse logic again.
So, the only observable effect of the change would be to break all code that tests for type(s) == str.
Yes, that's certainly a problem.
This thought experiment is part of a strange fantasy I have that Python might one day use only interned strings to represent names. There are relatively few places where a string may be converted to a name (getattr, hasattr, etc) and these could be interned at the interface if interned strings are not immortal. I expect that nothing will ever come out of this, but it's fun to think about it anyway...
Oren

On Wed, 3 Jul 2002, Oren Tirosh wrote:
It appears that what you are proposing is what interned string comparison already does (since == checks for pointer equality first).
But INequality checking may still require strcmp. Inverse logic again.
I never claimed it wouldn't. All i'm saying is that string comparison already does this: compare pointers, then if not equal, compare strings.
So, the only observable effect of the change would be to break all code that tests for type(s) == str.
Yes, that's certainly a problem.
But you haven't responded to my point. Would there be *any* effect other than breakage?
-- ?!ng
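
The comparison order Ping describes ("compare pointers, then if not equal, compare strings") can be modelled in a few lines. This is a Python 3 model written for illustration, not CPython's source; the function name and the early length check are assumptions of the sketch.

```python
def str_equal(a, b):
    """Model of 'compare pointers first, then compare the contents'."""
    if a is b:
        # Identity check: the cheap pointer comparison.
        return True
    if len(a) != len(b):
        # Different lengths can never be equal (added for this sketch).
        return False
    # Stand-in for the character-by-character comparison.
    return all(x == y for x, y in zip(a, b))


print(str_equal("spam", "spam"), str_equal("spam", "span"))
```

Oren's inequality point shows up in the last two branches: when the identity check fails, a general comparison still has to look at the contents, whereas two strings known to be interned could be declared unequal at that point.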

On Wed, Jul 03, 2002 at 03:14:33AM -0700, Ka-Ping Yee wrote:
On Wed, 3 Jul 2002, Oren Tirosh wrote:
It appears that what you are proposing is what interned string comparison already does (since == checks for pointer equality first).
But INequality checking may still require strcmp. Inverse logic again.
I never claimed it wouldn't. All i'm saying is that string comparison already does this: compare pointers, then if not equal, compare strings.
So, the only observable effect of the change would be to break all code that tests for type(s) == str.
Yes, that's certainly a problem.
But you haven't responded to my point. Would there be *any* effect other than breakage?
The warm fuzzy feeling that you have a real symbol type :-)
Just for the record: I am not a LISP zealot.
Oren

The warm fuzzy feeling that you have a real symbol type :-)
Doesn't give me a warm fuzzy feeling at all. A symbol type is just another compiler implementation detail IMO. Strings are natural to designate identifiers.
--Guido van Rossum (home page: http://www.python.org/~guido/)

On Sat, Jul 13, 2002 at 09:34:19AM -0400, Guido van Rossum wrote:
The warm fuzzy feeling that you have a real symbol type :-)
Doesn't give me a warm fuzzy feeling at all. A symbol type is just another compiler implementation detail IMO. Strings are natural to designate identifiers.
Making interned strings a type was just idle speculation, don't take it too seriously...
Oren

"OT" == Oren Tirosh oren-py-d@hishome.net writes:
OT> This thought experiment is part of a strange fantasy I have that
OT> Python might one day use only interned strings to represent
OT> names. There are relatively few places where a string may be
OT> converted to a name (getattr, hasattr, etc) and these could be
OT> interned at the interface if interned strings are not
OT> immortal. I expect that nothing will ever come out of this, but
OT> it's fun to think about it anyway...
two responses:
What do you mean by "represent names"? Code objects already use interned strings for names. Did you have something else in mind?
You might have mentioned this thought experiment / strange fantasy at the outset of the thread <0.2 wink>. There was a lot of email thrashing on this subject, but none of it appears to have been necessary.
Jeremy

On Wed, Jul 03, 2002 at 02:17:12PM -0400, Jeremy Hylton wrote:
"OT" == Oren Tirosh oren-py-d@hishome.net writes:
OT> This thought experiment is part of a strange fantasy I have that
OT> Python might one day use only interned strings to represent
OT> names. There are relatively few places where a string may be
OT> converted to a name (getattr, hasattr, etc) and these could be
OT> interned at the interface if interned strings are not
OT> immortal. I expect that nothing will ever come out of this, but
OT> it's fun to think about it anyway...
two responses:
What do you mean by "represent names"? Code objects already use interned strings for names. Did you have something else in mind?
Not something else - just more of the same. Interned names in co_names tuples are a good start but there are tons of places where literal C-strings are used such as in descriptors. These names are converted to temporary Python strings on demand. My humble goal is for any name that has a predefined meaning in Python to appear exactly once in the executable and that instance will be in the form of a static preinitialized Python string object, not a C string literal.
Here's how it might work: to use the name 'foo' you just refer to the C name PYSYMfoo. During build a helper program scans all C sources for names starting with PYSYM and automatically generates a .c file where each of these names appears once as a pre-initialized string object and an .h file included by Python.h. On startup all these string objects are interned, of course.
So any name used from C is resolved by the linker to point to the interned single instance. Any name appearing unquoted in Python code is interned when it's compiled or loaded from the .pyc file. There are some cases where a string becomes a name such as the arguments to functions like getattr and hasattr. These would need to be interned before reaching the 100% interned core of the language. I guess this could be done by a new PyArgs_ParseTuple format char. This obviously requires interned strings to be non-immortal.
For example:
    if (strcmp(sname, "__class__") == 0)
becomes
    if (sname == PYSYM__class__)
This is a pretty trivial example but I have other ideas for optimizations and cleanups that this would enable. These might lead to significant improvements in code size and performance.
Well, that's my fantasy. There are still some "minor" problems like totally breaking the C API.
Oren
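
The build-time helper Oren describes (scan the C sources for names starting with PYSYM, then generate one definition per name) might look roughly like the sketch below. Only the PYSYM prefix comes from the message; the function names, the regular expression, and the `extern PyObject *` declaration form are guesses made for illustration.

```python
import re

# The PYSYM prefix is from the message above; the rest is hypothetical.
PYSYM_RE = re.compile(r"\bPYSYM([A-Za-z_][A-Za-z0-9_]*)")


def collect_symbols(sources):
    """Return the sorted, de-duplicated names used as PYSYM<name>."""
    names = set()
    for text in sources:
        names.update(PYSYM_RE.findall(text))
    return sorted(names)


def generate_header(names):
    """Emit one declaration per symbol for the generated .h file."""
    return "".join(f"extern PyObject *PYSYM{name};\n" for name in names)


# Stand-in for the contents of one C source file.
source = "if (sname == PYSYM__class__) return x; y = PYSYMfoo; z = PYSYMfoo;"
print(generate_header(collect_symbols([source])), end="")
```

Each name is emitted once no matter how often it is used, which is the "appear exactly once in the executable" goal; generating the pre-initialized string objects themselves is left out of the sketch.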

"KY" == Ka-Ping Yee ping@zesty.ca writes:
KY> It appears that what you are proposing is what interned string
KY> comparison already does (since == checks for pointer equality
KY> first). So, the only observable effect of the change would be
KY> to break all code that tests for type(s) == str.
Shouldn't those already be written as isinstance(s, str)? Maybe with StringType for str?
Even so, I'm not much in favor of adding more string types to the language. I think we should be /collapsing/ string types not proliferating them (i.e. removing the distinction between str and unicode -- Jython seems to get by just fine that way).
-Barry
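
The difference between the two tests Ping and Barry mention is easy to show. In this Python 3 sketch, `IStr` is a hypothetical stand-in for the proposed istr type (an ordinary subclass, since the real proposal was never implemented).

```python
class IStr(str):
    """Hypothetical stand-in for the proposed 'istr' subclass of str."""


def is_string_exact(s):
    # The exact-type test that Ping says would break.
    return type(s) == str


def is_string(s):
    # The isinstance() test Barry suggests; accepts str subclasses too.
    return isinstance(s, str)


s = IStr("spam")
print(is_string_exact(s), is_string(s), s == "spam")
```

Code written with the exact-type test would reject the subclass instance even though it still compares equal to the plain string.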

On Monday, July 1, 2002, at 11:12 , Tim Peters wrote:
[Oren Tirosh, on http://python.org/sf/576101]
... Interned strings are no longer immortal. They die when their refcnt reaches 0 just like any other object.
This may be a problem. Code now can rely on that id(some_interned_string) stays the same across the life of a run.
The macimport code relies on the ids remaining the same. But it is easy to fix (just add an incref). I'll also change it to use PyString_CheckInterned.
--
- Jack Jansen Jack.Jansen@oratrix.com http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -

On Tue, Jul 02, 2002 at 11:17:15AM +0200, Jack Jansen wrote:
On Monday, July 1, 2002, at 11:12 , Tim Peters wrote:
[Oren Tirosh, on http://python.org/sf/576101]
... Interned strings are no longer immortal. They die when their refcnt reaches 0 just like any other object.
This may be a problem. Code now can rely on that id(some_interned_string) stays the same across the life of a run.
The macimport code relies on the ids remaining the same. But it is easy to fix (just add an incref). I'll also change it to use PyString_CheckInterned.
No, an incref there would leak references. Nothing needs to be changed.
Any code with correct reference counting will not notice any difference with this patch. The only problem that could occur is if Python code uses the id function, stores the integer result but doesn't keep an actual reference to the string object and no other code does, either. Even this is not a problem yet unless the code also expects that if the same string is ever interned again it will get the same integer id and breaks if it doesn't. I can't believe anyone is stupid enough to do that. Using the id function this way is equivalent to an uncounted reference.
BTW, my patch already takes care of PyString_CheckInterned in macimport.c
Oren
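
Oren's description of a bare id as "an uncounted reference" suggests the obvious repair for code like the token classifier quoted earlier in the thread: hold a real reference to each interned string for as long as its id is in use. A minimal Python 3 sketch, assuming `sys.intern` as the modern spelling of `intern`; the `_KEYWORDS` list is an addition made for this example.

```python
import sys

TOKEN_IF, TOKEN_THEN, TOKEN_ELIF, TOKEN_ELSE, TOKEN_NAME = range(5)

# Keep the interned keyword objects referenced for the life of the module,
# so their ids cannot be reused by other objects.
_KEYWORDS = [sys.intern(word) for word in "if then elif else".split()]
id2token = {id(word): code for code, word in enumerate(_KEYWORDS)}


def tokenvector(s):
    """Classify words by the id of their interned form."""
    return [id2token.get(id(sys.intern(word)), TOKEN_NAME) for word in s.split()]


print(tokenvector("if this is the example, then what's the question?"))
```

With the references held, the lookup no longer depends on interned strings being immortal.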

Oren Tirosh wrote:
Even this is not a problem yet unless the code also expects that if the same string is ever interned again it will get the same integer id and breaks if it doesn't. I can't believe anyone is stupid enough to do that.
do what? trust the documentation?
    intern(string)
        Enter string in the table of ``interned'' strings and return the interned string - which is string itself or a copy. /.../ Interned strings are immortal (never get garbage collected).

</F>

On Tuesday, Jul 2, 2002, at 12:37 Europe/Amsterdam, Oren Tirosh wrote:
The macimport code relies on the ids remaining the same. But it is easy to fix (just add an incref). I'll also change it to use PyString_CheckInterned.
No, an incref there would leak references. Nothing needs to be changed.
Uhm... I'm confused: macimport stores a pointer to the object if it's interned (the object in question is one of the strings in sys.path). It didn't INCREF the object, and that wasn't needed up until now because interned objects can never go away. However, if they can go away I would think that storing a pointer would definitely call for an INCREF...

On Tue, Jul 02, 2002 at 03:25:20PM +0200, Jack Jansen wrote:
On Tuesday, Jul 2, 2002, at 12:37 Europe/Amsterdam, Oren Tirosh wrote:
The macimport code relies on the ids remaining the same. But it is easy to fix (just add an incref). I'll also change it to use PyString_CheckInterned.
No, an incref there would leak references. Nothing needs to be changed.
Uhm... I'm confused: macimport stores a pointer to the object if it's interned (the object in question is one of the strings in sys.path). It didn't INCREF the object, and that wasn't needed up until now because interned objects can never go away. However, if they can go away I would think that storing a pointer would definitely call for an INCREF...
Are you saying that this code is not following reference counting rules and got away with it only because interned strings are immortal?
I don't see how adding only an incref could be correct - there must be a corresponding decref somewhere.
Oren

On Tuesday, July 2, 2002, at 03:57 , Oren Tirosh wrote:
Uhm... I'm confused: macimport stores a pointer to the object if it's interned (the object in question is one of the strings in sys.path). It didn't INCREF the object, and that wasn't needed up until now because interned objects can never go away. However, if they can go away I would think that storing a pointer would definitely call for an INCREF...
Are you saying that this code is not following reference counting rules and got away with it only because interned strings are immortal?
I'm afraid so. Or, actually, "afraid so" sounds too apologetic:-): interned strings were specifically defined to be immortal.
I don't see how adding only an incref could be correct - there must be a corresponding decref somewhere.
No, there isn't, because this list of pointers is never cleared. Which was never needed, because they were borrowed references.
Again, it isn't rocket science to fix this: _PyImport_Fini() will need to call out to a new routine _PyMacImport_Fini() that DECREFs the stored pointers.
--
- Jack Jansen Jack.Jansen@oratrix.com http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -

On Tue, Jul 02, 2002 at 04:37:42PM +0200, Jack Jansen wrote:
On Tuesday, July 2, 2002, at 03:57 , Oren Tirosh wrote:
Uhm... I'm confused: macimport stores a pointer to the object if it's interned (the object in question is one of the strings in sys.path). It didn't INCREF the object, and that wasn't needed up until now because interned objects can never go away. However, if they can go away I would think that storing a pointer would definitely call for an INCREF...
Are you saying that this code is not following reference counting rules and got away with it only because interned strings are immortal?
I'm afraid so. Or, actually, "afraid so" sounds too apologetic:-): interned strings were specifically defined to be immortal.
I know it says so in the doc, but I always tended to look at it as an implementation limitation rather than a feature...
Oren

[Jack Jansen]
I'm afraid so. Or, actually, "afraid so" sounds too apologetic:-): interned strings were specifically defined to be immortal.
[Oren Tirosh]
I know it says so in the doc, but I always tended to look at it as an implementation limitation rather than a feature...
Me too: I always read it as a warning not to use interning "too much". However, you can see how far common sense goes once users get ahold of a thing <wink>.

participants (10)
- barry@zope.com
- Fredrik Lundh
- Gordon McMillan
- Greg Ewing
- Guido van Rossum
- Jack Jansen
- Jeremy Hylton
- Ka-Ping Yee
- Oren Tirosh
- Tim Peters