[Python-Dev] Auto-str and auto-unicode in join

Fri Aug 27 01:53:19 CEST 2004

Tim Peters wrote:
> I needed a break from intractable database problems, and am almost
> done with PyUnicode_Join().  I'm not doing auto-unicode(), though, so
> there will still be plenty of fun left for Nick!

I actually have that mostly working (though against a slightly out-of-date CVS checkout).

Joining a sequence of 10 integers with auto-str seems to take about 60%
of the time of joining a str(x) list comprehension over that same
sequence (and the PySequence_Fast call means that a generator is slightly
slower than a list comp!). For a sequence that mixes strings and
non-strings, the gains could only increase.
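
For anyone wanting to reproduce the baseline side of that comparison, a
minimal timing sketch follows. It only measures the two explicit str(x)
variants (list comprehension versus generator), because an unpatched
str.join raises TypeError on non-strings. The function names and the
default iteration count are arbitrary choices for illustration, not
taken from the patch.

```python
import timeit

DATA = list(range(10))  # the "sequence of 10 integers" case


def join_list_comp():
    # Baseline: build a list of strings first, then join it.
    return "".join([str(x) for x in DATA])


def join_generator():
    # Same result, but join has to turn the generator into a
    # sequence itself before it can size the result.
    return "".join(str(x) for x in DATA)


def compare(number=100000):
    # Returns elapsed seconds per variant; absolute numbers depend
    # on the machine and interpreter version.
    return {
        fn.__name__: timeit.timeit(fn, number=number)
        for fn in (join_list_comp, join_generator)
    }
```

Calling compare() returns a dict of timings; only the ratio between the
two entries is meaningful.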

However, there is one somewhat curly problem I'm not sure what to do about.

To avoid slowing down the common case of string join (a list of only 
strings) it is necessary to do the promotion to string in the type-check 
& size-calculation pass.

That's fine in the case of a list that consists of only strings and 
non-basestrings, or the case of a unicode separator - every 
non-basestring is converted using either PyObject_Str or PyObject_Unicode.

Where it gets weird is something like this:
     ''.join([an_int, a_unicode_str])
     u''.join([an_int, a_unicode_str])

In the first case, the int will first be converted to a string via 
PyObject_Str, and then that string representation is what will get 
converted to Unicode after the detection of the unicode string causes 
the join to be handed over to Unicode join.

In the latter case, the int is converted directly to Unicode.

So my question would be: is it reasonable to expect that
PyObject_Unicode(PyObject_Str(some_object)) gives the same answer as
PyObject_Unicode(some_object)?
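
To make the question concrete, here is a minimal sketch written as plain
Python rather than C. The helpers to_str and to_unicode are hypothetical
stand-ins for PyObject_Str and PyObject_Unicode, and the Temperature
class is invented for illustration. The sketch assumes that the Unicode
conversion prefers an object's __unicode__ method when one is defined;
that lookup is emulated by hand here, so this is a model of the
behaviour, not the real C API.

```python
def to_str(obj):
    # Stand-in for PyObject_Str: always goes through __str__.
    return str(obj)


def to_unicode(obj):
    # Stand-in for PyObject_Unicode (assumption: a __unicode__ method,
    # if the type defines one, wins over __str__).
    method = getattr(type(obj), "__unicode__", None)
    if method is not None:
        return method(obj)
    return str(obj)


class Temperature:
    # Hypothetical type whose two conversions disagree.
    def __init__(self, degrees):
        self.degrees = degrees

    def __str__(self):
        return "%d deg" % self.degrees

    def __unicode__(self):
        return "%d\u00b0" % self.degrees


t = Temperature(21)
via_str_first = to_unicode(to_str(t))  # the ''.join(...) path
direct = to_unicode(t)                 # the u''.join(...) path
```

For an int the two paths agree, but for a type like Temperature they
would not, which is the case a 'pristine' copy of the sequence protects
against.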

If not, then the string join would have to do something whereby it kept 
a 'pristine' version of the sequence around to hand over to the Unicode 
join.

My first attempt at implementing this feature had that property, but 
also had the effect of introducing about a 1% slowdown of the standard 
sequence-of-strings case (it introduced an extra if statement to see if 
a 'stringisation' pass was required after the initial type checking and 
sizing pass). For sequences longer than 10 strings, I imagine the
relative slowdown would be much less.

Hmm... I think I see a way to implement this, while still avoiding
adding any code to the standard path through the function. It'd be 
slower for the case where an iterator is passed in, and we automatically 
invoke PyObject_Str but don't end up delegating to Unicode join, though, 
as it involves making a copy of the sequence that only gets used if the 
Unicode join is invoked. (If the original object is a real sequence, 
rather than an iterator, there is no extra overhead - we have to make 
the copy anyway, to avoid mutating the user's sequence).
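
The control flow described above can be modelled loosely in Python. This
is a sketch under stated assumptions, not the actual patch: bytes plays
the role of the 8-bit string type, str plays the role of unicode,
list(iterable) stands in for PySequence_Fast (which, unlike list(),
reuses a real list or tuple), and the function names are made up. The
ASCII encode/decode steps are a simplification that only holds for
values like integers.

```python
def unicode_join(sep, items):
    # Model of the Unicode join: every item is converted directly.
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, bytes):
            parts.append(item.decode("ascii"))
        else:
            parts.append(str(item))
    return sep.join(parts)


def string_join(sep, iterable):
    # Model of the 8-bit string join with auto-str and hand-off.
    pristine = list(iterable)  # untouched originals
    converted = None           # working copy, made only when needed
    for i, item in enumerate(pristine):
        if isinstance(item, bytes):
            continue  # standard path: no extra work per string
        if isinstance(item, str):
            # A unicode item was found: delegate, passing the
            # originals so nothing is stringised twice.
            return unicode_join(sep.decode("ascii"), pristine)
        if converted is None:
            converted = list(pristine)
        converted[i] = str(item).encode("ascii")
    return sep.join(pristine if converted is None else converted)
```

With this shape, a sequence of only strings never touches the copying or
conversion branches, and the caller's own list is never mutated.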

If people are definitely interested in this feature, I could probably 
put a patch together next week.

Regards,
Nick.