[Python-Dev] Auto-str and auto-unicode in join
Nick Coghlan
ncoghlan at iinet.net.au
Fri Aug 27 01:53:19 CEST 2004
Tim Peters wrote:
> I needed a break from intractable database problems, and am almost
> done with PyUnicode_Join(). I'm not doing auto-unicode(), though, so
> there will still be plenty of fun left for Nick!
I actually got that mostly working (against a slightly out-of-date CVS
checkout, though). Joining a sequence of 10 integers with auto-str seems
to take about 60% of the time of joining a str(x) list comprehension
over that same sequence (and the PySequence_Fast call means that a
generator is slightly slower than a list comp!). For a sequence which
mixes strings and non-strings, the gains could only increase.
However, there is one somewhat curly problem I'm not sure what to do about.
To avoid slowing down the common case of string join (a list of only
strings) it is necessary to do the promotion to string in the type-check
& size-calculation pass.
That's fine in the case of a list that consists of only strings and
non-basestrings, or the case of a unicode separator - every
non-basestring is converted using either PyObject_Str or PyObject_Unicode.
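The idea of folding the promotion into the type-check and sizing pass can be modelled in pure Python. This is only a sketch of the shape of the algorithm, not the C patch: auto_str_join is a hypothetical name, str() stands in for PyObject_Str, and the model always copies its input, whereas PySequence_Fast hands back an existing list or tuple unchanged.

```python
def auto_str_join(sep, iterable):
    """Pure-Python model of a join that promotes non-strings while sizing."""
    # Stand-in for PySequence_Fast. This model always makes a private copy,
    # so the caller's sequence is never mutated by the promotion below.
    items = list(iterable)

    # Pass 1: type check, promotion and size calculation in one loop.
    total = 0
    for i, item in enumerate(items):
        if not isinstance(item, str):
            item = items[i] = str(item)  # stand-in for PyObject_Str
        total += len(item)
    total += len(sep) * max(len(items) - 1, 0)

    # Pass 2: every item is now a string, so the normal join applies.
    result = sep.join(items)
    assert len(result) == total  # the size computed in pass 1 was exact
    return result
```

A sequence that already contains only strings never takes the promotion branch, which is why the common case is not slowed down.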
Where it gets weird is something like this:
    ''.join([an_int, a_unicode_str])
    u''.join([an_int, a_unicode_str])
In the first case, the int will first be converted to a string via
PyObject_Str, and then that string representation is what will get
converted to Unicode after the detection of the unicode string causes
the join to be handed over to Unicode join.
In the latter case, the int is converted directly to Unicode.
So my question would be: is it reasonable to expect that
PyObject_Unicode(PyObject_Str(some_object)) gives the same answer as
PyObject_Unicode(some_object)?
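One way the two could differ is a type that defines both __str__ and __unicode__ with different results. The sketch below emulates that on a current Python, which has no separate unicode type: Temperature, model_object_str and model_object_unicode are hypothetical names, and the helpers only approximate what PyObject_Str and PyObject_Unicode do.

```python
class Temperature:
    """Hypothetical type whose plain-string and Unicode forms differ."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __str__(self):
        # ASCII-only rendering.
        return "%d degC" % self.degrees

    def __unicode__(self):
        # Richer rendering; only the model helper below ever calls this.
        return "%d \u00b0C" % self.degrees


def model_object_str(obj):
    # Approximates PyObject_Str.
    return str(obj)


def model_object_unicode(obj):
    # Approximates PyObject_Unicode: prefer __unicode__ when the type has one.
    method = getattr(type(obj), "__unicode__", None)
    if method is not None:
        return method(obj)
    return str(obj)


t = Temperature(25)
direct = model_object_unicode(t)                     # uses __unicode__
via_str = model_object_unicode(model_object_str(t))  # __str__ result only
print(direct == via_str)
```

For an int the two routes agree, so the question only matters for types that customise the Unicode conversion.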
If not, then the string join would have to do something whereby it kept
a 'pristine' version of the sequence around to hand over to the Unicode
join.
My first attempt at implementing this feature had that property, but
also had the effect of introducing about a 1% slowdown of the standard
sequence-of-strings case (it introduced an extra if statement to see if
a 'stringisation' pass was required after the initial type checking and
sizing pass). For longer sequences than 10 strings, I imagine the
relative slowdown would be much less.
Hmm. . . I think I see a way to implement this, while still avoiding
adding any code to the standard path through the function. It'd be
slower for the case where an iterator is passed in, and we automatically
invoke PyObject_Str but don't end up delegating to Unicode join, though,
as it involves making a copy of the sequence that only gets used if the
Unicode join is invoked. (If the original object is a real sequence,
rather than an iterator, there is no extra overhead - we have to make
the copy anyway, to avoid mutating the user's sequence).
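Roughly, the "pristine copy" approach could look like the following model. It is a sketch under several assumptions: UnicodeText is a made-up stand-in for the unicode type, Label is a made-up example class, and to_unicode, unicode_join and string_join are hypothetical helpers rather than the real C functions. The model also always builds the pristine list, while the real code would only pay for it when handed an iterator.

```python
class UnicodeText(str):
    """Stand-in for the unicode type (hypothetical, for this model only)."""


class Label:
    """Example type with different plain-string and Unicode renderings."""

    def __str__(self):
        return "str form"

    def __unicode__(self):
        return "unicode form"


def to_unicode(obj):
    # Approximates PyObject_Unicode: prefer __unicode__ when defined.
    method = getattr(type(obj), "__unicode__", None)
    return UnicodeText(method(obj) if method is not None else str(obj))


def unicode_join(sep, items):
    return UnicodeText(str.join(sep, [to_unicode(x) for x in items]))


def string_join(sep, iterable):
    pristine = list(iterable)  # untouched originals, used only on delegation
    converted = []
    for item in pristine:
        if isinstance(item, UnicodeText):
            # Delegate with the original objects, not their str() forms.
            return unicode_join(sep, pristine)
        converted.append(item if isinstance(item, str) else str(item))
    return sep.join(converted)
```

With this shape, string_join(", ", [Label(), UnicodeText("x")]) converts the Label directly to its Unicode form, matching what the Unicode join would have produced on its own.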
If people are definitely interested in this feature, I could probably
put a patch together next week.
Regards,
Nick.