[Python-Dev] Auto-str and auto-unicode in join

Tim Peters tim.peters at gmail.com
Sun Aug 29 03:51:48 CEST 2004


If we were to do auto-str, it would be better to rewrite str.join() as
a 1-pass algorithm, using the kind of "double allocated space as
needed" gimmick unicode.join uses.  It would be less efficient if
auto-promotion to Unicode turns out to be required, but it's hard to
measure how little I care about that; it might be faster if auto-str
and Unicode promotion aren't needed (as only 1 pass would be needed).
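
For concreteness, here is a rough sketch of the "double allocated space as needed" gimmick. It is written in present-day Python rather than C, and it is not the code unicode.join uses. The name join_one_pass, the 100-byte starting size, and the use of bytes/bytearray to stand in for 8-bit strings are assumptions made for illustration.

```python
def join_one_pass(sep, seq, initial=100):
    """Join bytes items in one pass over seq (hypothetical helper)."""
    buf = bytearray(initial)   # preallocated space; only buf[:used] is meaningful
    used = 0

    def append(chunk):
        nonlocal used
        needed = used + len(chunk)
        if needed > len(buf):
            # Double the capacity until the chunk fits.
            new_size = max(len(buf), 1)
            while new_size < needed:
                new_size *= 2
            buf.extend(bytes(new_size - len(buf)))
        buf[used:needed] = chunk   # copy the chunk's guts into the free space
        used = needed

    first = True
    for item in seq:
        if not first:
            append(sep)
        first = False
        append(item)

    # Cut the buffer back to the space actually used.
    return bytes(buf[:used])
```

Because the total length is never computed up front, seq is walked only once, so this also works when seq is a generator.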

auto-str couldn't really *mean* string.join(map(str, seq)) either. 
The problem with the latter is that if a seq element x is a unicode
instance, str(x) will convert it into an encoded (8-bit) str, which
would not be backward compatible.  So the logic would be more (in
outline):

class string:
    def join(self, seq):
        seq = PySequence_Fast(seq)
        if seq is NULL:
            return NULL

        if len(seq) == 0:
            return ""
        elif len(seq) == 1 and type(seq[0]) is str:
            return seq[0]

        allocate a string object with (say) 100 bytes of space
        let p point to the first free byte

        for x in seq:
            if type(x) is str:
                copy x's guts into p, getting more space if needed
            elif isinstance(x, unicode):
                return unicode.join(self, seq)
            else:
                x = PyObject_Str(x)
                if x is NULL:
                    return NULL
                copy x's guts into p, etc

            if not the last element:
                copy the separator's guts into p, etc

        cut p back to the space actually used
        return p's string object
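
The outline above can be modeled in runnable form, with heavy caveats. This sketch is in modern Python, where bytes plays the role of the 8-bit str and str plays the role of unicode. The names auto_join and text_join are hypothetical, list() stands in for PySequence_Fast, str(x).encode("ascii") is an assumed stand-in for PyObject_Str, and the fast-path distinction between exact str and its subclasses is dropped inside the loop.

```python
def text_join(sep, seq):
    """Hypothetical stand-in for unicode.join, applying the same auto-str rule."""
    def as_text(x):
        if isinstance(x, bytes):
            return x.decode("ascii")   # assumed default codec
        if isinstance(x, str):
            return x
        return str(x)
    return as_text(sep).join(as_text(x) for x in seq)


def auto_join(sep, seq):
    """Model of the outlined logic: bytes in, bytes out unless text shows up."""
    seq = list(seq)                    # stands in for PySequence_Fast
    if len(seq) == 0:
        return b""
    if len(seq) == 1 and type(seq[0]) is bytes:
        return seq[0]

    out = bytearray()                  # growable buffer
    last = len(seq) - 1
    for i, x in enumerate(seq):
        if isinstance(x, bytes):
            out += x
        elif isinstance(x, str):
            # A text element anywhere: drop the partial result and redo
            # the whole join on the text side.
            return text_join(sep, seq)
        else:
            out += str(x).encode("ascii")   # assumed model of PyObject_Str
        if i != last:
            out += sep
    return bytes(out)
```

With this model, auto_join(b", ", [b"a", 1, 2.5]) stays 8-bit, while a single text element such as in auto_join(b", ", [b"a", "é", 3]) promotes the whole result to text.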

Note a peculiarity:  if x is neither str nor unicode, but has a
__str__ or __repr__ method that returns a unicode object,
PyObject_Str() will convert that into an 8-bit str.  That may be
surprising.  It would be ugly to duplicate most of the logic from
PyObject_Unicode() to try to guess whether there's "a natural" Unicode
spelling of x.  I think I'd rather say "tough luck -- use unicode.join
if that's what you want".
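
A small model of that peculiarity, again in modern Python and again only a sketch: the Label class and to_8bit helper are hypothetical, and encoding with "ascii" is an assumption standing in for whatever default codec PyObject_Str would apply to a unicode result.

```python
class Label:
    """Neither an 8-bit string nor text, but its __str__ returns text."""
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


def to_8bit(x):
    # Assumed model of PyObject_Str: whatever __str__ returns is forced
    # into an 8-bit string using the default (here ASCII) codec.
    return str(x).encode("ascii")


plain = to_8bit(Label("cafe"))         # pure ASCII survives the conversion

try:
    to_8bit(Label("caf\u00e9"))        # non-ASCII text cannot be squeezed in
    failed = False
except UnicodeEncodeError:
    failed = True
```

Under these assumptions the ASCII-only label converts quietly, while the accented one raises, which is the sort of surprise that makes "use unicode.join" the simpler answer.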

