[Python-Dev] Re: [Python-checkins] python/dist/src/Lib/test test_string.py, 1.25, 1.26
Walter Dörwald
walter at livinglogic.de
Thu Aug 26 21:54:34 CEST 2004
Tim Peters wrote:
> [Walter Dörwald]
>
>>I'm working on it; however, I discovered that unicode.join()
>>doesn't optimize this special case:
>>
>>s = "foo"
>>assert "".join([s]) is s
>>
>>u = u"foo"
>>assert u"".join([s]) is s
>>
>>The second assertion fails.
>
> Well, in that example it *has* to fail, because the input (s) wasn't a
> unicode string to begin with, but u"".join() must return a unicode
> string. Maybe you intended to say that
>
> assert u"".join([u]) is u
>
> fails
Argl, you're right.
> (which is also true today, but doesn't need to be true tomorrow).
I've removed the test today, so it won't fail tomorrow. ;)
>>I'd say that this test (joining a one-item sequence returns
>>the item itself) should be removed because it tests an
>>implementation detail.
>
> Nevertheless, it's an important pragmatic detail. We should never throw
> away a test just because rearrangement makes it less convenient.
So, should I put the test back in (in test_str.py)?
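For reference, a minimal sketch of what such an identity test might look like; the class and method names are my own illustration, not the actual layout of test_str.py or string_tests.py:

import unittest

class JoinIdentityTest(unittest.TestCase):
    def test_join_single_item_returns_item(self):
        s = "foo"
        # CPython's str.join fast path hands back the very same
        # object when joining a one-item list of an exact str.
        assert "".join([s]) is s

if __name__ == "__main__":
    unittest.main()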
>>I'm not sure whether the optimization should be added to
>>unicode.find().
>
> Believing you mean join(), yes.
Unfortunately, the implementations of str.join and unicode.join
look completely different: str.join does a PySequence_Fast() and
then tests whether the sequence length is 0 or 1, while unicode.join
iterates through the argument via PyObject_GetIter()/PyIter_Next().
Adding the optimization might result in a complete rewrite of
PyUnicode_Join().
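Roughly, the fast path that str.join already has (and unicode.join lacks) amounts to the following Python-level sketch; the function name and structure are illustrative only, not the actual C code:

def join_fast_path_sketch(sep, seq):
    # Rough Python rendering of str.join's special cases, for
    # illustration only; the real logic lives in C.
    items = list(seq)              # roughly what PySequence_Fast() yields
    if len(items) == 0:
        return type(sep)()         # empty sequence -> empty string
    if len(items) == 1 and type(items[0]) is type(sep):
        return items[0]            # single exact-type item -> return it as-is
    return sep.join(items)         # general case: size and copy everything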
> Doing common endcases efficiently in
> C code is an important quality-of-implementation concern, lest people
> need to add reams of optimization test-&-branch guesses in their own
> Python code. For example, the SpamBayes tokenizer has many passes
> that split input strings on magical separators of one kind or another,
> pasting the remaining pieces together again via string.join(). It's
> explicitly noted in the code that special-casing the snot out of
> "separator wasn't found" in Python is a lot slower than letting
> string.join(single_element_list) just return the list element, so that
> simple, uniform Python code works well in all cases. It's expected
> that *most* of these SB passes won't find the separator they're
> looking for, and it's important not to make endless copies of
> unboundedly large strings in the expected case. The more heavily used
> unicode strings become, the more important that they treat users
> kindly in such cases too.
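To make that pattern concrete, an invented pass in the spirit of what Tim describes (not actual SpamBayes code) could look like this; when the separator isn't found, split() returns a one-element list, and the join fast path lets the original string pass through without a copy:

def strip_separator(text, sep="*"):
    # Invented illustration of a split-and-rejoin pass.
    pieces = text.split(sep)
    # Common case: sep is absent, pieces == [text], and an optimized
    # "".join(pieces) can return text itself instead of copying it.
    return "".join(pieces)

assert strip_separator("no separators here") == "no separators here"
assert strip_separator("a*b*c") == "abc"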
Seems like we have to rewrite PyUnicode_Join().
Bye,
Walter Dörwald