[Python-Dev] Re: [Python-checkins] python/dist/src/Lib/test test_string.py, 1.25, 1.26
walter at livinglogic.de
Thu Aug 26 21:54:34 CEST 2004
Tim Peters wrote:
> [Walter Dörwald]
>>I'm working on it, however I discovered that unicode.join()
>>doesn't optimize this special case:
>>s = "foo"
>>assert "".join([s]) is s
>>u = u"foo"
>>assert u"".join([s]) is s
>>The second assertion fails.
> Well, in that example it *has* to fail, because the input (s) wasn't a
> unicode string to begin with, but u"".join() must return a unicode
> string. Maybe you intended to say that
> assert u"".join([u]) is u
Argl, you're right.
> (which is also true today, but doesn't need to be true tomorrow).
I've removed the test today, so it won't fail tomorrow. ;)
>>I'd say that this test (joining a one item sequence returns
>>the item itself) should be removed because it tests an
> Nevertheless, it's an important pragmatic detail. We should never throw
> away a test just because rearrangement makes a test less convenient.
So, should I put the test back in (in test_str.py)?
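
One way to keep such a check without promising it as language behaviour is to separate the guaranteed part (the result equals the item) from the implementation detail (the result is the very same object). A minimal sketch for a current Python 3 interpreter, where str is the unicode type. The class and function names are made up for illustration, and this is not the test that was removed:

```python
import sys
import unittest


class JoinSingleItemTest(unittest.TestCase):
    """Hypothetical test case, not the one from test_string.py."""

    def test_result_is_equal(self):
        # Guaranteed everywhere: joining one item yields an equal string.
        s = "foo"
        self.assertEqual("".join([s]), s)

    @unittest.skipUnless(sys.implementation.name == "cpython",
                         "identity is a CPython implementation detail")
    def test_result_is_same_object(self):
        # Not guaranteed by the language: checks that no copy is made.
        s = "foo"
        self.assertIs("".join([s]), s)


def run_join_tests():
    """Run both tests and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(JoinSingleItemTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_join_tests() runs both checks; on interpreters other than CPython the identity check is skipped rather than failed.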
>>I'm not sure whether the optimization should be added to
> Believing you mean join(), yes.
Unfortunately the implementations of str.join and unicode.join
look completely different. str.join does a PySequence_Fast() and
then tests whether the sequence length is 0 or 1, unicode.join
iterates through the argument via PyObject_GetIter/PyIter_Next.
Adding the optimization might result in a complete rewrite of
> Doing common endcases efficiently in
> C code is an important quality-of-implementation concern, lest people
> need to add reams of optimization test-&-branch guesses in their own
> Python code. For example, the SpamBayes tokenizer has many passes
> that split input strings on magical separators of one kind or another,
> pasting the remaining pieces together again via string.join(). It's
> explicitly noted in the code that special-casing the snot out of
> "separator wasn't found" in Python is a lot slower than letting
> string.join(single_element_list) just return the list element, so that
> simple, uniform Python code works well in all cases. It's expected
> that *most* of these SB passes won't find the separator they're
> looking for, and it's important not to make endless copies of
> unboundedly large strings in the expected case. The more heavily used
> unicode strings become, the more important that they treat users
> kindly in such cases too.
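
The SpamBayes point can be illustrated with a small sketch. This is not the actual tokenizer code: the function names and the separator are made up for illustration. The first version is the simple, uniform style that relies on join() being cheap for a one-element list. The second adds the kind of test-and-branch guard that the quoted text argues should not be needed in Python code:

```python
def strip_separator(text, sep):
    """Split text on sep and paste the remaining pieces back together."""
    pieces = text.split(sep)
    # If sep does not occur, pieces is a one-element list, and a join()
    # that special-cases this can hand the element back without copying.
    return "".join(pieces)


def strip_separator_guarded(text, sep):
    """Same result, but with an explicit fast path written in Python."""
    if sep not in text:
        # Separator not found: return the input untouched.
        return text
    return "".join(text.split(sep))
```

Both functions return equal strings for every input. The difference is only in who pays for the "separator wasn't found" case: the C implementation of join(), or an extra branch in every caller.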
Seems like we have to rewrite PyUnicode_Join().
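
To make the difference between the two implementations concrete, here is a rough Python model of the two strategies described above. It is only a sketch under stated assumptions: the real code is C, the function names are hypothetical, and the list/tuple check merely imitates what PySequence_Fast() does.

```python
def join_via_sequence(sep, items):
    """Model of the str.join approach: get a real sequence, check its length."""
    # Rough analogue of PySequence_Fast(): lists and tuples are used as-is,
    # anything else is copied into a list first.
    seq = items if isinstance(items, (list, tuple)) else list(items)
    if len(seq) == 0:
        return ""
    if len(seq) == 1 and type(seq[0]) is str:
        # The special case under discussion: return the item itself.
        return seq[0]
    result = seq[0]
    for item in seq[1:]:
        result = result + sep + item
    return result


def join_via_iterator(sep, items):
    """Model of the iterator approach: the length is unknown up front."""
    result = ""
    first = True
    for item in iter(items):
        # By the time we learn there was only one item, we have already
        # started building the result, so there is no natural place for
        # the single-item shortcut.
        if not first:
            result = result + sep
        result = result + item
        first = False
    return result
```

With s = "foo", join_via_sequence("", [s]) returns s itself by construction, while join_via_iterator makes no such promise. Adding the shortcut to the second style means knowing the length before building anything, which is why it pushes toward the sequence-based structure.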