[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
walter at livinglogic.de
Wed Aug 24 18:59:11 CEST 2005
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>>Martin v. Löwis wrote:
>>>Walter Dörwald wrote:
>>>>I think a maxsplit argument (just as for unicode.split()) would help
>>>Correct - that would allow to get rid of the quadratic part.
>>OK, such a patch should be rather simple. I'll give it a try.
> Actually, on second thought - it would not remove the quadratic part.
At least it would remove the quadratic number of calls to
_PyUnicodeUCS2_IsLinebreak(): for each character it would be called only
once.
> You would still copy the rest string completely on each
> split. So on the first split, you copy N lines (one result line,
> and N-1 lines into the rest string), on the second split, N-2
> lines, and so on, totalling N*N/2 line copies again.
OK, that's true.
We could prevent string copying if we kept the unsplit string and the
position of the current line terminator, but this would require a "first
position after a line terminator" method.
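A minimal sketch of this index-based idea, assuming a hypothetical
"first position after a line terminator" helper (all names here are
invented, and the character-by-character scan is written in pure Python
for clarity; the real work would happen in C):

```python
# Line-break characters recognized by unicode.splitlines().
LINEBREAKS = "\r\n\v\f\x1c\x1d\x1e\x85\u2028\u2029"

def next_line_end(text, start):
    """Return the index just past the next line terminator at or after
    *start*, or len(text) if there is none."""
    i, n = start, len(text)
    while i < n:
        if text[i] in LINEBREAKS:
            # Treat "\r\n" as a single terminator.
            if text[i] == "\r" and i + 1 < n and text[i + 1] == "\n":
                return i + 2
            return i + 1
        i += 1
    return n

class LineScanner:
    """Yield lines from one immutable buffer: each line is copied
    exactly once, and the rest string is never re-copied."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def readline(self):
        if self.pos >= len(self.text):
            return ""
        end = next_line_end(self.text, self.pos)
        line = self.text[self.pos:end]  # copies only this one line
        self.pos = end
        return line
```

Since only an integer position advances between calls, the total work
over the whole buffer is O(N) instead of O(N*N/2) copies.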
> The only
> thing you save is the join (as the rest is already joined), and
> the IsLineBreak calls (which are necessary only for the first line).
> Please see python.org/sf/1268314;
The last part of the patch seems to be more related to bug #1235646.
With the patch test_pep263 and test_codecs fail (and test_parser, but
this might be unrelated):
python Lib/test/test_pep263.py gives the following output:

    File "Lib/test/test_pep263.py", line 22
    SyntaxError: list index out of range

test_codecs.py has the following two complaints:

    line 366, in readline
        self.charbuffer = lines + self.charbuffer
    IndexError: list index out of range

    line 336, in readline
        line = result.splitlines(False)
    NameError: global name 'result' is not defined
> it solves the problem by
> keeping the splitlines result. It only invokes IsLineBreak
> once per character, and also copies each character only once,
> and allocates each line only once, totalling in O(N) for
> these operations. It still does contain a quadratic operation:
> the lines are stored in a list, and the result list is
> removed from the list with del lines. This copies N-1
> pointers, resulting in N*N/2 pointer copies. That should still
> be much faster than the current code.
Using collections.deque() should get rid of this problem.
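A minimal sketch of the deque suggestion, with invented sample data:
deleting the first element of a list shifts every remaining pointer
(O(N) per pop, O(N*N/2) overall), while collections.deque.popleft()
is O(1) per pop:

```python
from collections import deque

# Pending decoded lines, as splitlines() would produce them.
lines = deque("line%d\n" % i for i in range(5))

# O(1) per call; no pointer copying of the remaining elements,
# unlike "del lines[0]" on a plain list.
first = lines.popleft()
```

deque also supports appendleft(), which is handy for pushing a partial
line back onto the front of the buffer.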
>>unicodelinebreaks = u"".join(unichr(c) for c in xrange(0,
>>sys.maxunicode) if unichr(c).islinebreak())
> That is very inefficient. I would rather add a static list
> to the string module, and have a test that says
> assert str.unicodelinebreaks == u"".join(ch for ch in (unichr(c) for c
> in xrange(0, sys.maxunicode)) if unicodedata.bidirectional(ch)=='B' or
You mean, in the test suite?
> unicodelinebreaks could then be defined as
> # u"\r\n\x1c\x1d\x1e\x85\u2028\u2029"
That might be better, as this definition won't change very often.
BTW, why the decode() call? For a Python without unicode?
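As a cross-check that needs no hand-maintained list, the break
characters can also be recovered from splitlines() itself rather than
from unicodedata (shown here in Python 3 spelling, chr/range instead
of unichr/xrange; the 0x3000 scan bound is an assumption, since all
known break characters lie well below it). Note that this also picks
up \v and \f, which splitlines() treats as breaks even though they are
absent from the literal above:

```python
# A character c is a line break iff inserting it between two letters
# makes splitlines() produce more than one line.
breaks = "".join(
    chr(c) for c in range(0x3000)
    if len(("a" + chr(c) + "a").splitlines()) > 1
)
```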
>>OK, this would mean we'd have to distinguish between a direct call to
>>read() and one done by readline() (which we do anyway through the
> See my patch. If we have cached lines, we don't need to call .read
> at all.
I wonder what happens if calls to read() and readline() are mixed (e.g.
if I'm reading Fortran source or anything with a fixed line header):
read() would be used to read the first n characters (which joins the line
buffer) and readline() reads the rest (which would split it again), etc.
(Of course this could be done via a single readline() call.)
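A rough sketch of that join/split churn, using an illustrative
MixedReader class (an invented stand-in, not the actual
codecs.StreamReader): read(n) consumes from the joined buffer, while
each readline() re-splits whatever is left:

```python
class MixedReader:
    """Toy reader over an in-memory string; every readline() after a
    read() pays for a fresh splitlines() plus a join()."""
    def __init__(self, text):
        self._rest = text  # the joined, not-yet-consumed buffer

    def read(self, n=-1):
        if n < 0:
            data, self._rest = self._rest, ""
        else:
            data, self._rest = self._rest[:n], self._rest[n:]
        return data

    def readline(self):
        lines = self._rest.splitlines(True)
        if not lines:
            return ""
        self._rest = "".join(lines[1:])  # the churn discussed above
        return lines[0]
```

Alternating read() and readline() on a large buffer therefore re-splits
and re-joins the tail over and over, which is exactly the pattern a
cached-lines design has to handle.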
But I think a maxsplit argument for splitlines() would make sense
independent of this problem.
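For illustration, such a maxsplit could be emulated in pure Python.
splitlines_max is a hypothetical helper (str.splitlines() itself takes
no such argument): it stops after at most maxsplit breaks and returns
the unsplit tail as the final element, mirroring str.split(sep,
maxsplit):

```python
def splitlines_max(text, keepends=False, maxsplit=-1):
    """Like text.splitlines(keepends), but split at most *maxsplit*
    times; the remainder is returned unsplit as the last element."""
    LINEBREAKS = "\r\n\v\f\x1c\x1d\x1e\x85\u2028\u2029"
    result = []
    pos, n = 0, len(text)
    while pos < n and maxsplit != 0:
        i = pos
        while i < n and text[i] not in LINEBREAKS:
            i += 1
        if i == n:
            break
        # "\r\n" counts as one terminator.
        end = i + 2 if text[i] == "\r" and i + 1 < n and text[i + 1] == "\n" else i + 1
        result.append(text[pos:end] if keepends else text[pos:i])
        pos = end
        if maxsplit > 0:
            maxsplit -= 1
    if pos < n:
        result.append(text[pos:])
    return result
```

With maxsplit=1 a readline() implementation gets exactly one finished
line plus the untouched rest, avoiding the IsLinebreak scan over the
whole tail.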