[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
walter at livinglogic.de
Wed Aug 24 18:59:11 CEST 2005
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>>Martin v. Löwis wrote:
>>>Walter Dörwald wrote:
>>>>I think a maxsplit argument (just as for unicode.split()) would help
>>>Correct - that would allow to get rid of the quadratic part.
>>OK, such a patch should be rather simple. I'll give it a try.
> Actually, on second thought - it would not remove the quadratic part.
At least it would remove the quadratic number of calls to
_PyUnicodeUCS2_IsLinebreak(): for each character it would be called only
once.
> You would still copy the rest string completely on each
> split. So on the first split, you copy N lines (one result line,
> and N-1 lines into the rest string), on the second split, N-2
> lines, and so on, totalling N*N/2 line copies again.
OK, that's true.
We could prevent string copying if we kept the unsplit string and the
position of the current line terminator, but this would require a "first
position after a line terminator" method.
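A minimal sketch of this index-based idea, assuming a hypothetical
"first position after a line terminator" helper (all names here are
invented, and the character-by-character scan is written in pure Python
for clarity; the real work would happen in C):

```python
# Line-break characters recognized by unicode.splitlines().
LINEBREAKS = "\r\n\v\f\x1c\x1d\x1e\x85\u2028\u2029"

def next_line_end(text, start):
    """Return the index just past the next line terminator at or after
    *start*, or len(text) if there is none."""
    i, n = start, len(text)
    while i < n:
        if text[i] in LINEBREAKS:
            # Treat "\r\n" as a single terminator.
            if text[i] == "\r" and i + 1 < n and text[i + 1] == "\n":
                return i + 2
            return i + 1
        i += 1
    return n

class LineScanner:
    """Yield lines from one immutable buffer: each line is copied
    exactly once, and the rest string is never re-copied."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def readline(self):
        if self.pos >= len(self.text):
            return ""
        end = next_line_end(self.text, self.pos)
        line = self.text[self.pos:end]  # copies only this one line
        self.pos = end
        return line
```

Since only an integer position advances between calls, the total work
over the whole buffer is O(N) instead of O(N*N/2) copies.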
> The only
> thing you save is the join (as the rest is already joined), and
> the IsLineBreak calls (which are necessary only for the first line).
> Please see python.org/sf/1268314;
The last part of the patch seems to be more related to bug #1235646.
With the patch test_pep263 and test_codecs fail (and test_parser, but
this might be unrelated):
python Lib/test/test_pep263.py gives the following output:

    File "Lib/test/test_pep263.py", line 22
    SyntaxError: list index out of range

test_codecs.py has the following two complaints:

    line 366, in readline
        self.charbuffer = lines + self.charbuffer
    IndexError: list index out of range

    line 336, in readline
        line = result.splitlines(False)
    NameError: global name 'result' is not defined
> it solves the problem by
> keeping the splitlines result. It only invokes IsLineBreak
> once per character, and also copies each character only once,
> and allocates each line only once, totalling in O(N) for
> these operations. It still does contain a quadratic operation:
> the lines are stored in a list, and the result list is
> removed from the list with del lines. This copies N-1
> pointers, resulting in N*N/2 pointer copies. That should still
> be much faster than the current code.
Using collections.deque() should get rid of this problem.
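A minimal sketch of the deque suggestion, with invented sample data:
deleting the first element of a list shifts every remaining pointer
(O(N) per pop, O(N*N/2) overall), while collections.deque.popleft()
is O(1) per pop:

```python
from collections import deque

# Pending decoded lines, as splitlines() would produce them.
lines = deque("line%d\n" % i for i in range(5))

# O(1) per call; no pointer copying of the remaining elements,
# unlike "del lines[0]" on a plain list.
first = lines.popleft()
```

deque also supports appendleft(), which is handy for pushing a partial
line back onto the front of the buffer.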
>>unicodelinebreaks = u"".join(unichr(c) for c in xrange(0,
>>sys.maxunicode) if unichr(c).islinebreak())
> That is very inefficient. I would rather add a static list
> to the string module, and have a test that says
> assert str.unicodelinebreaks == u"".join(ch for ch in (unichr(c) for c
> in xrange(0, sys.maxunicode)) if unicodedata.bidirectional(ch)=='B' or
You mean, in the test suite?
> unicodelinebreaks could then be defined as
> # u"\r\n\x1c\x1d\x1e\x85\u2028\u2029"
That might be better, as this definition won't change very often.
BTW, why the decode() call? For a Python without unicode?
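As a cross-check that needs no hand-maintained list, the break
characters can also be recovered from splitlines() itself rather than
from unicodedata (shown here in Python 3 spelling, chr/range instead
of unichr/xrange; the 0x3000 scan bound is an assumption, since all
known break characters lie well below it). Note that this also picks
up \v and \f, which splitlines() treats as breaks even though they are
absent from the literal above:

```python
# A character c is a line break iff inserting it between two letters
# makes splitlines() produce more than one line.
breaks = "".join(
    chr(c) for c in range(0x3000)
    if len(("a" + chr(c) + "a").splitlines()) > 1
)
```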
>>OK, this would mean we'd have to distinguish between a direct call to
>>read() and one done by readline() (which we do anyway through the
> See my patch. If we have cached lines, we don't need to call .read
> at all.
I wonder what happens if calls to read() and readline() are mixed (e.g.
if I'm reading Fortran source or anything with a fixed line header):
read() would be used to read the first n characters (which joins the line
buffer) and readline() reads the rest (which would split it again), etc.
(Of course this could be done via a single readline() call.)
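A rough sketch of that join/split churn, using an illustrative
MixedReader class (an invented stand-in, not the actual
codecs.StreamReader): read(n) consumes from the joined buffer, while
each readline() re-splits whatever is left:

```python
class MixedReader:
    """Toy reader over an in-memory string; every readline() after a
    read() pays for a fresh splitlines() plus a join()."""
    def __init__(self, text):
        self._rest = text  # the joined, not-yet-consumed buffer

    def read(self, n=-1):
        if n < 0:
            data, self._rest = self._rest, ""
        else:
            data, self._rest = self._rest[:n], self._rest[n:]
        return data

    def readline(self):
        lines = self._rest.splitlines(True)
        if not lines:
            return ""
        self._rest = "".join(lines[1:])  # the churn discussed above
        return lines[0]
```

Alternating read() and readline() on a large buffer therefore re-splits
and re-joins the tail over and over, which is exactly the pattern a
cached-lines design has to handle.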
But I think a maxsplit argument for splitlines() would make sense
independent of this problem.
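For illustration, such a maxsplit could be emulated in pure Python.
splitlines_max is a hypothetical helper (str.splitlines() itself takes
no such argument): it stops after at most maxsplit breaks and returns
the unsplit tail as the final element, mirroring str.split(sep,
maxsplit):

```python
def splitlines_max(text, keepends=False, maxsplit=-1):
    """Like text.splitlines(keepends), but split at most *maxsplit*
    times; the remainder is returned unsplit as the last element."""
    LINEBREAKS = "\r\n\v\f\x1c\x1d\x1e\x85\u2028\u2029"
    result = []
    pos, n = 0, len(text)
    while pos < n and maxsplit != 0:
        i = pos
        while i < n and text[i] not in LINEBREAKS:
            i += 1
        if i == n:
            break
        # "\r\n" counts as one terminator.
        end = i + 2 if text[i] == "\r" and i + 1 < n and text[i + 1] == "\n" else i + 1
        result.append(text[pos:end] if keepends else text[pos:i])
        pos = end
        if maxsplit > 0:
            maxsplit -= 1
    if pos < n:
        result.append(text[pos:])
    return result
```

With maxsplit=1 a readline() implementation gets exactly one finished
line plus the untouched rest, avoiding the IsLinebreak scan over the
whole tail.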