[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Walter Dörwald walter at livinglogic.de
Wed Aug 24 20:16:39 CEST 2005


Martin v. Löwis wrote:

> Walter Dörwald wrote:
> 
>>At least it would remove the quadratic number of calls to
>>_PyUnicodeUCS2_IsLinebreak(). For each character it would be called only
>>once.
> 
> Correct. However, I very much doubt that this is the cause of the
> slowdown.

Probably. We'd need a test with the original Argon source to really know.

>>The last part of the patch seems to be more related to bug #1235646.
> 
> You mean the last chunk (linebuffer=None)? This is just the extension
> to reset.

Ouch, you're right: that part of the "cvs diff" output came from my 
checkout, not from your patch. I have so many Python checkouts that I 
sometimes forget which is which! ;)

>>With the patch test_pep263 and test_codecs fail (and test_parser, but
>>this might be unrelated):
> 
> Oops, I thought I ran the test suite, but apparently with the patch
> removed. New version uploaded.

Looks much better now.

>>Using collections.deque() should get rid of this problem.
> 
> Alright. There are so many types in Python I've never heard of :-)

The problem is that unicode.splitlines() returns a list, so the push/pop 
performance advantage of collections.deque might be eaten by having to 
create a collections.deque object in the first place.
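
Just to make that trade-off concrete (made-up input, and obviously not 
meant as a benchmark):

    import collections

    data = u"one\ntwo\nthree\n"        # stand-in for the decoded input

    # list: pop(0) has to shift all remaining lines, so a readline()
    # loop over many lines is O(n**2)
    lines = data.splitlines(True)
    while lines:
        line = lines.pop(0)

    # deque: popleft() is O(1), but splitlines() still builds a list
    # first, which then gets copied into the deque
    lines = collections.deque(data.splitlines(True))
    while lines:
        line = lines.popleft()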

>>You mean, in the test suite?
> 
> Right.
> 
>>BTW, why the decode() call? For a Python without unicode?
> 
> Right. Not sure what people think whether this should still be
> supported, but I keep supporting it whenever I think of it.

OK, so should we add this for 2.4.2 or only for 2.5?

Should this really be put into string.py, or should it be a class 
attribute of unicode? (At least that's what was proposed for the other 
strings in string.py, string.whitespace etc., too.)
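
For reference, here's a quick (purely illustrative) way to enumerate the 
characters that unicode.splitlines() currently treats as line breaks; 
the name unicodelinebreaks is just a placeholder, whatever we end up 
calling the constant:

    # Collect the BMP code points that unicode.splitlines() breaks on.
    # 'unicodelinebreaks' is only a placeholder name for the constant.
    unicodelinebreaks = u''.join(
        unichr(c) for c in xrange(0x10000)
        if len((u'a' + unichr(c) + u'a').splitlines()) > 1)
    print repr(unicodelinebreaks)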

>>I wonder what happens, if calls to read() and readline() are mixed (e.g.
>>if I'm reading Fortran source or anything with a fixed line header):
>>read() would be used to read the first n characters (which joins the line
>>buffer) and readline() reads the rest (which would split it again) etc.
>>(Of course this could be done via a single readline() call).
> 
> Then performance would drop again - it should still be correct, though.
> 
> If this becomes a frequent problem, we could satisfy read requests
> from the split lines as well (i.e. join as many lines as you need).
> However, I would rather expect that callers of read() typically want
> the entire file, or want to read in large chunks (with no line
> orientation at all).

Agreed! Don't fix a bug that hasn't been reported! ;)

>>But I think a maxsplit argument for splitlines() would make sense
>>independent of this problem.
> 
> I'm not so sure anymore. It is good for consistency, but I doubt there
> are actual use cases: how often do you want only the first n lines
> of some string? Reading the first n lines of a file might be an
> application, but then, you would rather use .readline() directly.

Not every unicode string is read from a StreamReader.

> For readline, I don't think there is a clear case for splitting of
> only the first line (unless you want to return an index instead of
> the rest string): if the application eventually wants all of the
> data, we better split it right away into individual strings, instead
> of dealing with a gradually decreasing trailer.

True, this would be best for a readline loop.

Another solution would be to have a unicode.itersplitlines() and store 
the iterator. Then we wouldn't need a maxsplit because you can simply 
stop iterating once you have what you want.
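
Just a rough pure-Python sketch of the idea (the method doesn't exist, 
and a real implementation would of course live in C next to 
splitlines()):

    from itertools import islice

    # Hypothetical generator version of splitlines() (without keepends).
    # 'breaks' is meant to be the set of characters splitlines() knows.
    def itersplitlines(text,
                       breaks=u'\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029'):
        start = i = 0
        n = len(text)
        while i < n:
            if text[i] in breaks:
                yield text[start:i]
                if text[i] == u'\r' and i + 1 < n and text[i+1] == u'\n':
                    i += 1              # treat \r\n as a single break
                start = i + 1
            i += 1
        if start < n:
            yield text[start:]          # trailing data without a line break

    # Take the first two lines and simply stop - no maxsplit needed.
    print list(islice(itersplitlines(u"one\ntwo\nthree\nfour\n"), 2))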

> Anyway, I don't think we should go back to C's readline/fgets. This
> is just too messy wrt. buffering and text vs. binary mode.

I don't know about C's readline, but StreamReader.read() and 
StreamReader.readline() are messy enough. At least they're something we 
can fix ourselves.

> I wish
> Python would stop using stdio entirely.

So reverting to the 2.3 behaviour for simple codecs is out?

Bye,
    Walter Dörwald

