[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
"Martin v. Löwis"
martin at v.loewis.de
Wed Aug 24 19:38:54 CEST 2005
Walter Dörwald wrote:
> At least it would remove the quadratic number of calls to
> _PyUnicodeUCS2_IsLinebreak(). For each character it would be called only
Correct. However, I very much doubt that this is the cause of the
> The last part of the patch seems to be more related to bug #1235646.
You mean the last chunk (linebuffer=None)? This is just the extension
> With the patch test_pep263 and test_codecs fail (and test_parser, but
> this might be unrelated):
Oops, I thought I ran the test suite, but apparently with the patch
removed. New version uploaded.
> Using collections.deque() should get rid of this problem.
Alright. There are so many types in Python I've never heard of :-)
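For illustration only, and assuming the problem in question is the cost of taking lines off the front of a list-based line buffer (the variable names below are made up and not taken from the patch), the difference could be sketched like this:

```python
from collections import deque

# Hypothetical line buffer, filled once from an already decoded chunk.
lines = "first\nsecond\nthird\n".splitlines(True)

# A list works, but list.pop(0) shifts every remaining element: O(n) per call.
list_buffer = list(lines)
first_from_list = list_buffer.pop(0)

# deque.popleft() removes from the front in O(1).
deque_buffer = deque(lines)
first_from_deque = deque_buffer.popleft()
```

Both containers hand out the same line; only the cost of removing it from the front differs.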
> You mean, in the test suite?
> BTW, why the decode() call? For a Python without unicode?
Right. I'm not sure whether people think this should still be
supported, but I keep supporting it whenever I think of it.
> I wonder what happens, if calls to read() and readline() are mixed (e.g.
> if I'm reading Fortran source or anything with a fixed line header):
> read() would be used to read the first n character (which joins the line
> buffer) and readline() reads the rest (which would split it again) etc.
> (Of course this could be done via a single readline() call).
Then performance would drop again - it should still be correct, though.
If this becomes a frequent problem, we could satisfy read requests
from the split lines as well (i.e. join as many lines as you need).
However, I would rather expect that callers of read() typically want
the entire file, or want to read in large chunks (with no line
orientation at all).
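A rough sketch of what satisfying read() from the split lines might look like. LineBuffer and its methods are hypothetical names chosen for this example; this is not the codecs.StreamReader code, and it ignores refilling the buffer from the underlying stream:

```python
from collections import deque


class LineBuffer:
    """Hypothetical buffer holding already-split lines (line ends kept)."""

    def __init__(self, text):
        self.lines = deque(text.splitlines(True))

    def readline(self):
        # Cheap path: hand out the next pre-split line.
        return self.lines.popleft() if self.lines else ""

    def read(self, size):
        # Join as many buffered lines as needed to supply `size` characters.
        parts = []
        needed = size
        while needed > 0 and self.lines:
            line = self.lines.popleft()
            if len(line) > needed:
                # Put the unread remainder back at the front of the buffer.
                self.lines.appendleft(line[needed:])
                line = line[:needed]
            parts.append(line)
            needed -= len(line)
        return "".join(parts)


# Mixed use, as in the fixed-line-header case quoted above.
buf = LineBuffer("C     comment\n      X = 1\n")
header = buf.read(6)      # fixed-width line header
rest = buf.readline()     # remainder of the same line
```

Here read() never re-joins the whole buffer; it only touches the lines it needs and pushes back at most one partial line.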
> But, I think a maxsplit argument for splitlines() would make sense
> independent of this problem.
I'm not so sure anymore. It is good for consistency, but I doubt there
are actual use cases: how often do you want only the first n lines
of some string? Reading the first n lines of a file might be an
application, but then, you would rather use .readline() directly.
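As a small example of the file case, the first n lines can be taken straight from the stream without any maxsplit support. The io.StringIO object here is only a stand-in for an open text file:

```python
import io
from itertools import islice

stream = io.StringIO("one\ntwo\nthree\nfour\n")  # stand-in for an open file
first_two = list(islice(stream, 2))              # take only the first two lines
remainder = stream.read()                        # the rest is still unread
```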
For readline, I don't think there is a clear case for splitting off
only the first line (unless you want to return an index instead of
the remaining string): if the application eventually wants all of the
data, we had better split it right away into individual strings, instead
of dealing with a gradually decreasing trailer.
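The difference between the two strategies can be modelled roughly as follows. The function names are invented for this sketch, and the "scanned" count is a simple model of how many characters a line-break search has to look at, not a measurement of the real codec:

```python
def readlines_with_trailer(data):
    """Split off one line per call and keep the rest as a trailer string."""
    lines, scanned = [], 0
    while data:
        scanned += len(data)              # splitlines() scans the whole trailer
        first = data.splitlines(True)[0]
        lines.append(first)
        data = data[len(first):]          # copy of a slowly shrinking trailer
    return lines, scanned


def readlines_split_once(data):
    """Split everything up front; each character is scanned once."""
    return data.splitlines(True), len(data)


text = "x\n" * 1000
slow_lines, slow_scanned = readlines_with_trailer(text)
fast_lines, fast_scanned = readlines_split_once(text)
```

Both functions return the same lines, but in this model the trailer approach scans a number of characters that grows quadratically with the number of lines, while splitting once stays linear.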
Anyway, I don't think we should go back to C's readline/fgets. This
is just too messy wrt. buffering and text vs. binary mode. I wish
Python would stop using stdio entirely.