[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
walter at livinglogic.de
Wed Aug 24 22:14:32 CEST 2005
On 24.08.2005 at 21:15, Martin v. Löwis wrote:
> Walter Dörwald wrote:
>>> Right. Not sure what people think whether this should still be
>>> supported, but I keep supporting it whenever I think of it.
>> OK, so should we add this for 2.4.2 or only for 2.5?
> You mean, string.unicodelinebreaks?
> I think something needs to be
> done to fix the performance problem. In doing so, API changes
> might occur. We should not add API changes in 2.4.2 unless they
> contribute to the bug fix, and even then, the release manager
> probably needs to approve them (in any case, they certainly
> need to be backwards compatible)
OK. Your version of the patch (without replacing line =
line.splitlines(False) with something better) might be enough for
>> Should this really be put into string.py, or should it be a class
>> attribute of unicode? (At least that's what was proposed for the
>> strings in string.py (string.whitespace etc.) too.)
> If the 2.4.2 fix is based on this kind of data, I think it should go
> into a private attribute of codecs.py.
I think codecs.unicodelinebreaks has one big problem: it will not
work for codecs that do str->str decoding.
> For 2.5, I would put it
> into strings for tradition. There is no point in having some of these
> constants in strings and others as class attributes (unless we also
> add them as class attributes in 2.5, in which case adding
> unicodelinebreaks into strings would be pointless).
> So I think in 2.5, I would like to see
> # string.py
> ascii_letters = str.ascii_letters
> in which case unicode.linebreaks would be the right spelling.
And it would have the advantage that it could work with both str and
unicode if we had both str.linebreaks and unicode.linebreaks.
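To make the idea of a linebreaks constant concrete, here is a rough sketch of how such a string could be derived by simply asking splitlines() which characters it breaks on. It is written for Python 3, where str is the Unicode type; find_linebreaks is a hypothetical helper name, and neither str.linebreaks nor unicode.linebreaks exists, they are only the spelling proposed in this thread.

```python
def find_linebreaks(limit=0x10000):
    """Return every character below `limit` that splitlines() breaks on.

    Hypothetical helper; the default limit covers the BMP only.
    """
    found = []
    for codepoint in range(limit):
        char = chr(codepoint)
        # "a" + char + "b" splits into two pieces only if char ends a line.
        if len(("a" + char + "b").splitlines()) == 2:
            found.append(char)
    return "".join(found)


# The kind of constant that might be exposed as a class attribute.
linebreaks = find_linebreaks()
```

A codec could then test a character with "char in linebreaks" instead of calling splitlines() on every chunk, which is presumably where the cost discussed in this thread comes from.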
>>> I'm not so sure anymore. It is good for consistency, but I doubt there
>>> are actual use cases: how often do you want only the first n lines
>>> of some string? Reading the first n lines of a file might be an
>>> application, but then, you would rather use .readline() directly.
>> Not every unicode string is read from a StreamReader.
> Sure: but how often do you want to fetch the first line of a Unicode
> string you happen to have in memory, without iterating over all lines?
I don't know. The only obvious spot in the standard library (apart
from codecs.py) seems to be
def shortdescription(self): return self.description().splitlines()
>> Another solution would be to have a unicode.itersplitlines() and
>> the iterator. Then we wouldn't need a maxsplit because you simply can
>> stop iterating once you have what you want.
> That might work. I would then ask for itersplitlines to return pairs
> of (line, truncated) so you can easily know whether you merely ran
> into the end of the string, or whether you got a complete line
> (although it might be a bit too specific for the readlines() case)
Or maybe (line, terminatorlength), which gives you the same info
(terminatorlength == 0 means truncated) and makes it easy to strip
the terminator.
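A minimal sketch of what such an iterator might look like, written as a plain Python 3 function rather than a method. No itersplitlines() exists on str or unicode; the name comes from the proposal above, and the set of terminators is an assumption taken from what str.splitlines() recognises in Python 3. firstline is a hypothetical helper showing the "stop once you have what you want" use.

```python
import re

# Terminators recognised by str.splitlines() in Python 3 (an assumption of
# this sketch). "\r\n" counts as one terminator, so it is tried first.
_TERMINATOR = re.compile("\r\n|[\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029]")


def itersplitlines(text):
    """Yield (line, terminatorlength) pairs; line includes its terminator."""
    pos = 0
    for match in _TERMINATOR.finditer(text):
        yield text[pos:match.end()], match.end() - match.start()
        pos = match.end()
    if pos < len(text):
        # Trailing data without a terminator: the line is truncated.
        yield text[pos:], 0


def firstline(text):
    """Return the first line without its terminator, scanning no further."""
    for line, terminatorlength in itersplitlines(text):
        return line[:len(line) - terminatorlength]
    return ""
```

Because finditer() is lazy, firstline() only scans up to the first terminator, so no maxsplit argument is needed. A reader could treat terminatorlength == 0 as "need more data" before returning the line.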
>> So reverting to the 2.3 behaviour for simple codecs is out?
> I'm -1, at least. It would also fix the problem at hand, for the
> case. However, it does leave some codecs in the cold, most notably
> UTF-8 (which, in turn, isn't an issue for PEP 262, since UTF-8 is
> built-in in the parser).
You meant PEP 263, right?
> I think the UTF-8 stream reader should support
> all Unicode line breaks, so it should continue to use the Python
> However, UTF-8 is fairly common, so reading a
> UTF-8-encoded file line-by-line shouldn't suck.
OK, so what's missing is a solution for str->str codecs (or we keep
line = line.splitlines(False) and test whether this is fast enough).
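One rough way to run that test is with timeit. This is only a sketch: the sample line and the call count are made-up values, and it measures the splitlines(False) call in isolation rather than a real codec reading a real file.

```python
import timeit

# Hypothetical test data: one short line with its terminator.
line = "spam and eggs\n"


def strip_with_splitlines():
    # The call under discussion; False means "do not keep line ends".
    return line.splitlines(False)


# Time 100000 calls and report the total in seconds.
seconds = timeit.timeit(strip_with_splitlines, number=100000)
print("100000 calls took %.4f s" % seconds)
```

Comparing this figure against a variant that strips the terminator by slicing would show whether the replacement is worth the extra code.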