[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Wed Aug 24 21:15:09 CEST 2005

Walter Dörwald wrote:
>> Right. Not sure what people think whether this should still be
>> supported, but I keep supporting it whenever I think of it.
> 
> 
> OK, so should we add this for 2.4.2 or only for 2.5?

You mean, string.unicodelinebreaks? I think something needs to be
done to fix the performance problem. In doing so, API changes
might occur. We should not add API changes in 2.4.2 unless they
contribute to the bug fix, and even then, the release manager
probably needs to approve them (in any case, they certainly
need to be backwards compatible)

> Should this really be put into string.py, or should it be a class
> attribute of unicode? (At least that's what was proposed for the other
> strings in string.py (string.whitespace etc.) too.

If the 2.4.2 fix is based on this kind of data, I think it should go
into a private attribute of codecs.py. For 2.5, I would put it
into strings for tradition. There is no point in having some of these
constants in strings and others as class attributes (unless we also
add them as class attributes in 2.5, in which case adding
unicodelinebreaks into strings would be pointless).

So I think in 2.5, I would like to see

# string.py
ascii_letters = str.ascii_letters

in which case unicode.linebreaks would be the right spelling.

>> I'm not so sure anymore. It is good for consistency, but I doubt there
>> are actual use cases: how often do you want only the first n lines
>> of some string? Reading the first n lines of a file might be an
>> application, but then, you would rather use .readline() directly.
> 
> 
> Not every unicode string is read from a StreamReader.

Sure: but how often do you want to fetch the first line of a Unicode
string you happen to have in memory, without iterating over all lines
eventually?

> Another solution would be to have a unicode.itersplitlines() and store
> the iterator. Then we wouldn't need a maxsplit because you simply can
> stop iterating once you have what you want.

That might work. I would then ask for itersplitlines to return pairs
of (line, truncated) so you can easily know whether you merely ran
into the end of the string, or whether you got a complete line
(although it might be a bit too specific for the readlines() case)

> So reverting to the 2.3 behaviour for simple codecs is out?

I'm -1, atleast. It would also fix the problem at hand, for the reported
case. However, it does leave some codecs in the cold, most notably
UTF-8 (which, in turn, isn't an issue for PEP 262, since UTF-8 is
built-in in the parser). I think the UTF-8 stream reader should support
all Unicode line breaks, so it should continue to use the Python
approach. However, UTF-8 is fairly common, so that reading an
UTF-8-encoded file line-by-line shouldn't suck.

Regards,
Martin