[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Wed Aug 24 19:35:11 CEST 2005

M.-A. Lemburg wrote:

> Walter Dörwald wrote:
> 
>>I wonder if we should switch back to a simple readline() implementation 
>>for those codecs that don't require the current implementation 
>>(basically every charmap codec). 
> 
> That would be my preference as well. The 2.4 .readline() approach
> is really only needed for codecs that have to deal with encodings
> that:
> 
> a) use multi-byte formats, or
> b) support more line-end formats than just CR, CRLF, LF, or
> c) are stateful.
> 
> This can easily be had by using a mix-in class for
> codecs which do need the buffered .readline() approach.

Should this be a mix-in or should we simply have two base classes? Which 
of those bases/mix-ins should be the default?
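
To make the question concrete, here is a minimal sketch of what the "simple" variant could look like as a mix-in. It is written against the current Python 3 codecs module, not the 2.4 sources under discussion, and the names SimpleReadlineMixin and SimpleLatin1Reader are hypothetical, made up for illustration. It assumes a single-byte, stateless encoding, so the underlying byte stream can find the line end on its own.

```python
import codecs
import io


class SimpleReadlineMixin:
    """Hypothetical mix-in: the byte stream finds the line end, then we decode."""

    def readline(self, size=None, keepends=True):
        # Only safe for single-byte, stateless encodings, where the bytes
        # 0x0A and 0x0D can only ever mean LF and CR.
        raw = self.stream.readline(-1 if size is None else size)
        text, _ = self.decode(raw, self.errors)
        if not keepends:
            if text.endswith("\r\n"):
                text = text[:-2]
            elif text.endswith(("\n", "\r")):
                text = text[:-1]
        return text


class SimpleLatin1Reader(SimpleReadlineMixin, codecs.StreamReader):
    """Hypothetical Latin-1 reader that opts out of the buffered readline()."""

    def decode(self, input, errors="strict"):
        return codecs.latin_1_decode(input, errors)


data = b"one\x85still one\ntwo\r\nthree"

# The mix-in leaves line splitting to the byte stream, so 0x85 stays inside
# the first line.
simple_lines = list(SimpleLatin1Reader(io.BytesIO(data)))

# The stock reader decodes first and splits afterwards, so it also breaks
# the line at NEXT LINE (U+0085).
stock_reader = codecs.getreader("latin-1")(io.BytesIO(data))
stock_first = stock_reader.readline()
```

With a mix-in like this the buffered readline() stays the default and individual codecs opt out; making the simple version the default would just swap which class carries the override.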

>>AFAIK source files are opened in 
>>universal newline mode, so at least we'd get proper treatment of "\n", 
>>"\r" and "\r\n" line ends, but we'd lose u"\x1c", u"\x1d", u"\x1e", 
>>u"\x85", u"\u2028" and u"\u2029" (which are line terminators according 
>>to unicode.splitlines()).
> 
> While the Unicode standard defines these characters as line
> end code points, I think their definition does not necessarily
> apply to data that is converted from a certain encoding to
> Unicode, so that's not a big loss.
> 
> E.g. in ASCII or Latin-1, FILE, GROUP and RECORD
> SEPARATOR and NEXT LINE characters (0x1c, 0x1d, 0x1e, 0x85)
> are not interpreted as line end characters.
> 
> Furthermore, we had no reports of anyone complaining in
> Python 1.6, 2.0 - 2.3 that line endings were not detected
> properly.  All these Python versions relied on the stream's
> .readline() method to get the next line. The only bug reports
> we had were for UTF-16 which falls into the above
> category a) and did not support .readline() until Python 2.4.

True.
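
For reference, a quick way to see the difference being discussed, sketched with Python 3's str and bytes types standing in for the unicode and str types of 2.x. The variable names are only for illustration.

```python
# The six extra terminators listed above.
EXTRA_TERMINATORS = ["\x1c", "\x1d", "\x1e", "\x85", "\u2028", "\u2029"]

# Text-level splitting treats every one of them as a line end.
text_splits = {
    ch: ("left" + ch + "right").splitlines() for ch in EXTRA_TERMINATORS
}

# Byte-level splitting only knows CR, LF and CRLF. Only the first four
# characters can be encoded as Latin-1, so only those are checked here.
byte_splits = {
    ch: ("left" + ch + "right").encode("latin-1").splitlines()
    for ch in EXTRA_TERMINATORS[:4]
}

for ch in EXTRA_TERMINATORS:
    print(repr(ch), "text pieces:", len(text_splits[ch]))
for ch in EXTRA_TERMINATORS[:4]:
    print(repr(ch), "byte pieces:", len(byte_splits[ch]))
```

Splitting at the byte level before decoding therefore keeps these characters inside the line, which is the behaviour of the pre-2.4 readline().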

Bye,
    Walter Dörwald