[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
walter at livinglogic.de
Wed Aug 24 19:35:11 CEST 2005
M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>>I wonder if we should switch back to a simple readline() implementation
>>for those codecs that don't require the current implementation
>>(basically every charmap codec).
> That would be my preference as well. The 2.4 .readline() approach
> is really only needed for codecs that have to deal with encodings that
> a) use multi-byte formats, or
> b) support more line-end formats than just CR, CRLF, LF, or
> c) are stateful.
> This can easily be had by using a mix-in class for
> codecs which do need the buffered .readline() approach.
Should this be a mix-in or should we simply have two base classes? Which
of those bases/mix-ins should be the default?
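To make the mix-in option concrete, here is a minimal sketch, written for current Python 3 rather than 2.4. SimpleReadlineMixin and SimpleLatin1Reader are hypothetical names, not part of the codecs module. The sketch assumes the underlying stream has a byte-level readline() and that the encoding is a stateless single-byte one, so a 0x0A byte always means LF.

```python
import encodings.latin_1
import io


class SimpleReadlineMixin:
    """Hypothetical mix-in: delegate readline() to the byte stream.

    Only safe for stateless single-byte (charmap-style) encodings.
    """

    def readline(self, size=None, keepends=True):
        # Let the underlying stream find the line end (LF only here).
        if size is None:
            data = self.stream.readline()
        else:
            data = self.stream.readline(size)
        # Decode the complete line in one step; no buffering is needed.
        line, _ = self.decode(data, self.errors)
        if not keepends:
            if line.endswith("\r\n"):
                line = line[:-2]
            elif line.endswith(("\n", "\r")):
                line = line[:-1]
        return line


class SimpleLatin1Reader(SimpleReadlineMixin, encodings.latin_1.StreamReader):
    """Latin-1 reader that uses the simple readline() from the mix-in."""


reader = SimpleLatin1Reader(io.BytesIO(b"caf\xe9\nsecond\x85still second\r\n"))
first = reader.readline()
second = reader.readline(keepends=False)
```

Here 0x85 stays inside the second line, because only the stream's own line-end detection is used. With two base classes instead of a mix-in, the same readline() would simply live in the second base.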
>>AFAIK source files are opened in
>>universal newline mode, so at least we'd get proper treatment of "\n",
>>"\r" and "\r\n" line ends, but we'd loose u"\x1c", u"\x1d", u"\x1e",
>>u"\x85", u"\u2028" and u"\u2029" (which are line terminators according
> While the Unicode standard defines these characters as line
> end code points, I think their definition does not necessarily
> apply to data that is converted from a certain encoding to
> Unicode, so that's not a big loss.
> E.g. in ASCII or Latin-1, FILE, GROUP and RECORD
> SEPARATOR and NEXT LINE characters (0x1c, 0x1d, 0x1e, 0x85)
> are not interpreted as line end characters.
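The difference is easy to see in current Python 3, where str.splitlines() treats the separators listed above as line ends while splitting the raw bytes on LF does not. A small illustration only; the sample data is made up.

```python
# Latin-1 maps bytes 0x1c, 0x1d, 0x1e and 0x85 to the code points
# with the same values.
raw = b"one\x1ctwo\x1dthree\x1efour\x85five\nsix"
text = raw.decode("latin-1")

# Byte-level view: only LF ends a line.
byte_lines = raw.split(b"\n")

# Unicode view: FS, GS, RS and NEL are treated as line ends too.
unicode_lines = text.splitlines()

# LINE SEPARATOR and PARAGRAPH SEPARATOR behave the same way.
separator_lines = "a\u2028b\u2029c".splitlines()

print(len(byte_lines), len(unicode_lines), separator_lines)
```

The byte-level split gives two lines and the Unicode split gives six, so a readline() built on the decoded text reports line ends that a Latin-1 reader working on bytes would never see.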
> Furthermore, we had no reports of anyone complaining in
> Python 1.6, 2.0 - 2.3 that line endings were not detected
> properly. All these Python versions relied on the stream's
> .readline() method to get the next line. The only bug reports
> we had were for UTF-16 which falls into the above
> category a) and did not support .readline() until Python 2.4.
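Why UTF-16 falls under a) can be shown with a short example in current Python 3. This is a sketch with made-up sample data: a byte-level readline() stops right after the 0x0A byte, which is in the middle of a two-byte code unit, so the result cannot be decoded.

```python
import codecs
import io

data = "a\nb\n".encode("utf-16-le")  # b"a\x00\n\x00b\x00\n\x00"

# Simple approach: let the byte stream find the line end.
raw_line = io.BytesIO(data).readline()  # stops after the 0x0A byte
try:
    raw_line.decode("utf-16-le")
    simple_ok = True
except UnicodeDecodeError:
    # The high byte of the LF code unit was left behind in the stream.
    simple_ok = False

# Buffered approach: decode first, then look for line ends in the text.
reader = codecs.getreader("utf-16-le")(io.BytesIO(data))
buffered_line = reader.readline()

print(raw_line, simple_ok, repr(buffered_line))
```

The stream reader returned by codecs.getreader() decodes before it looks for line ends, which is the buffered behaviour that multi-byte encodings need.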