[I18n-sig] codecs module, readlines and xreadlines
Poor Yorick
gp@pooryorick.com
Thu, 16 Jan 2003 08:59:48 -0700
Martin v. Löwis wrote:
>"M.-A. Lemburg" <mal@lemburg.com> writes:
>
>>On Windows, the 'r' opens the file in text which mangles the line-end
>>information. You should try to open the file in 'rb' (binary) mode
>>for comparison.
>>
>
>The issue is, of course, that codecs.open is usually meant for text
>data, so comparing 'r' to 'r' is fair, IMO.
>
>>codecs.open() automatically appends the 'b' to the 'r' for you,
>>so this is probably the cause of the problem.
>>
>
Whether the file is opened in binary mode or in text mode, the '\r'
character is still there. It isn't mangled; it's just that UTF-16
encodes each of these characters as two bytes, so \r\n becomes
\x00\r\x00\n (or \r\x00\n\x00 in little-endian byte order).
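
To make the byte layout concrete, here is a minimal sketch. It assumes
Python 3 syntax and the standard "utf-16-be" and "utf-16-le" codecs; the
variable names are only for illustration.

```python
# Encode a CRLF pair in both UTF-16 byte orders (no BOM is written
# by the explicit -be / -le codecs).
crlf = "\r\n"
big_endian = crlf.encode("utf-16-be")
little_endian = crlf.encode("utf-16-le")

print(big_endian)     # b'\x00\r\x00\n'
print(little_endian)  # b'\r\x00\n\x00'

# In neither byte order are the \r and \n bytes adjacent, so a
# byte-level "\r\n" to "\n" translation finds nothing to translate here.
print(b"\r\n" in big_endian, b"\r\n" in little_endian)  # False False
```

Because the '\r' and '\n' bytes are separated by a zero byte, the '\r'
survives the read whether or not the platform translates line endings at
the byte level.
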
The thing is that I AM processing text data. It just happens to be
Unicode text data. The example I used turns into perfectly legible
Chinese characters once it's decoded in Python. I think that people
using the codecs module on Windows to read Unicode text files would
expect codecs.open.readlines to behave exactly like the builtin
open.readlines.
On Windows, the builtin open.readlines automatically removes the "\r"
from each "\r\n" when the file is opened and read in text mode, and a
"\r" is inserted before each "\n" written to the file. To be consistent,
codecs.open.readlines should do the same thing and remove the encoded
"\r" (\x00\r) when the file is opened in text mode.
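
As a rough sketch of the behaviour being asked for: the translation has
to happen after decoding, on characters rather than bytes. The example
below assumes Python 3, where the builtin open() accepts "encoding" and
"newline" arguments; it does not use codecs.open, and the sample text and
temporary file are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample data: two CRLF-terminated lines stored as UTF-16.
raw = "first\r\nsecond\r\n".encode("utf-16")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # newline="" turns translation off, so the decoded "\r" is kept.
    with open(path, encoding="utf-16", newline="") as f:
        untranslated = f.readlines()

    # The default (newline=None) translates "\r\n" to "\n" after decoding.
    with open(path, encoding="utf-16") as f:
        translated = f.readlines()
finally:
    os.remove(path)

print(untranslated)  # ['first\r\n', 'second\r\n']
print(translated)    # ['first\n', 'second\n']
```

The second read is what I would expect from a text-mode readlines on a
Unicode file: the same lines as the first read, minus the "\r".
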
Poor Yorick
gp@pooryorick.com