[I18n-sig] codecs module, readlines and xreadlines

Poor Yorick gp@pooryorick.com
Thu, 16 Jan 2003 08:59:48 -0700


Martin v. Löwis wrote:

>"M.-A. Lemburg" <mal@lemburg.com> writes:
>
>>On Windows, the 'r' opens the file in text which mangles the line-end
>>information. You should try to open the file in 'rb' (binary) mode
>>for comparison.
>>
>
>The issue is, of course, that codecs.open is usually meant for text
>data, so comparing 'r' to 'r' is fair, IMO.
>
>>codecs.open() automatically appends the 'b' to the 'r' for you,
>>so this is probably the cause of the problem.
>>
>
Whether the file is opened in binary mode or in text mode, the '\r' 
character is still there.  It isn't mangled; it's just that in the 
UTF-16 encoding each of these characters is encoded as two bytes, 
so \r\n becomes \x00\r\x00\n (in big-endian byte order).
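
A minimal sketch of the point above, assuming a current Python 3 
interpreter (the sample string and variable names are only 
illustrative).  It shows the bytes produced for "\r\n" in both byte 
orders, and why splitting the raw bytes on "\n" does not work:

```python
# Illustrative sample: one line break between two letters.
text = "a\r\nb"

big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

# Each character becomes two bytes; CR and LF are both still present.
print(big_endian)      # b'\x00a\x00\r\x00\n\x00b'
print(little_endian)   # b'a\x00\r\x00\n\x00b\x00'

# Splitting the undecoded bytes on b"\n" cuts the two-byte LF in half,
# leaving a stray \x00 at the start of the next "line".
broken = little_endian.split(b"\n")
print(broken)          # [b'a\x00\r\x00', b'\x00b\x00']
```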

The thing is that I AM processing text data.  It just happens to be 
Unicode text data.  The example I used turns into perfectly legible 
Chinese characters once it's decoded in Python.  I think that people 
using the codecs module on Windows to read Unicode text files would 
expect codecs.open.readlines to behave exactly like the builtin 
open.readlines.

On Windows systems, when the file is opened in text mode, 
open.readlines automatically removes the "\r" character from each 
"\r\n" that is read, and a \r character is inserted whenever a \n is 
written to the file.  To be consistent, codecs.open.readlines should 
do the same thing and remove the encoded \r (\x00\r in this example) 
when the file is opened in text mode.
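
One way to get the behavior asked for here is to decode first and 
translate line endings afterwards.  The following is only a sketch, 
assuming Python 3; read_text_lines and demo are hypothetical helpers 
written for this illustration, not part of the codecs module:

```python
import os
import tempfile


def read_text_lines(path, encoding):
    """Hypothetical helper: decode the bytes, then turn CRLF into LF."""
    with open(path, "rb") as f:          # binary mode: no byte-level translation
        text = f.read().decode(encoding)
    # The translation happens on decoded characters, so it is
    # independent of how many bytes the encoding uses for "\r".
    return text.replace("\r\n", "\n").splitlines(True)


def demo():
    """Write a small UTF-16 file with CRLF line ends and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, "first\r\nsecond\r\n".encode("utf-16"))
        os.close(fd)
        return read_text_lines(path, "utf-16")
    finally:
        os.remove(path)


print(demo())   # ['first\n', 'second\n']
```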

Poor Yorick
gp@pooryorick.com