[Python-bugs-list] [ python-Feature Requests-691291 ] codecs.open(filename, 'U', 'UTF-16') corrupts text

Tue, 04 Mar 2003 02:12:15 -0800

Feature Requests item #691291, was opened at 2003-02-22 20:21
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=691291&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jason Orendorff (jorend)
>Assigned to: Nobody/Anonymous (nobody)
Summary: codecs.open(filename, 'U', 'UTF-16') corrupts text

Initial Comment:
Tested in Python 2.3a1.

If I write u'Hello\r\nworld\r\n' to a file, then read
it back in 'U' mode, I should get u'Hello\nworld\n'.

However, if I do this using codecs.open() and the
UTF-16 encoding, I get u'Hello\n\nworld\n\n'.

codecs.open() is not 'U'-mode-aware.  The underlying
file is opened in universal newline mode, so the byte
'\x0d' is erroneously translated to '\x0a' before the
UTF-16 codec has a chance to decode it.
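
The mechanism described above can be reproduced without touching a file.
What follows is only a sketch in present-day Python 3, not the Python 2.3
code path: the function name is made up for illustration, and the two
bytes.replace() calls merely stand in for what a universal-newline file
object does to raw bytes. UTF-16-LE is chosen so that no byte order mark
is involved.

```python
def simulate_byte_level_translation(text, encoding):
    """Hypothetical helper: mimic newline translation done on raw bytes.

    The text is encoded, every CR LF pair and every lone CR byte is turned
    into LF (roughly what universal newline mode does), and only then is
    the result decoded.
    """
    raw = text.encode(encoding)
    translated = raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
    return translated.decode(encoding)


# In UTF-16-LE, u'\r\n' is the byte sequence 0d 00 0a 00. The 0d byte is
# never directly followed by 0a, so only the lone-CR rule fires and each
# u'\r' becomes a second u'\n'.
corrupted = simulate_byte_level_translation(u'Hello\r\nworld\r\n', 'utf-16-le')
print(repr(corrupted))
```

Under these assumptions the result is u'Hello\n\nworld\n\n', the same
doubled newlines as in the report.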

The attached unit test should show specifically what it
is that I wish would work.

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2003-03-04 11:12

Message:
Logged In: YES 
user_id=38388

The proper thing to do would be to read the file content
as Unicode and then use the .splitlines() method on the
resulting data. That method knows about the various
line-ending conventions in Unicode, including the Mac, DOS
and Unix variations.
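
A minimal sketch of that approach, written for present-day Python 3
rather than 2.3: the file is read as bytes, decoded as a whole, and only
then split into lines. The helper names and the temporary file are
assumptions made for the example.

```python
import os
import tempfile


def read_lines_decoded_first(path, encoding):
    """Hypothetical helper: decode the whole file, then split into lines."""
    with open(path, 'rb') as f:          # binary mode: no newline translation
        data = f.read().decode(encoding)
    # str.splitlines() works on decoded text, so it sees u'\r\n', u'\r' and
    # u'\n' as characters rather than as stray bytes inside a code unit.
    return data.splitlines()


def demo():
    """Write DOS-style text as UTF-16 to a temporary file and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(u'Hello\r\nworld\r\n'.encode('utf-16'))
        return read_lines_decoded_first(path, 'utf-16')
    finally:
        os.remove(path)


print(demo())
```

Note that .splitlines() drops the line endings, so a caller who wants
u'Hello\nworld\n' back has to re-join the lines itself.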

I don't have time for this, so unassigning it again.

----------------------------------------------------------------------

Comment By: Jack Jansen (jackjansen)
Date: 2003-03-03 13:10

Message:
Logged In: YES 
user_id=45365

The problem is that codecs.open() forces binary mode on the underlying file object, and this defeats the U mode.

My feeling is that it should be okay to open the underlying file in text mode, thereby enabling the U flag to be passed. Opening the file in text mode would break, however, if one of the following conditions is met:
- there are encodings where the bytes 0x0a or 0x0d can occur as part of a character rather than as an end-of-line marker.
- there are libc implementations where opening a file in text mode has
more implications than converting \r or \r\n to \n, i.e. if they change
other bytes as well.

Re-assigning to MAL, as he introduced the binary mode in the first place. If this was just defensive programming, we might try taking it out; if there was a real error case with text mode, then codecs.open should probably at least signal an error if universal newline mode is requested.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-26 14:44

Message:
Logged In: YES 
user_id=38388

I'm turning this into a feature request. codecs.open()
does not support 'U' as file mode.

Assigning to Jack since he introduced the 'U' mode option.
Jack, what can we do about this?

----------------------------------------------------------------------

Comment By: Jason Orendorff (jorend)
Date: 2003-02-22 22:17

Message:
Logged In: YES 
user_id=18139

Tested in Python 2.3a2 as well (the bug is still there).

Note that this isn't limited to UTF-16.  It will affect any
encoding that uses the byte '\x0d' to mean anything other
than u'\r'.  The most common American/European encodings are
safe (ASCII, Latin-1 and friends, and UTF-8).
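
One rough way to check a given encoding, sketched in present-day
Python 3: apply the newline translation to the encoded bytes and compare
the decoded result with the same translation applied to the text. The
function name is hypothetical, and the bytes.replace() calls only
approximate universal newline mode.

```python
def survives_byte_level_translation(text, encoding):
    """Hypothetical check: is byte-level newline translation harmless here?"""
    raw = text.encode(encoding)
    translated = raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
    expected = text.replace(u'\r\n', u'\n').replace(u'\r', u'\n')
    try:
        return translated.decode(encoding) == expected
    except UnicodeDecodeError:
        # Dropping a byte can leave a truncated code unit behind.
        return False


sample = u'Hello\r\nworld\r\n'
for name in ('ascii', 'latin-1', 'utf-8', 'utf-16', 'utf-32'):
    print(name, survives_byte_level_translation(sample, name))

# The damage is not limited to line endings: U+010D is encoded in
# UTF-16-LE as the bytes 0d 01, so its first byte is rewritten as well.
print(survives_byte_level_translation(u'\u010d', 'utf-16-le'))
```

With this sample text the first three encodings pass and UTF-16 and
UTF-32 fail, and the last call shows an ordinary letter being altered
even though the text contains no line ending at all.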

----------------------------------------------------------------------
