[Python-bugs-list] [ python-Feature Requests-691291 ] codecs.open(filename, 'U', 'UTF-16') corrupts text

SourceForge.net noreply@sourceforge.net
Mon, 03 Mar 2003 04:12:00 -0800


Feature Requests item #691291, was opened at 2003-02-22 20:21
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=691291&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jason Orendorff (jorend)
>Assigned to: M.-A. Lemburg (lemburg)
Summary: codecs.open(filename, 'U', 'UTF-16') corrupts text

Initial Comment:
Tested in Python 2.3a1.

If I write u'Hello\r\nworld\r\n' to a file, then read
it back in 'U' mode, I should get u'Hello\nworld\n'.

However, if I do this using codecs.open() and the
UTF-16 encoding, I get u'Hello\n\nworld\n\n'.

codecs.open() is not 'U'-mode-aware.  The underlying
file is opened in universal newline mode, so the byte
'\x0d' is erroneously translated to '\x0a' before the
UTF-16 codec has a chance to decode it.
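
The effect described above can be reproduced without touching the
filesystem. What follows is only a sketch written for a current Python,
not the 2.3 code path itself: the helper translate_newlines_bytewise is a
hypothetical stand-in for what universal newline mode does to the raw
bytes, and UTF-16-LE (no BOM) is assumed to keep the bytes predictable.

```python
def translate_newlines_bytewise(data):
    """Hypothetical stand-in for universal-newline translation on raw bytes."""
    # '\r\n' pairs collapse to '\n' first, then any remaining '\r' becomes '\n'.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


text = "Hello\r\nworld\r\n"

# In UTF-16-LE, '\r\n' is stored as 0d 00 0a 00, so the bytes 0d and 0a
# are never adjacent and the lone 0d is rewritten to 0a.
raw = text.encode("utf-16-le")
corrupted = translate_newlines_bytewise(raw).decode("utf-16-le")

print(repr(corrupted))  # 'Hello\n\nworld\n\n'
```

Each '\r' is turned into a second '\n' rather than being merged with the
following '\n', which matches the doubled newlines reported above.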

The attached unit test should show specifically what it
is that I wish would work.
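
The attached test is not reproduced in this message. As a rough
illustration of the wished-for behaviour only, the sketch below uses
io.open() from later Python versions (it did not exist in 2.3), which
decodes first and translates newlines afterwards. The helper name
roundtrip_universal and the use of a temporary file are assumptions made
for the example.

```python
import io
import os
import tempfile


def roundtrip_universal(text, encoding="utf-16"):
    """Write text verbatim, then read it back with newline translation
    applied to the decoded characters rather than to the raw bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # newline="" writes the characters unchanged.
        with io.open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        # newline=None turns on universal newlines after decoding.
        with io.open(path, "r", encoding=encoding, newline=None) as f:
            return f.read()
    finally:
        os.remove(path)


result = roundtrip_universal("Hello\r\nworld\r\n")
print(repr(result))  # 'Hello\nworld\n'
```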


----------------------------------------------------------------------

Comment By: Jack Jansen (jackjansen)
Date: 2003-03-03 13:10

Message:
Logged In: YES 
user_id=45365

The problem is that codecs.open() forces binary mode on the underlying file object, and this defeats the U mode.

My feeling is that it should be okay to open the underlying file in text mode, thereby enabling the U flag to be passed. Opening the file in text mode would break, however, if one of the following conditions is met:
- there are encodings where 0x0a or 0x0d are valid characters, not end of line.
- there are libc implementations where opening a file in text mode has more implications than converting \r or \r\n to \n, i.e. if they change other bytes as well.

Re-assigning to MAL, as he put the binary mode in in the first place. If this was just defensive programming, we might try taking it out; if there was a real error case with text mode, then codecs.open() should probably at least signal an error if universal newline mode is requested.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-02-26 14:44

Message:
Logged In: YES 
user_id=38388

I'm turning this into a feature request. codecs.open()
does not support 'U' as file mode.

Assigning to Jack since he introduced the 'U' mode option.
Jack, what can we do about this?

----------------------------------------------------------------------

Comment By: Jason Orendorff (jorend)
Date: 2003-02-22 22:17

Message:
Logged In: YES 
user_id=18139

Tested in Python 2.3a2 as well (the bug is still there).

Note that this isn't limited to UTF-16.  It will affect any
encoding that uses the byte '\x0d' to mean anything other
than u'\r'.  The most common American/European encodings are
safe (ASCII, Latin-1 and friends, and UTF-8).
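
A quick way to see which encodings are affected is to compare
byte-level translation against translation done on the decoded text.
This is a sketch under assumptions: the function name
survives_bytewise_translation is hypothetical, the byte-level replace
only approximates universal newline mode, and the encodings listed are
just a sample.

```python
def survives_bytewise_translation(text, encoding):
    """Return True if translating newlines on the encoded bytes gives the
    same result as translating them on the decoded text."""
    raw = text.encode(encoding)
    translated = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    expected = text.replace("\r\n", "\n").replace("\r", "\n")
    try:
        return translated.decode(encoding) == expected
    except UnicodeDecodeError:
        # The byte-level rewrite produced an invalid byte sequence.
        return False


sample = "Hello\r\nworld\r\n"
encodings = ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le")
results = {enc: survives_bytewise_translation(sample, enc) for enc in encodings}

for enc in encodings:
    print(enc, results[enc])
```

For this sample, the single-byte encodings and UTF-8 come through intact,
while the UTF-16 and UTF-32 variants do not.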

