[ python-Bugs-904474 ] File read of Chinese utf-16-le treats upper byte 1A as EOF

Mon Dec 19 05:40:11 CET 2005

Bugs item #904474, was opened at 2004-02-25 11:30
Message generated for change (Settings changed) made by nnorwitz
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=904474&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: None
>Status: Closed
>Resolution: Invalid
Priority: 5
Submitted By: Ron Rother (rrother)
Assigned to: Nobody/Anonymous (nobody)
Summary: File read of Chinese utf-16-le treats upper byte 1A as EOF

Initial Comment:
Any utf-16-le Chinese character with 1A as the most 
significant byte causes remainder of file to be ignored.

code extract:

(utf16_encoder, utf16_decoder, utf16_reader, 
utf16_writer) = codecs.lookup("utf-16-le")

ifile = utf16_reader(open(sys.argv[1],"r"))

t=ifile.read()

When the Chinese character 1A 5C (尚, U+5C1A) is encountered, 
everything from the 5C is discarded.

These 3 lines:
English="You have not selected any books!"
Context=1,[MsgBox "You have not selected any books!"]
Chinese(Simplified)="尚未选择任何书卷！"

are input as:
English="You have not selected any books!"
Context=1,[MsgBox "You have not selected any books!"]
Chinese(Simplified)="
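
As a rough illustration of why this particular character triggers the
truncation, the sketch below prints the UTF-16-LE bytes of U+5C1A. In
little-endian order 0x1A is the low-order byte and is stored first. It is
also the SUB / Ctrl-Z control code, which the C runtime's text mode on
Windows treats as end-of-file. The report does not state the platform, so
Windows is an assumption here, and the variable names are illustrative only.

```python
# Illustration only: inspect the UTF-16-LE bytes of the first Chinese
# character in the sample line.
ch = u"\u5c1a"                    # the character at the start of the string
encoded = ch.encode("utf-16-le")  # little-endian: low-order byte first

SUB = 0x1A                        # Ctrl-Z, the text-mode EOF marker (assumed Windows)
first_byte = bytearray(encoded)[0]

print(["%02X" % b for b in bytearray(encoded)])  # the two bytes, in file order
print(first_byte == SUB)                         # whether the first byte is Ctrl-Z
```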

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2005-10-02 18:19

Message:
Logged In: YES 
user_id=33168

MAL, this seems to come up from time to time.  Perhaps we
should update the doc for open()?  If it's already
documented, could we make it clearer?  Then we should be
able to close this bug.  I think I saw another bug recently
that was similar to this one.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-02-25 14:53

Message:
Logged In: YES 
user_id=38388

I believe there is a misconception here: the open(..., "r")
will cause the file to be opened in C lib's text mode. Since
UTF-16 is binary data, this will lead to problems with line
breaking and file handling in general.

You should try:

import codecs
ifile = codecs.open(filename, 'rb', encoding='utf-16-le')
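
A minimal sketch of the same idea, assuming nothing beyond the standard
library: read the raw bytes in binary mode and decode them explicitly, so
that no text-mode EOF or newline handling touches the UTF-16 data. This is
an equivalent approach rather than the exact call suggested above. The
helper name read_utf16le and the temporary sample file are hypothetical.

```python
import os
import tempfile


def read_utf16le(path):
    """Read a whole UTF-16-LE file as text without C text-mode translation."""
    with open(path, "rb") as f:      # binary mode: bytes are passed through as-is
        raw = f.read()
    return raw.decode("utf-16-le")


# Hypothetical sample modelled on the third line quoted in the report.
sample = u'Chinese(Simplified)="\u5c1a\u672a\u9009\u62e9\u4efb\u4f55\u4e66\u5377\uff01"\r\n'

fd, tmp_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:  # write the test data in binary mode too
        out.write(sample.encode("utf-16-le"))
    text = read_utf16le(tmp_path)
finally:
    os.remove(tmp_path)

print(text == sample)  # the full line, including the text after 0x1A, round-trips
```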

----------------------------------------------------------------------
