[ python-Bugs-1377394 ] read() / readline() blow up if file has even number of char.

Mon Dec 12 14:39:02 CET 2005

Bugs item #1377394, was opened at 2005-12-09 22:43
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1377394&group_id=5470

Category: Unicode
Group: Python 2.4
>Status: Closed
Resolution: None
Priority: 5
Submitted By: superwesman (superwesman)
Assigned to: M.-A. Lemburg (lemburg)
Summary: read() / readline() blow up if file has even number of char.

Initial Comment:
Hello, I am having a problem with the read() and
readline() functions.  I'm using codecs.open() to open
a text file, then using either read() or readline() to
get its contents.  In python 2.4.2, if the file has an
even number of characters, I get a UnicodeDecodeError.
In Python 2.4.1 this works regardless of the character
count.  I've pasted below a sample script and the
sample text file I was running.  This is the command I
executed at the Windows 2000 CMD prompt:

python sample.py sample.txt

Again, in 2.4.1, this works fine - in 2.4.2 it breaks
when the file-to-be-read has an even number of characters.

Thanks.
-w

# start: sample.py

import codecs
import sys

print "open the file"
in_file = codecs.open( sys.argv[1], "r",
"unicode_internal" )
print "read the file"
the_file = in_file.read()
print "close the file"
in_file.close()
print "done"

# end: sample.py

# start: sample.txt
RESULTHOST=vivaldi
RESULTPORT=a
DB_XML=/test/art/jfw/config/DBList.xml
LOGCHECK_IGNORE=art_actions.txt

# end: sample.txt

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-12-12 14:39

Message:
Logged In: YES 
user_id=38388

Closing this bug report as "won't fix" (even though SF seems
to have removed this option from the tracker, or at least I
don't see it in Firefox).

Removing "unicode_internal" from the docs is not an option:
this is a valid encoding, albeit one that depends on the way
Python is built.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-12-12 14:30

Message:
Logged In: YES 
user_id=89016

With the Python 2.4.2 I get the following output both on
Linux and Windows:

open the file
read the file
close the file
done

This is totally independent of the type of line feeds in
sample.txt or the length of the file (even or odd).

> If it is a valid option (that should only be used
> "Python internally" - not sure what that means)
> then it should perform consistently regardless
> of the number of characters in the file, should it not?

unicode_internal just dumps the data bytes of the Unicode
object. This means that (depending on the way Python is
compiled) the length of a unicode_internal encoded byte
string will always be a multiple of 2 or 4. So a byte string
that has an odd number of bytes clearly is broken and
decoding would have the right to complain about that. In
2.4.2 it doesn't, because it's not clear to the StreamReader
API if there's more data available on subsequent calls to
read() (and the last odd byte is silently dropped).
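
The buffering behaviour described above can be sketched on a current
Python 3, with two loud assumptions: unicode_internal is no longer
available there, so "utf-16-le" stands in for it (it also uses two bytes
per code unit, like a little-endian UCS-2 build), and an incremental
decoder stands in for the 2.4.2 StreamReader. The helper name
decode_chunks is made up for this illustration.

```python
import codecs

# Assumption: "utf-16-le" is a stand-in for unicode_internal on a
# little-endian UCS-2 build; both use two bytes per code unit.
def decode_chunks(chunks, final):
    """Feed byte chunks to an incremental decoder (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder("utf-16-le")()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True tells the decoder that no further data will arrive.
    pieces.append(decoder.decode(b"", final=final))
    return "".join(pieces)

odd_data = b"a\x00b\x00c"  # 5 bytes: two complete code units plus a stray byte

# Without an end-of-input signal the stray byte stays in the buffer
# and never shows up in the result.
lenient = decode_chunks([odd_data], final=False)

# With an end-of-input signal the decoder can tell the data is truncated.
try:
    decode_chunks([odd_data], final=True)
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc
```

Here lenient contains only the two complete characters, while the
final=True call raises UnicodeDecodeError. A reader that cannot know
whether more data is coming can only behave like the first call.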

BTW, the data read by your script is probably not what you
might have expected. On a UCS-2 build the result is:

u'\u2023\u7473\u7261\u3a74\u7320\u6d61\u6c70\u2e65\u7874\u0a74\u4552\u5553\u544c\u4f48\u5453\u763d\u7669\u6c61\u6964\u520a\u5345\u4c55\u5054\u524f\u3d54\u0a61\u4244\u585f\u4c4d\u2f3d\u6574\u7473\u612f\u7472\u6a2f\u7766\u632f\u6e6f\u6966\u2f67\u4244\u694c\u7473\u782e\u6c6d\u4c0a\u474f\u4843\u4345\u5f4b\u4749\u4f4e\u4552\u613d\u7472\u615f\u7463\u6f69\u736e\u742e\u7478'

(or something similar depending on your line feeds).
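
To see where those code points come from, here is a small sketch for a
current Python 3. It assumes "utf-16-le" as a stand-in for
unicode_internal on a little-endian UCS-2 build, and it assumes the
decoded data began with the text "# start:", as the output above
suggests. Each pair of ASCII bytes is read as a single 16-bit code unit.

```python
# Assumption: "utf-16-le" stands in for unicode_internal on a
# little-endian UCS-2 build (not available on current Python versions).
raw = b"# start:"  # eight ASCII bytes, assumed start of the decoded data

# Every two bytes are combined into one code unit, low byte first.
misread = raw.decode("utf-16-le")

# Hex value of each resulting code point, for comparison with the output above.
code_units = ["%04x" % ord(ch) for ch in misread]
```

code_units comes out as ['2023', '7473', '7261', '3a74'], matching the
first four code points of the result quoted above: '#' (0x23) and ' '
(0x20) become U+2023, 's' (0x73) and 't' (0x74) become U+7473, and so on.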

----------------------------------------------------------------------

Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-12-10 11:57

Message:
Logged In: YES 
user_id=1188172

I'd suggest removing unicode_internal from the docs.

----------------------------------------------------------------------

Comment By: superwesman (superwesman)
Date: 2005-12-10 00:17

Message:
Logged In: YES 
user_id=1401447

I didn't realize that 'unicode_internal' was not a
legitimate value to pass into this function.  If
'unicode_internal' is not a valid 3rd parameter to
codecs.open(), shouldn't that function complain?  If it is a
valid option (that should only be used "Python internally" -
not sure what that means) then it should perform
consistently regardless of the number of characters in the
file, should it not?

Seems to me that pilot-error uncovered a bug.  If this is
not a valid choice, then codecs.open() should complain.  If
it is valid, it should perform consistently, IMHO.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-12-09 23:04

Message:
Logged In: YES 
user_id=38388

Why would you want to read a file using the Python internal
Unicode encoding (unicode_internal) ?

This is an encoding that is only used Python internally and
should not be used for anything else.

----------------------------------------------------------------------
