Bug with win32 open and utf-16 file

Sun Aug 24 00:13:44 EDT 2003

On Sun, 24 Aug 2003 05:40:56 +0300, Christos "TZOTZIOY" Georgiou
<tzot at sil-tec.gr> wrote:

>On Sun, 24 Aug 2003 00:29:27 GMT, rumours say that derek / nul
><abuseonly at sgrail.org> might have written:
>
>>>[snip opening 'rb' a UTF-16 file]
>>>
>>>>The original file has line terminator characters of 00 0d 00 0a.
>>>>After being read into a variable or a list the line termination characters have
>>>>been changed to 00 0a 00 0a
>
>I believe you haven't explained your situation completely, or maybe I am
>missing something.  The code you posted, which I will quote right here:
>
>>#!c:/program files/python/python.exe
>># win32 python 2.3
>>
>>import sys, string, codecs
>>#eng_file = list (open("c:/program files/microsoft games/train simulator/trains/trainset/dash9/dash9.eng", "rb").read())	# read the whole file
>>eng_file = open("c:/program files/microsoft games/train simulator/trains/trainset/dash9/dash9.eng", "rb").read()	# read the whole file
>>
>>print hexdump (eng_file)						# ok
>
>opens the file in 'read binary' mode, and without further processing you
>pass the data just read to the function hexdump (presumably written by
>you).  The output of the hex dump shows only \u000a and not any \u000d.
>I tried it by downloading your file and ran:

hexdump (not mine)

import cStringIO

def hexdump(data):
    # Format data as rows of 20 bytes: address, hex digits, printable chars.
    addr, bytes, ascii = "$00000000", [' '] * 20, [' '] * 20
    result = cStringIO.StringIO()
    print >>result, "Dump of %d Bytes (type %s)" % (len(data), type(data))
    for i in range(len(data)):
        byte = ord(data[i])
        bytes[i%20], ascii[i%20] = "%02X" % byte, printable_chars[byte]
        if i%4 == 3: bytes[i%20] += " "    # gap after every fourth byte
        if (i % 20) == 19:                 # row complete: emit it, start a new one
            print >>result, addr, ''.join(bytes), ''.join(ascii)
            addr, bytes, ascii = "$%08X" % (i+1), [' '] * 20, [' '] * 20
    i = len(data)
    if i and (i % 20) != 0:                # emit the final, partial row
        print >>result, addr, "%-45s" % ''.join(bytes).strip(), ''.join(ascii)
    # Return at function level so the dump comes back for every input length.
    return result.getvalue()

printable_chars = ['.'] * 256
for __c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"$:;-#+*%&/()=?'":
    printable_chars[ord(__c)] = __c

>>>> data = file("c:/dash9.eng", "rb").read()
>>>> data[-10:]
>'\r\x00\n\x00)\x00\r\x00\n\x00'

Interesting. I have just done the same, and it would appear that it is the
hexdump routine that is changing 0d to 0a ???

Output from my screen:

>>> data[2:]
'S\x00I\x00M\x00I\x00S\x00A\x00@\x00@\x00@\x00@\x00@\x00@\x00@\x00@\x00@\x00@\x00
J\x00I\x00N\x00X\x000\x00D\x000\x00t\x00_\x00_\x00_\x00_\x00_\x00_\x00
\r\x00\n\x00
\r\x00\n\x00
W\x00a\x00g\x00o\x00n\x00 \x00(\x00 \x00D\x00a\x00s\x00h\x009\x00

This sequence is correct.
That will teach me to put faith in code I have not tested!!
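
One way to test a dump routine before trusting it is to compare it with
an independent conversion. The following is only a minimal sketch, assuming
Python 3 (this thread uses Python 2.3) and a hand-built sample rather than
the real dash9.eng; the helper name plain_hex is made up for illustration.

```python
import binascii

# Hypothetical sample, not the real file: BOM, then "A" and CR LF in UTF-16 LE.
sample = b"\xff\xfe" + "A\r\n".encode("utf-16-le")

def plain_hex(data):
    """Return the bytes as space-separated two-digit hex pairs."""
    digits = binascii.hexlify(data).decode("ascii")
    return " ".join(digits[i:i + 2] for i in range(0, len(digits), 2))

print(plain_hex(sample))  # ff fe 41 00 0d 00 0a 00

# Decoding is a second check: the "utf-16" codec consumes the BOM,
# and the text should end with a carriage return and a line feed.
text = sample.decode("utf-16")
print(repr(text))  # 'A\r\n'
```

If a hand-written dump disagrees with this output for the same bytes, the
dump routine is the more likely suspect than the file read.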

>which seems correct to me (\r is 0x0d, \n is 0x0a), and the file is
>indeed utf-16 le (little endian).  Which open command are you using?  If
>you just enter open in the python prompt, does it show:
>
>>>> open
><type 'file'>
>
>or what?

I am not running in interactive mode; I am running the script with
'python apply_physics.pl'.
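
The same question (which open is in use?) can be checked from inside a
script, without the interactive prompt. A minimal sketch, assuming Python 3,
where the built-ins live in the builtins module (this thread uses Python
2.3); the helper name uses_builtin_open is made up for illustration.

```python
import builtins

def uses_builtin_open(namespace):
    """Return True if `open` in the given namespace is the built-in one.

    A namespace with no `open` entry falls back to the built-in, which is
    also how name lookup behaves in a module that never rebinds `open`.
    """
    return namespace.get("open", builtins.open) is builtins.open

# Check the module this code runs in.
print(uses_builtin_open(globals()))

# A namespace where `open` was rebound (for example by a star-import)
# reports False.
print(uses_builtin_open({"open": len}))
```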

I don't understand what the data[-2:] is doing. Could you explain, please, or
point me to some notes on this?

thanks Derek