read(1) returns string of length 2

Wed Nov 24 11:14:49 EST 2004

On Wed, 24 Nov 2004 12:03:41 GMT, "wolfgang haefelinger" <wh2005 at web.de> wrote:

>Greetings,
>
>I'm trying to read (Japanese) chars from a file. While doing so
>I find that a string of length 2 is returned. Is this to be
>expected or is there something wrong?
>
>Basically, this is what I'm doing:
>
>import codecs
>f = codecs.open("ident.in",'rb','Shift-JIS')   ##  japanese codecs installed
>
>c = f.read(1)
>while c:
>   if len(c)==1:
>      print hex(ord(c)),
>   else:
>      print "{",
>      for x in c: print hex(ord(x)),
>      print "}",
>   c = f.read(1)
>
>This is my input (file is also attached):
>
>$ od -tx1 ident.in
>0000000 8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
>0000013
>
>This is what I'm getting:
>
>$ python ident.py                          ## python 2.3.4 on Windows
>0x5408 0x8a08 0x6642 0x9593 { 0x3b 0xd } 0xa
>
>"Python" believes that there are 6 chars on the stream while there are
>actually 7 chars.
>
>My naive assumption was that f.read(1) always returns a string of length 1
>(or zero).
On my 2.4b1 it does, see below.

>
>Remark:
>The input is believed to be "SJIS" but I haven't found a Python codec for
>this. Therefore I'm using Shift-JIS. Of course this could be the problem.
>Note that when feeding Java with my input using SJIS, the "correct" chars
>are spit out:
>
>  c=21512 c=35336 c=26178 c=38291 c=59 c=13 c=10 : 7 char(s)
>
>References:
>I downloaded Japanese codecs from here (version: 1.4.10)
>  http://www.asahi-net.or.jp/~rd6t-kjym/python/
>
>Thanks for any hints,
>Wolfgang.
I added a print line and dropped the ending commas on your print chunks,
but otherwise didn't (I think ;-) change your code:

 Python 2.4b1 (#56, Nov  3 2004, 01:47:27)
 [GCC 3.2.3 (mingw special 20030504-1)] on win32
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import codecs
 >>> f = codecs.open("ident.in",'rb','Shift-JIS')   ##  japanese codecs installed
 >>> c = f.read(1)
 >>> while c:
 ...    print repr(c), len(c), '=>',
 ...    if len(c)==1:
 ...       print hex(ord(c))
 ...    else:
 ...       print "{",
 ...       for x in c: print hex(ord(x)),
 ...       print "}"
 ...    c = f.read(1)
 ...
 u'\u5408' 1 => 0x5408
 u'\u8a08' 1 => 0x8a08
 u'\u6642' 1 => 0x6642
 u'\u9593' 1 => 0x9593
 u';' 1 => 0x3b
 u'\r' 1 => 0xd
 u'\n' 1 => 0xa

I reproduced your binary file:

 >>> for c in open('ident.in','rb').read(): print ('%02x'% ord(c)),
 ...
 8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
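
As a further cross-check that does not depend on the stream reader at all,
the sketch below feeds the same eleven bytes to an incremental decoder one
byte at a time and counts the characters that come out. It is only a sketch:
it is written for Python 3 (which postdates this thread), it assumes the
built-in shift_jis codec rather than the separately installed Japanese
codecs, and decode_bytewise is just an illustrative name.

```python
import codecs

# The eleven bytes from the od dump of ident.in shown earlier in the thread.
RAW = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")


def decode_bytewise(raw, encoding="shift_jis"):
    """Decode raw one byte at a time and return the list of characters.

    The incremental decoder buffers the lead byte of a two-byte sequence
    and emits nothing until the trail byte arrives, so each call yields
    either an empty string or one complete character.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    chars = []
    for i in range(len(raw)):
        chars.extend(decoder.decode(raw[i:i + 1]))
    # Flush the decoder; this raises if the input ended mid-character.
    chars.extend(decoder.decode(b"", final=True))
    return chars


chars = decode_bytewise(RAW)
print(len(chars), [hex(ord(ch)) for ch in chars])
```

For these bytes this should report 7 characters, with the same code points
Java printed (21512 35336 26178 38291 59 13 10).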

What version/platform are you using? Perhaps you can upgrade?

Regards,
Bengt Richter