read(1) returns string of length 2
Bengt Richter
bokr at oz.net
Wed Nov 24 11:14:49 EST 2004
On Wed, 24 Nov 2004 12:03:41 GMT, "wolfgang haefelinger" <wh2005 at web.de> wrote:
>Greetings,
>
>I'm trying to read (Japanese) chars from a file. While doing so,
>I find that a string of length 2 is returned. Is this to be
>expected or is there something wrong?
>
>Basically, this is what I'm doing:
>
>import codecs
>f = codecs.open("ident.in",'rb','Shift-JIS') ## Japanese codecs installed
>
>c = f.read(1)
>while c:
>    if len(c)==1:
>        print hex(ord(c)),
>    else:
>        print "{",
>        for x in c: print hex(ord(x)),
>        print "}",
>    c = f.read(1)
>
>This is my input (file is also attached):
>
>$ od -tx1 ident.in
>0000000 8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
>0000013
>
>This is what I'm getting:
>
>$ python ident.py ## python 2.3.4 on Windows
>0x5408 0x8a08 0x6642 0x9593 { 0x3b 0xd } 0xa
>
>"Python" believes that there are 6 chars on the stream while there are
>actually 7 chars.
>
>My naive assumption was that f.read(1) always returns a string of length 1
>(or zero).
On my 2.4b1 it does, see below.
>
>Remark:
>The input is believed to be "SJIS" but I haven't found a Python codec for
>this. Therefore I'm using Shift-JIS. Of course this could be the problem.
>Note that when feeding Java with my input using SJIS, the "correct" chars
>are spit out:
>
> c=21512 c=35336 c=26178 c=38291 c=59 c=13 c=10 : 7 char(s)
>
>References:
>I downloaded Japanese codecs from here (version: 1.4.10)
> http://www.asahi-net.or.jp/~rd6t-kjym/python/
>
>Thanks for any hints,
>Wolfgang.
I added a print line and dropped the ending commas on your print chunks,
but otherwise didn't (I think ;-) change your code:
Python 2.4b1 (#56, Nov 3 2004, 01:47:27)
[GCC 3.2.3 (mingw special 20030504-1)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> f = codecs.open("ident.in",'rb','Shift-JIS') ## Japanese codecs installed
>>> c = f.read(1)
>>> while c:
...     print repr(c), len(c), '=>',
...     if len(c)==1:
...         print hex(ord(c))
...     else:
...         print "{",
...         for x in c: print hex(ord(x)),
...         print "}"
...     c = f.read(1)
...
u'\u5408' 1 => 0x5408
u'\u8a08' 1 => 0x8a08
u'\u6642' 1 => 0x6642
u'\u9593' 1 => 0x9593
u';' 1 => 0x3b
u'\r' 1 => 0xd
u'\n' 1 => 0xa
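For readers on Python 3, a rough equivalent of the loop above might look like the sketch below. It assumes the built-in shift_jis codec (no separate Japanese codecs package) and writes the sample bytes to a temporary file standing in for ident.in. The names read_chars, RAW and sample_path are illustrative only and do not come from the original code.

```python
import os
import tempfile

# Bytes taken from the od dump in the original question.
RAW = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")


def read_chars(path, encoding="shift_jis"):
    """Return the characters of a file, read one at a time with read(1)."""
    chars = []
    # newline="" disables newline translation, so "\r" and "\n" stay separate.
    with open(path, "r", encoding=encoding, newline="") as f:
        c = f.read(1)
        while c:
            chars.append(c)
            c = f.read(1)
    return chars


# Write the sample bytes to a temporary file (a stand-in for ident.in).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(RAW)
    sample_path = tmp.name

try:
    chars = read_chars(sample_path)
finally:
    os.remove(sample_path)

# Print each character the same way as the session above.
for c in chars:
    print(repr(c), len(c), "=>", hex(ord(c)))
```

In text mode, read(1) counts characters rather than bytes, so each call should hand back a string of length 1 until the file is exhausted.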
I reproduced your binary file:
>>> for c in open('ident.in','rb').read(): print ('%02x'% ord(c)),
...
8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
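One way to check the byte data independently of any stream reader is to decode the whole byte string at once and count the resulting characters. The following is a minimal sketch assuming Python 3 and its built-in shift_jis codec; decode_all is a hypothetical helper name.

```python
# Bytes taken from the od dump in the original question.
RAW = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")


def decode_all(data, encoding="shift_jis"):
    """Decode a complete byte string and return its code points."""
    text = data.decode(encoding)
    return [ord(ch) for ch in text]


code_points = decode_all(RAW)

# Show the raw bytes, then the decoded code points in decimal.
print(" ".join("%02x" % b for b in RAW))
print(len(RAW), "bytes ->", len(code_points), "characters")
print(" ".join("c=%d" % cp for cp in code_points))
```

Decoding the full buffer avoids any question of how a reader splits multi-byte sequences, and the decimal values can be compared directly against the ones Java reported.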
What version/platform are you using? Perhaps you can upgrade?
Regards,
Bengt Richter