Need help on UNICODE conversion

Bernd Preusing b.preusing at web.de
Sun Sep 7 01:22:35 EDT 2003


Erik Max Francis <max at alcyone.com> wrote:

>Bernd Preusing wrote:
>
>> I have a JPG file which contains some comment as unicode.
>> 
>> After reading in the string with s=file.read(70) from file offset 4
>> I get a string which is shown as
>> 'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
>> (using Komodo).
>
>As others have pointed out, this seems to be an unfaithful cut and
>paste; to really tell what it is we'd have to see the actual contents of
>the string.  If it is really Unicode, however, it looks like it might be
>a UTF-16 encoding.  Try 'utf-16' for the encoding name.

Yes, sorry. Cut and paste was not possible, so I typed it in by hand
and introduced some errors; I was very tired and frustrated :-(
I tried to attach a small screenshot, but this is not a binaries
newsgroup...

My first mistake was to cut off only the first 7 bytes; I needed to
strip 8.

The byte array is
0000: 55 4e 49 43 4f 44 45 00 00 4b 00 6f 00 6d 00 6d UNICODE..K.o.m.m
0010: 00 65 00 6e 00 74 00 61 00 72 00 20 00 55 00 6e .e.n.t.a.r. .U.n
0020: 00 69 00 63 00 6f 00 64 00 65 00 20 00 2a 00 e4 .i.c.o.d.e. .*..
0030: 00 f6 00 fc 00 c4 00 d6 00 dc 00 df 00 2a 00 0d .............*..
0040: 00 0a 00 0d 00 0a                               ......

I had to cut off the beginning, which is "UNICODE\x00".
The remainder means "Kommentar Unicode *äöüÄÖÜß*"
(this contains German umlauts at the end)

Now I have a string
ustring = "\x00K\x00o\x00m....."

us2 = unicode(ustring, "utf_16")
yields: UnicodeDecodeError: 'utf16' codec can't decode bytes in
position 48-49: illegal encoding

Strange, because that position is at "00 dc" and not earlier!?
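
One likely explanation, assuming the decode ran on a little-endian
machine: without a byte order mark, the plain UTF-16 codec falls back to
native byte order, so every byte pair is read swapped. Most swapped pairs
still form valid (if meaningless) code units, but 00 dc becomes 0xDC00, a
lone low surrogate, which is the first value the decoder has to reject.
The Python 3 sketch below checks this. Here payload is rebuilt from the
text quoted above rather than read from the real file, and
first_surrogate_offset is a hypothetical helper, not a library function.

```python
import struct

# Payload after the 8-byte "UNICODE\x00" prefix. Assumption: rebuilt from
# the decoded text given in the post, not read from the actual JPG file.
payload = "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n".encode("utf-16-be")

def first_surrogate_offset(data, fmt):
    """Return the byte offset of the first 16-bit unit in the surrogate
    range (0xD800-0xDFFF) when read with struct format fmt, or None."""
    for offset in range(0, len(data) - 1, 2):
        (unit,) = struct.unpack_from(fmt, data, offset)
        if 0xD800 <= unit <= 0xDFFF:
            return offset
    return None

little = first_surrogate_offset(payload, "<H")  # byte-swapped reading
big = first_surrogate_offset(payload, ">H")     # byte order as stored

print(little, big)
```

Read as big-endian, the same data contains no surrogates at all, which
suggests the bytes are fine and only the assumed byte order is wrong.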

Following your tips I stripped off all remaining \x00 and got
"Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n"

I can go on with that string now :-))
But what would have been the "right" way?
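
A possible answer, offered as a sketch rather than the definitive method:
the dump shows the zero byte first in every pair, so the data looks like
big-endian UTF-16 without a byte order mark, and naming that byte order
explicitly should decode it. In the Python 2 of this post that would
presumably be unicode(ustring, "utf_16_be"). The Python 3 version below
uses a hypothetical helper decode_comment and a sample input rebuilt from
the text above (an assumption, not the real file contents).

```python
PREFIX = b"UNICODE\x00"  # 8-byte marker seen at the start of the dump

def decode_comment(raw):
    """Strip the marker and decode the rest as big-endian UTF-16."""
    if not raw.startswith(PREFIX):
        raise ValueError("missing UNICODE marker")
    return raw[len(PREFIX):].decode("utf-16-be")

# Sample input rebuilt from the text in the post (assumption).
raw = PREFIX + "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n".encode("utf-16-be")

text = decode_comment(raw)
print(repr(text))
```

Stripping the \x00 bytes happens to give the same result as
text.encode("latin-1") here, but only because every character in this
comment is below U+0100; any character outside that range would be
corrupted by the stripping approach.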

Thanks again
  Bernd




