python-unicode doesn't support >65535 symbols?

Thu Nov 27 07:24:41 EST 2003

gabor <gabor at z10n.net> writes:

> i played around with iconv and so on,
> so at the end i created a utf8 encoded text file,
> with the text "Marrakesh",
> where the second 'a' was replaced with
> GOTHIC_LETTER_AHSA (unicode-value:0x10330).
> 
> (i simply wrote the text file "Marrakesh", used iconv to convert it to
> utf32big-endian, and replaced the character in hexedit, then converted
> with iconv back to utf8).
> 
> now i started python:
> 
> >>> data = open("utf8.txt").read()
> >>> data
> 'Marr\xf0\x90\x8c\xb0kesh'
> >>> text = data.decode("utf8")
> >>> text
> u'Marr\U00010330kesh'
> 
> so far it seemed ok.
> then i did:
> 
> >>> len(text)
> 10
> 
> this is wrong. the length should be 9.

I suspect you have a "narrow unicode" build of Python, which stores
unicode strings as 16-bit code units.  You can make yourself a "wide
unicode" build easily enough by configuring with --enable-unicode=ucs4.
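
A quick way to check which kind of build you have is sys.maxunicode:
it is 0xFFFF on a narrow build and 0x10FFFF on a wide one.  The sketch
below is written so that it also runs on newer Pythons, where every
build behaves like a wide one, so it counts UTF-16 code units by hand
to show what len() reports on a narrow build.  The helper name
utf16_units is made up for illustration.

```python
import sys

# 0xFFFF on a narrow build, 0x10FFFF on a wide build.
is_narrow = sys.maxunicode == 0xFFFF

def utf16_units(s):
    """Count the 16-bit code units needed to store s as UTF-16."""
    # Each UTF-16 code unit is exactly two bytes; "-be" avoids a BOM.
    return len(s.encode("utf-16-be")) // 2

text = u"Marr\U00010330kesh"
print(is_narrow, len(text), utf16_units(text))
```

On a wide build len(text) is 9, while utf16_units(text) is 10, which
is the number a narrow build gives for len(text).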

> and why?
> 
> >>> text[0]
> u'M'
> >>> text[1]
> u'a'
> >>> text[2]
> u'r'
> >>> text[3]
> u'r'
> >>> text[4]
> u'\ud800'
> >>> text[5]
> u'\udf30'
> >>> text[6]
> u'k'
> >>>
> 
> so text[4] (which should be \U00010330),
> was split into 2 16bit values (text[4] and text[5]).
> 
> i don't understand.
> if the representation of 'text' is correct, why is the length wrong?

I expect that this has to do with surrogates or some other unicode
thing that's beyond my understanding...
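
For what it's worth, the surrogate arithmetic is short.  As a rough
sketch (the function names are hypothetical, not part of any Python
API): subtract 0x10000 from the code point, add the top 10 bits of the
result to 0xD800 and the bottom 10 bits to 0xDC00.

```python
def to_surrogates(cp):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10330)
print(hex(high), hex(low))
```

For 0x10330 this gives 0xD800 and 0xDF30, the same two values that
show up at text[4] and text[5] above.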

Cheers,
mwh

-- 
  It's actually a corruption of "starling".  They used to be carried.
  Since they weighed a full pound (hence the name), they had to be
  carried by two starlings in tandem, with a line between them.
                 -- Alan J Rosenthal explains "Pounds Sterling" on asr