[Python-3000] Handling of wide Unicode characters
Alexandre Vassalotti
alexandre at peadrop.com
Sat Jun 2 01:49:01 CEST 2007
Thanks for the explanation. Anyway, it is certainly much simpler to deal
with surrogate pairs than with variable-width characters.
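
To illustrate why surrogate pairs are easy to handle, here is a minimal sketch that counts code points in a sequence of UTF-16 code units. It assumes well-formed UTF-16 and a modern Python 3 interpreter; the helper names utf16_units and count_code_points are made up for this example and are not part of _string_io.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def count_code_points(units):
    """Count code points in well-formed UTF-16 code units (hypothetical helper).

    In well-formed UTF-16 every low (trail) surrogate, 0xDC00-0xDFFF, is the
    second half of a pair, so skipping those counts each pair exactly once.
    """
    return sum(1 for unit in units if not 0xDC00 <= unit <= 0xDFFF)

units = utf16_units("a\U0002f8cd")
print([hex(unit) for unit in units])  # one BMP unit followed by a surrogate pair
print(count_code_points(units))       # two code points from three code units
```

A single range check per unit is enough here, which is the sense in which a fixed two-unit pair is simpler than a fully variable-width encoding.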
On 6/1/07, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Alexandre Vassalotti" <alexandre at peadrop.com> wrote:
> > Hi,
> >
> > I was doing some testing on the new _string_io module, since I was
> > slightly skeptical about my handling of wide Unicode characters (32 bits
> > long, instead of the usual 16 bits in UTF-16). So, I ran this
> > little test:
> >
> > >>> s = _string_io.StringIO()
> > >>> s.write(u'晉')
> > >>> s.tell()
> > 2
> >
> > As I expected, wide Unicode characters count for two. However, I was
> > surprised that Python treats them as two characters as well:
> >
> > >>> len(u'晉')
> > 2
> > >>> u'晉'
> > u'\ud87e\udccd'
> >
> > Is it a bug, or only an implementation choice?
>
> If your Python is compiled as a UTF-16 (narrow) build, then any character
> outside the Basic Multilingual Plane (above U+FFFF) will be seen as two
> characters by Python. If you are using a UCS-4 (wide) build (it's the same
> as UTF-32), then you should be seeing the single wide character as a
> single character. The only
> exception to this rule is if you enter the wide character as a surrogate
> pair, in which case Python doesn't normalize it into the single wide
> character. To get a real wide character, you would need to use a proper
> escape, or decode from an encoded string.
>
>
> - Josiah
>
>
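
The surrogate-pair arithmetic behind the explanation above can be sketched as follows. This is only an illustration, assuming Python 3.4 or later (where strings are no longer tied to a narrow or wide build and the "surrogatepass" error handler is available for the UTF-16 codecs); the function names are hypothetical. It checks that U+2F8CD corresponds to the pair \ud87e\udccd shown in the test, and that a string entered as two surrogate escapes is not merged automatically.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

wide = "\U0002f8cd"    # a proper escape: one code point
pair = "\ud87e\udccd"  # two surrogate escapes: kept as two code points

# Round-tripping through UTF-16 is one way to merge the pair explicitly.
joined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

print([hex(unit) for unit in to_surrogate_pair(0x2F8CD)])
print(len(wide), len(pair), joined == wide)
```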
--
Alexandre Vassalotti