[XML-SIG] Re: Issues with Unicode type
Sjoerd Mullender
sjoerd@acm.org
Tue, 24 Sep 2002 10:19:56 +0200
Nobody seems to have bothered looking at the two characters produced
by u'\u10800'. I'd say: try it:
+ python
Python 2.3a0 (#78, Sep 20 2002, 11:19:50)
[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u"\u10800"
>>> len(c)
2
>>> c
u'\u10800'
>>> c[0]
u'\u1080'
>>> c[1]
u'0'
>>>
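In other words, the \u escape takes exactly the next 4 hex digits and uses
those to create a Unicode character, and what's left over is just
appended as ordinary text. So len(c) == 2 here says nothing about
whether the build is UTF-16 or UCS-4.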
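The same check as a small script, in case anyone wants to rerun it. This is
only a sketch: it assumes a current Python 3 interpreter, where string
literals are Unicode without the u prefix (the session above is Python
2.3a0), and the names c and four_digit are mine, purely for illustration.

```python
# "\u" consumes exactly four hex digits; the trailing "0" is an
# ordinary character that simply follows U+1080.
c = "\u10800"
four_digit = "\u1080" + "0"

print(len(c))                 # number of characters in the literal
print(c == four_digit)        # both spellings give the same string
print(hex(ord(c[0])), c[1])   # first character's code point, then the leftover
```

Since the result is two characters for this reason alone, the literal cannot
be used to probe how the interpreter stores strings internally.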
If you use the \U escape you need to provide 8 hex digits:
>>> c = u'\U00010800'
>>> len(c)
2
>>> c[0]
u'\ud802'
>>> c[1]
u'\udc00'
>>>
And here we see the surrogates appear: U+10800 is stored as a surrogate
pair, so the string is still 2 characters long.
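For reference, the two values above follow from the standard UTF-16
surrogate arithmetic. A minimal sketch, assuming a current Python 3
interpreter (where the same literal would have length 1, so the pair is
computed by hand here); to_surrogates is a hypothetical helper name, not
something from the standard library.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogates(0x10800)
print(hex(high), hex(low))
```

For 0x10800 this gives the same pair as c[0] and c[1] above. As far as I
know, sys.maxunicode answers the build question more directly: 0xFFFF on a
UTF-16 build, 0x10FFFF on a UCS-4 one.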
On Mon, Sep 23 2002 Daniel Veillard wrote:
> On Mon, Sep 23, 2002 at 03:58:11PM -0600, Uche Ogbuji wrote:
> > > > Can you confirm that this is what Red Hat does by default, as mentioned
> > > > by Uche, and do you know the motivations (and possible downsides) for this
> > > > decision?
> > >
> > > By default Red Hat compiles python with unicode support in UTF-16.
> > > I'm not in charge of this, I assume it's the default compilation option.
> >
> > Not from what we found. Jeremy was the one who encountered this, not me, but
> > I'm pretty sure he said he found that starting with RH 7.3, Red Hat started
> > building Python 2.x with UTF-32 and wchar_t support.
>
> Hum, here on 2 recent versions :-)
>
> paphio:~ -> python2.2
> Python 2.2 (#1, Apr 12 2002, 15:29:57)
> [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-109)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> c = u"\u10800"
> >>> len(c)
> 2
> >>>
>
> gnome:~ -> python
> Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
> [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> c = u"\u10800"
> >>> len(c)
> 2
> >>>
>
> Looks like UTF-16 to me!
>
> > > IMHO it's a wrong assumption to think that UTF-16 is a good cut, because
> > > you end up with a variable-length encoding anyway, and UCS-4 would seriously
> > > bloat the app, I'm afraid.
> >
> > Just as a side observation: Guido called this FUD. I'm not so sure.
>
> It's just my opinion, but on the whole I and others in the GNOME and KDE
> projects all went with UTF-8 without any prior coordination; it was just
> natural to us (okay, this also keeps strings 0-terminated, which is crucial).
>
> Daniel
>
> --
> Daniel Veillard | Red Hat Network https://rhn.redhat.com/
> veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
> http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
>
-- Sjoerd Mullender <sjoerd@acm.org>