[XML-SIG] Re: Issues with Unicode type

25 Sep 2002 17:16:09 +0200

* Uche Ogbuji
| 
| Right, Jeremy.  I wasn't squinting hard enough at Daniel's example.
| In my own examples, I've been using
| 
| u"\U00010000"
| 
| or
| 
| u"\uD800\uDC00"
| 
| [...]
| 
| I'm not sure whether they would be equivalent if Python is compiled
| for UCS-4 

Python needs to decide what the \uXXXX escape syntax refers to:
UTF-16 code units or Unicode code points. If the former, the first
example should be illegal. If the latter, the second example is highly
dubious (it refers to surrogate code points, which are reserved and
only have meaning as code units in the UTF-16 encoding).
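
To make the distinction concrete, here is a minimal sketch of the
arithmetic that relates a code point above U+FFFF to its two UTF-16
code units. The function names are made up for illustration and are
not part of any Python API.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate code unit into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+10000 corresponds to the code units D800 DC00, as in the two examples.
print([hex(unit) for unit in to_surrogate_pair(0x10000)])
print(hex(from_surrogate_pair(0xD800, 0xDC00)))
```

Read as code units, D800 DC00 is one character; read as code points,
it is two values that are not characters at all.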

I'm not sure whether the second should be outlawed, but it probably
should be. It's a sure way to create problems for yourself, and if
Unicode strings are supposed to contain actual Unicode characters,
those values are not legal.
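
As a rough sketch of the kind of problem I mean, assuming an
interpreter where strings are sequences of code points (Python 3.3 or
later behaves this way; that is an assumption, not something the
examples above depend on), surrogate values in a string can be
detected, and the strict UTF-8 codec refuses them. The helper names
are hypothetical.

```python
def contains_surrogates(text):
    """Return True if any element of text lies in the surrogate range D800..DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def encodes_as_utf8(text):
    """Return True if text can be encoded with the strict UTF-8 codec."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


pair = "\uD800\uDC00"      # two surrogate code points
single = "\U00010000"      # one supplementary code point

print(contains_surrogates(pair), encodes_as_utf8(pair))
print(contains_surrogates(single), encodes_as_utf8(single))
```

Under that assumption the two literals are not interchangeable: the
first is two illegal values, the second is one legal character.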

| (BTW, there is no diff between UTF-32 and UCS-4, is there?). 

UTF-32 is the Unicode name, UCS-4 the ISO 10646 one. The Unicode code
space (which ends at U+10FFFF) used to be more restricted than the
31-bit ISO 10646 one, which ISO was supposed to fix. Not sure whether
that fix has gone through yet, but probably it has. Once it has, there
will be no difference.

| I would imagine Python would blindly create 2 pseudo code points
| D800 and DC00.  I say "pseudo" since, because these values are in
| the surrogate blocks, they are not valid characters in themselves.

Yup.

| Which leads me to believe that even though u"\uD800\uDC00" would be
| treated equivalently to u"\U00010000" as long as Python is compiled
| for UTF-16, it is a *very* bad idea to write unicode literals that
| way.

Yup.
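
If you do end up with strings written that way, one possible repair
is to round-trip them through UTF-16 so that each surrogate pair is
read as the single code point it encodes. This is only a sketch: it
assumes Python 3.3 or later and its "surrogatepass" error handler,
and the function name is invented for the example.

```python
def join_surrogate_pairs(text):
    """Reinterpret surrogate pairs in text as single supplementary code points.

    Encoding with "surrogatepass" writes each surrogate out as a raw
    UTF-16 code unit; decoding then reads adjacent high/low units as a
    pair. An unpaired surrogate makes the decode step raise
    UnicodeDecodeError, which is arguably the right outcome.
    """
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


fixed = join_surrogate_pairs("\uD800\uDC00")
print(len(fixed), hex(ord(fixed)))
```

Text without surrogates passes through unchanged, so the function can
be applied to mixed input.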

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >