UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

M.-A. Lemburg mal@lemburg.com
Thu, 11 Nov 1999 15:21:50 +0100


Tim Peters wrote:
> 
> [/F, dripping with code]
> > ...
> > Note that the 'u' must be followed by four hexadecimal digits.  If
> > fewer digits are given, the sequence is left in the resulting string
> > exactly as given.
> 
> Yuck -- don't let probable error pass without comment.  "must be" == "must
> be"!

I second that.
 
> [moving backwards]
> > \uxxxx -- Unicode character with hexadecimal value xxxx.  The
> > character is stored using UTF-8 encoding, which means that this
> > sequence can result in up to three encoded characters.
> 
> The code is fine, but I've gotten confused about what the intent is now.
> Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
> literals, but now he's got Unicode-escaped literals instead -- and you favor
> an internal 2-byte-per-char Unicode storage format.  In that combination of
> worlds, is there any use in the *language* (as opposed to in a runtime
> module) for \uxxxx -> UTF-8 conversion?

No, no...  :-) 

I think it was a simple misunderstanding... \uXXXX is only to be
used within u'' strings and then gets expanded to *one* character
encoded in the internal Python format (which is heading towards UTF-16
without surrogates).
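
To make the "one character" point concrete, here is a small sketch. It runs on a present-day Python 3 interpreter, where every string literal behaves like the u'' literals discussed here. That choice of interpreter and the variable names are assumptions for illustration, not part of the proposal.

```python
# Illustrative only: Python 3 string literals stand in for u'' literals.
euro = "\u20ac"                      # EURO SIGN, U+20AC

# The escape yields exactly one character with ordinal 0x20AC.
assert len(euro) == 1
assert ord(euro) == 0x20AC

# Only when explicitly encoded does it become several bytes:
utf16 = euro.encode("utf-16-be")     # two bytes
utf8 = euro.encode("utf-8")          # three bytes
print(len(utf16), len(utf8))
```

The three-byte UTF-8 form only appears on explicit encoding; the literal itself holds a single character.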
 
> And MAL, if you're listening, I'm not clear on what a Unicode-escaped
> literal means.  When you had UTF-8 literals, the meaning of something like
> 
>     u"a\340\341"
> 
> was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
> were just a way of specifying a byte stream.  As a Unicode-escaped string, I
> assume the "a" maps to the Unicode "a", but what of the rest?  Are the octal
> escapes to be taken as two separate Latin-1 characters (in their role as a
> Unicode subset), or as an especially clumsy way to specify a single 16-bit
> Unicode character?  I'm afraid I'd vote for the former.  Same issue wrt \x
> escapes.

Good points.

The conversion goes as follows:
· for single characters (and this includes all \XXX sequences except \uXXXX),
  take the ordinal and interpret it as a Unicode ordinal
· for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
  instead
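
The two rules above can be sketched roughly as follows. This is a present-day Python 3 illustration, not code from this thread. The helper name decode_unicode_literal is hypothetical, limiting \x to two hex digits is an assumption, and other escapes such as \n are deliberately left out. Raising an error on a short \u sequence follows the "must be" == "must be" remark earlier in the message.

```python
import re

# Hypothetical helper for illustration; handles only \uXXXX, \xHH and octal.
_ESCAPE = re.compile(r"\\(u[0-9a-fA-F]{4}|x[0-9a-fA-F]{2}|[0-7]{1,3})")


def decode_unicode_literal(body):
    """Return the list of Unicode ordinals for the body of a u'' literal."""
    ordinals = []
    pos = 0
    while pos < len(body):
        match = _ESCAPE.match(body, pos)
        if match:
            esc = match.group(1)
            if esc[0] in "ux":
                # \uXXXX and \xHH: the hex digits are the ordinal.
                ordinals.append(int(esc[1:], 16))
            else:
                # Octal escape: its value is taken as a Unicode ordinal.
                ordinals.append(int(esc, 8))
            pos = match.end()
        elif body.startswith("\\u", pos):
            # Fewer than four hex digits after \u is reported, not ignored.
            raise ValueError("\\u must be followed by four hex digits")
        else:
            # Ordinary character: its ordinal is used as is.
            ordinals.append(ord(body[pos]))
            pos += 1
    return ordinals


# Tim's example: "a" plus two separate Latin-1 characters.
print(decode_unicode_literal(r"a\340\341"))   # [97, 224, 225]
```

Under this reading the octal escapes in u"a\340\341" give two characters, U+00E0 and U+00E1, rather than one 16-bit character.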
 
> One other issue:  are there "raw" Unicode strings too, as in ur"\u20ac"?
> There probably should be; and while Guido will hate this, a ur string should
> probably *not* leave \uxxxx escapes untouched.  Nasties like this are why
> Java defines \uxxxx expansion as occurring in a preprocessing step.

Not sure whether we really need to make this even more complicated...
The \uXXXX strings look ugly anyway; adding a few \\\\ for e.g. REs or
filenames won't hurt much in the context of those \uXXXX monsters :-)

> BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
> isn't \uxxxx allowed in a non-Unicode string?  that's what I would do ...).

Right. \uXXXX will only be allowed in u'' strings, not in "normal"
strings.

BTW, if you want to type in UTF-8 strings and have them converted
to Unicode, you can use the standard:

u = unicode('...string with UTF-8 encoded characters...','utf-8')
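
For readers on a present-day Python 3 interpreter, where unicode() no longer exists, a roughly equivalent sketch uses the built-in str() with an encoding argument. The sample bytes and variable names are assumptions chosen for illustration.

```python
# UTF-8 encoded input: "caf", U+00E9, a space, then U+20AC.
utf8_bytes = b"caf\xc3\xa9 \xe2\x82\xac"

# Decode the byte string into a Unicode string.
text = str(utf8_bytes, "utf-8")      # same as utf8_bytes.decode("utf-8")

# Nine bytes decode to six characters.
print(len(utf8_bytes), len(text))
```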

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/