Unicode program representation

M.-A. Lemburg mal at lemburg.com
Mon Apr 3 09:28:13 EDT 2000

Neil Hodgson wrote:
>    This leads to the question of what the use of the u" form is. The current
> answer is that u" takes the ASCII string and makes a Unicode string object
> by extending each byte with another zero byte. Because it's a Unicode string
> object it behaves appropriately with Unicode aware functions.
>    I think this should be changed to interpreting the literal as a UTF-8
> literal. The advantage here is that non-roman string literals become a
> natural part of the language.

There was some discussion about this when the Unicode integration
was designed late last year. We finally decided to use a fixed
internal default encoding and have the user apply any necessary
conversions explicitly. As you might have guessed,
the default encoding is UTF-8. However, since Python scripts
are intended to be writable in plain 7-bit ASCII, there had
to be a painless form for encoding Unicode in 7-bit ASCII.
This is what the 'unicode-escape' encoding does.

Here's a quote from the Misc/unicode.txt file which currently
is the main documentation source:

Python should provide a built-in constructor for Unicode strings which
is available through __builtins__:

  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

  u = u'<unicode-escape encoded Python string>'

  u = ur'<raw-unicode-escape encoded Python string>'

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

For an explanation of possible values for errors see the Codec section.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A
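
As a rough illustration of the rules above, here is a minimal sketch that
runs the escapes through the codec by hand. It assumes a current Python 3
interpreter, where the codec is spelled 'unicode_escape' and is applied to
a bytes object; the names ascii_source, decoded and code_points are only
for illustration. Note that the trailing \n decodes to a line feed, U+000A.

```python
# Bytes as they would sit in a pure 7-bit ASCII source file:
# a, b, c, then the six characters \u1234, then the two characters \n.
ascii_source = b'abc\\u1234\\n'

# Let the 'unicode_escape' codec interpret the escape sequences.
decoded = ascii_source.decode('unicode_escape')

# List each resulting ordinal in U+XXXX notation.
code_points = ['U+%04X' % ord(ch) for ch in decoded]
print(' '.join(code_points))  # U+0061 U+0062 U+0063 U+1234 U+000A
```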

The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinal
  (e.g. 'b' -> U+0062)
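
The odd/even backslash rule is easier to see in action. The following is a
sketch only, assuming a current Python 3 interpreter: the ur'' literal form
is not available there, so the codec ('raw_unicode_escape') is applied to
raw bytes literals instead. The helper raw_decode is a hypothetical name
used just for this example.

```python
def raw_decode(ascii_bytes):
    """Decode 7-bit ASCII bytes with the raw-unicode-escape rules."""
    return ascii_bytes.decode('raw_unicode_escape')

# One leading backslash (odd): \u1234 becomes the single character U+1234.
one = raw_decode(rb'\u1234')

# Two leading backslashes (even): nothing is interpreted, so the result is
# the seven characters backslash, backslash, u, 1, 2, 3, 4.
two = raw_decode(rb'\\u1234')

# Three leading backslashes (odd): two literal backslashes, then U+1234.
three = raw_decode(rb'\\\u1234')

print(len(one), len(two), len(three))  # 1 7 3
```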

Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1'). If you
only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.

BTW, if you prefer a different "default" encoding, simply
write a wrapper for unicode() which uses the modified default:

def u(text, encoding='latin-1'):
    # Decode an encoded 8-bit string, assuming Latin-1 unless told otherwise.
    return unicode(text, encoding)

You can also have stdout write Latin-1 for Unicode strings,
by wrapping it using the StreamCodecs in codecs.py.
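
A minimal sketch of such a wrapper, assuming a current Python 3 interpreter
and using an in-memory io.BytesIO buffer as a stand-in for the byte stream
underneath stdout (the buffer and the variable names are assumptions made
for this example):

```python
import codecs
import io

# Stand-in for the underlying byte stream of stdout.
raw_stream = io.BytesIO()

# codecs.getwriter() returns the StreamWriter class for the named encoding;
# instantiating it wraps the byte stream.
latin1_writer = codecs.getwriter('latin-1')(raw_stream)

# Unicode text written to the wrapper comes out as Latin-1 bytes.
latin1_writer.write('caf\u00e9\n')

print(raw_stream.getvalue())  # b'caf\xe9\n'
```

For real output, the same wrapper could presumably be placed around
sys.stdout.buffer instead of the BytesIO object.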

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
