[Python-Dev] String methods... finally
fredrik at pythonware.com
Wed Jun 16 11:53:23 CEST 1999
> > The \u escape takes up to 4 bytes
> Not in Java: it requires exactly 4 hex characters after == exactly 2 bytes,
> and it's an error if it's followed by fewer than 4 hex characters. That's a
> good rule (simple!), while ANSI C's is too clumsy to live with if people
> want to take Unicode seriously.
> So what does it mean for a Unicode escape to appear in a non-L string?
my suggestion is to store it as UTF-8; see the patches
included in the unicode package for details.
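
as a rough illustration of what "store it as UTF-8" would mean for a \u escape in an 8-bit string, here is a sketch in present-day Python 3. it is not the code from the patches mentioned above; the helper name escape_to_utf8 is made up for this example, and the standard unicode_escape codec stands in for the literal parser.

```python
def escape_to_utf8(escape: bytes) -> bytes:
    """Decode a backslash-u escape, then re-encode the character as UTF-8."""
    # the unicode_escape codec insists on exactly four hex digits after \u
    char = escape.decode("unicode_escape")
    return char.encode("utf-8")


euro = escape_to_utf8(rb"\u20ac")     # EURO SIGN: three bytes in UTF-8
e_acute = escape_to_utf8(rb"\u00e9")  # e with acute accent: two bytes in UTF-8

try:
    escape_to_utf8(rb"\u20a")         # only three hex digits: rejected
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

the point of the sketch is that one 16-bit escape can turn into one, two or three bytes once it lands in an 8-bit string.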
this also means that a u-string literal (L-string, whatever)
could be stored as an 8-bit string internally, and that the
following two are equivalent:
    string = u"foo"
    string = unicode("foo")
also note that:
    unicode(str(u"whatever")) == u"whatever"
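
the round trip can be tried out in present-day Python 3, with the caveat that the types are not the ones proposed here: bytes plays the role of the old 8-bit string, str plays the role of the unicode type, and encode/decode stand in for the str() and unicode() calls. treat it as a sketch of the idea only.

```python
text = "whatever \u00e5\u00e4\u00f6"     # str: a sequence of Unicode code points

utf8_bytes = text.encode("utf-8")        # stands in for str(u"...")
round_trip = utf8_bytes.decode("utf-8")  # stands in for unicode(...)

# pure ASCII text looks the same in both representations
ascii_text = "foo"
ascii_bytes = ascii_text.encode("utf-8")
```

the three non-ASCII characters take two bytes each in UTF-8, so the 8-bit form is three bytes longer than the character count.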
on the other hand, this means that we have at least four
major "arrays of bytes or characters" thingies mapped on
two data types:
the old string type is used for:
  -- plain old 8-bit strings (ascii, iso-latin-1, whatever)
  -- byte buffers containing arbitrary data
  -- unicode strings stored as 8-bit characters, using
     the UTF-8 encoding.
and the unicode string type is used for:
  -- unicode strings stored as 16-bit characters
is this reasonable?
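
one way to see why the overloading might be a concern: an 8-bit string carries no label saying which of the three uses applies, so the same bytes can be read in several ways. a small sketch in present-day Python 3 (bytes standing in for the old string type, which is an assumption of this example):

```python
data = b"\xc3\xa9"  # two bytes, with nothing that says what they mean

as_utf8 = data.decode("utf-8")      # one character: e with acute accent
as_latin1 = data.decode("latin-1")  # two characters: A with tilde, copyright sign
as_raw = list(data)                 # plain byte buffer: the numbers 195 and 169
```

all three readings are valid, and the code has to know from somewhere else which one was intended.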
yet another question is how to deal with source code.
is a python 1.6 source file written in ASCII, ISO Latin 1,
or something else?
speaking from a non-us standpoint, it would be really
cool if you could write Python sources in UTF-8...
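
for what it's worth, here is a sketch of what UTF-8 source looks like in practice, using present-day Python 3, where compile() accepts bytes and assumes UTF-8 when there is no coding declaration. this says nothing about how 1.6 would have handled it; the file name "<utf8-demo>" and the variable names are made up for the example.

```python
# build the source as it would sit on disk: UTF-8 bytes, no coding line.
# the \u escapes below are resolved by *this* script, so source_bytes
# contains the real two-byte UTF-8 sequences, not backslash escapes.
source_bytes = 'greeting = "sm\u00f6rg\u00e5sbord"\n'.encode("utf-8")

namespace = {}
# compile() decodes the bytes as UTF-8 before parsing the string literal
exec(compile(source_bytes, "<utf8-demo>", "exec"), namespace)
greeting = namespace["greeting"]
```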