[Python-Dev] String methods... finally

Fredrik Lundh fredrik at pythonware.com
Wed Jun 16 11:53:23 CEST 1999


> > The \u escape takes up to 4 bytes
> 
> Not in Java:  it requires exactly 4 hex characters after == exactly 2 bytes,
> and it's an error if it's followed by fewer than 4 hex characters.  That's a
> good rule (simple!), while ANSI C's is too clumsy to live with if people
> want to take Unicode seriously.
> 
> So what does it mean for a Unicode escape to appear in a non-L string?

my suggestion is to store it as UTF-8; see the patches
included in the unicode package for details.
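
a minimal sketch of that idea, written for present-day Python 3 rather than the 1.x syntax discussed here.  `expand_u_escapes` is a hypothetical helper (it is not taken from the unicode package); it applies the "exactly 4 hex characters" rule quoted above and stores the result as UTF-8 in an 8-bit string.  as a simplification, it does not treat an escaped backslash before the `u` specially:

```python
import re

# matches a well-formed escape: backslash, 'u', exactly four hex digits
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")
# matches a malformed escape: backslash, 'u', NOT followed by four hex digits
_BAD_U_ESCAPE = re.compile(r"\\u(?![0-9a-fA-F]{4})")


def expand_u_escapes(literal: str) -> bytes:
    """Expand \\uXXXX escapes and return the text as UTF-8 encoded bytes.

    Hypothetical helper, for illustration only.
    """
    if _BAD_U_ESCAPE.search(literal):
        raise ValueError("\\u must be followed by exactly 4 hex digits")
    # replace each escape with the character it names
    expanded = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), literal)
    # store the result in an 8-bit string, using UTF-8
    return expanded.encode("utf-8")


# U+00E9 takes two bytes in UTF-8; plain ASCII is left untouched
assert expand_u_escapes(r"caf\u00e9") == b"caf\xc3\xa9"
assert expand_u_escapes("foo") == b"foo"
```
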

this also means that a u-string literal (L-string, whatever)
could be stored as an 8-bit string internally.  and that the
following two are equivalent:

    string = u"foo"
    string = unicode("foo")

also note that:

    unicode(str(u"whatever")) == u"whatever"
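
the same round trip can be sketched in present-day Python 3, assuming `str` stands in for the unicode string type and `bytes` for the old 8-bit string type.  the helper names `to_8bit` and `to_unicode` are made up for this illustration:

```python
# assumption: Python 3, where str plays the role of the unicode string
# type and bytes plays the role of the old 8-bit string type.

def to_8bit(text: str) -> bytes:
    """Store a unicode string as an 8-bit string, using UTF-8."""
    return text.encode("utf-8")


def to_unicode(data: bytes) -> str:
    """Recover the unicode string from its UTF-8 encoded 8-bit form."""
    return data.decode("utf-8")


original = "whatever \u00e5\u00e4\u00f6"
stored = to_8bit(original)

# converting to the 8-bit form and back gives the same string
assert to_unicode(stored) == original
# ASCII-only text is byte-for-byte identical in its 8-bit form
assert to_8bit("foo") == b"foo"
```
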

...

on the other hand, this means that we have at least four
major "arrays of bytes or characters" thingies mapped on
two data types:

the old string type is used for:

-- plain old 8-bit strings (ascii, iso-latin-1, whatever)
-- byte buffers containing arbitrary data
-- unicode strings stored as 8-bit characters, using
   the UTF-8 encoding.

and the unicode string type is used for:

-- unicode strings stored as 16-bit characters

is this reasonable?
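
one way to see the overloading, sketched in present-day Python 3 with `bytes` standing in for the old string type: the same two bytes are equally valid as ISO Latin 1 text, as UTF-8 text, or as raw data, and nothing in the type says which was meant.  `looks_like_utf8` is a hypothetical helper, and at best a guess about intent:

```python
data = b"\xc3\xa5"  # two bytes; the type does not say what they mean

as_latin1 = data.decode("latin-1")  # two characters: U+00C3, U+00A5
as_utf8 = data.decode("utf-8")      # one character: U+00E5


def looks_like_utf8(buf: bytes) -> bool:
    """Hypothetical check: does this buffer happen to decode as UTF-8?"""
    try:
        buf.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


assert len(as_latin1) == 2 and len(as_utf8) == 1
# arbitrary binary data is often not valid UTF-8 at all
assert not looks_like_utf8(b"\xff\x00")
# but plain ASCII passes, so the check cannot tell the cases apart
assert looks_like_utf8(b"foo")
```
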

...

yet another question is how to deal with source code.
is a python 1.6 source file written in ASCII, ISO Latin 1,
or UTF-8?

speaking from a non-us standpoint, it would be really
cool if you could write Python sources in UTF-8...
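
to illustrate why the choice matters, here is a small sketch in present-day Python 3.  the sample source line and the helper `is_ascii_source` are made up for illustration; the point is only that the three encodings agree for pure ASCII source and disagree as soon as a non-ASCII character appears:

```python
# a one-line "source file" containing one non-ASCII character (U+00F6)
source_text = "city = 'Link\u00f6ping'\n"

as_latin1 = source_text.encode("latin-1")  # U+00F6 becomes one byte
as_utf8 = source_text.encode("utf-8")      # U+00F6 becomes two bytes


def is_ascii_source(text: str) -> bool:
    """Hypothetical check: can this source be written in plain ASCII?"""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# the same source produces different bytes on disk
assert as_latin1 != as_utf8
assert len(as_utf8) == len(as_latin1) + 1
# ASCII-only source is unaffected by the choice of encoding
assert is_ascii_source("x = 1\n")
assert not is_ascii_source(source_text)
```
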

</F>




