[I18n-sig] Re: [Python-Dev] Unicode debate

Fredrik Lundh <effbot@telia.com>
Tue, 2 May 2000 08:59:03 +0200


Neil Hodgson <nhodgson@bigpond.net.au> wrote:
>    I'm dropping in a bit late in this thread but can the current problem be
> summarised in an example as "how is 'literal' interpreted here"?
>
> s = aUnicodeStringFromSomewhere
> DoSomething(s + "<literal>")

nope.  the whole discussion centers around what happens
if you type:

    # example 1

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    DoSomething(s + u)

and

    # example 2

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    if len(u) + len(s) == len(u + s):
        print "true"
    else:
        print "not true"

in Guido's design, the first example may or may not result in
a "UTF-8 decoding error: unexpected code byte" exception.  the
second example may result in a similar error, print "true", or
print "not true", depending on the contents of the 8-bit string.

(under the counter proposal, the first example will never
raise an exception, and the second will always print "true")
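
to make the difference concrete, here is a minimal sketch of both
behaviours.  it is an illustration only: it uses present-day Python
bytes/str objects to stand in for the 8-bit and unicode strings, and it
assumes the counter proposal amounts to mapping each byte to the code
point with the same value (i.e. Latin-1).  the function names are made
up for this example.

```python
# Sketch only: bytes/str stand in for the 8-bit and Unicode strings of the
# examples above. Neither function is an actual interpreter code path.

def concat_utf8(s: bytes, u: str) -> str:
    """Design under discussion: the 8-bit string is assumed to hold UTF-8."""
    return s.decode("utf-8") + u  # may raise UnicodeDecodeError


def concat_bytewise(s: bytes, u: str) -> str:
    """Assumed counter proposal: each byte becomes the same-valued code point."""
    return s.decode("latin-1") + u  # never fails, one character per byte


u = "abc"
for s in (b"abc", b"\xc3\xa5", b"\xe5"):
    # Example 2 under the UTF-8 interpretation.
    try:
        utf8_result = len(s) + len(u) == len(concat_utf8(s, u))
    except UnicodeDecodeError:
        utf8_result = "decoding error"
    # Example 2 under the bytewise interpretation.
    bytewise_result = len(s) + len(u) == len(concat_bytewise(s, u))
    print(repr(s), utf8_result, bytewise_result)
```

with these three inputs the UTF-8 column shows all three outcomes (true,
not true, error), while the bytewise column is true every time.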

...

the string literal issue is a slightly different problem.

> The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8. I can see some arguments for both sides.

better make that "two options", not "the two options" ;-)

a more flexible scheme would be to borrow the design from XML
(see http://www.w3.org/TR/1998/REC-xml-19980210). for those
who haven't looked closer at XML, it basically treats the source
file as an encoded unicode character stream, and does all
processing on the decoded side.

replace "entity" with "script file" in the following excerpts, and you
get close:

section 2.2:

    A parsed entity contains text, a sequence of characters,
    which may represent markup or character data.

    A character is an atomic unit of text as specified by
    ISO/IEC 10646.

section 4.3.3:

    Each external parsed entity in an XML document may
    use a different encoding for its characters. All XML
    processors must be able to read entities in either
    UTF-8 or UTF-16.

    Entities encoded in UTF-16 must begin with the Byte
    Order Mark /.../ XML processors must be able to use
    this character to differentiate between UTF-8 and
    UTF-16 encoded documents.

    Parsed entities which are stored in an encoding other
    than UTF-8 or UTF-16 must begin with a text declaration
    containing an encoding declaration.

(also see appendix F: Autodetection of Character Encodings)
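
with "entity" replaced by "script file", the rules quoted above could be
sketched roughly as follows.  this is a hedged illustration, not a
design: the "# encoding: name" first-line declaration is a hypothetical
syntax invented for this example, the function names are made up, and
only ASCII-compatible declared encodings are handled.

```python
import codecs
import re

# Hypothetical declaration syntax, invented for this sketch only.
DECLARATION = re.compile(rb"^#\s*encoding:\s*([A-Za-z0-9._-]+)")


def detect_script_encoding(data: bytes) -> str:
    """Guess a script file's encoding along the lines of the XML rules."""
    # UTF-16 files must begin with a byte order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Anything other than UTF-8 or UTF-16 must declare itself up front.
    match = DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii")
    # No byte order mark and no declaration: assume UTF-8.
    return "utf-8"


def read_script(data: bytes) -> str:
    """Decode the whole file first, so later processing sees characters."""
    return data.decode(detect_script_encoding(data))
```

the point of read_script is that everything after it works on decoded
characters, which is the "does all processing on the decoded side" part
of the XML design.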

I propose that we adopt a similar scheme for Python -- but not
in 1.6.  the current "dunno, so we just copy the characters" is
good enough for now...

</F>