[Python-Dev] Encoding of 8-bit strings and Python source code

Fredrik Lundh <effbot@telia.com>
Wed, 26 Apr 2000 00:05:54 +0200


M.-A. Lemburg wrote:
> > and as I've pointed out a zillion times, Python 1.6a2 doesn't.
>
> Just a side note: we never discussed turning the native
> 8-bit strings into any encoding aware type.

hey, you just argued that we should use UTF-8 because Tcl and
Perl use it, didn't you?

my point is that they don't use it the way Python 1.6a2 uses it,
and that their design is correct, while our design is slightly broken.

so let's fix it !

> Why not name the beast ?! In your proposal, the old 8-bit
> strings simply use Latin-1 as native encoding.

in my proposal, there's an important distinction between character
sets and character encodings.  unicode is a character set.  latin 1
is one of many possible encodings of (portions of) that set.

maybe it's easier to grok if we get rid of the term "character set"?

http://www.hut.fi/u/jkorpela/chars.html suggests the following
replacements:

character repertoire

    A set of distinct characters.

character code

    A mapping, often presented in tabular form, which defines
    one-to-one correspondence between characters in a character
    repertoire and a set of nonnegative integers.

character encoding

    A method (algorithm) for presenting characters in digital form
    by mapping sequences of code numbers of characters into
    sequences of octets.
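
to make the three terms concrete, here's a minimal sketch in
present-day Python 3 (an assumption; it is not the 1.6a2 interpreter
discussed in this thread, and the variable names are made up for
illustration):

```python
# three characters from the unicode repertoire: a, e-acute, euro sign
text = "a\u00e9\u20ac"

# character code: each character maps to exactly one nonnegative integer
codes = [ord(ch) for ch in text]

# character encoding: the same code numbers can be mapped to octets
# in more than one way
utf8_octets = text.encode("utf-8")
utf16_octets = text.encode("utf-16-be")

# latin 1 only covers codes 0..255, so it encodes a portion of the set
latin1_octets = text[:2].encode("latin-1")

print(codes)
print(utf8_octets, utf16_octets, latin1_octets)
```

the codes stay the same whichever encoding is picked; only the octets
change.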

now, in my proposal, the *repertoire* contains all characters
described by the unicode standard.  the *codes* are defined
by the same standard.

but strings are sequences of characters, not sequences of
octets:

    strings have *no* encoding.

(the encoding used for the internal string storage is an
implementation detail).

(but sure, given the current implementation, the internal storage
for an 8-bit string happens to use Latin-1.  just as the internal
storage for a 16-bit string happens to use UCS-2 stored in
native byte order.  but from the outside, they're just character
sequences).
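
a rough illustration of "storage is an implementation detail", again
in present-day Python 3 rather than 1.6a2.  UTF-16 stands in for
UCS-2 here (the two agree for the characters used), and all names are
hypothetical:

```python
import sys

text = "caf\u00e9"  # four characters

# "8-bit" storage: one octet per character, Latin-1
stored_8bit = text.encode("latin-1")

# "16-bit" storage: two octets per character, native byte order
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
stored_16bit = text.encode(native)

# the storage differs, but from the outside both are the same
# sequence of four characters
same_characters = stored_8bit.decode("latin-1") == stored_16bit.decode(native)
print(len(stored_8bit), len(stored_16bit), same_characters)
```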

> The current version doesn't make any encoding assumption as
> long as the 8-bit strings do not get auto-converted. In that case
> they are interpreted as UTF-8 -- which will (usually) fail
> for Latin-1 encoded strings using the 8th bit, but hey, at least
> you get an error message telling you what is going wrong.

sure, but I don't think you get the right message, or that you
get it at the right time.  consider this:

if you're going from 8-bit strings to unicode using implicit
conversion, the current design can give you:

    "UnicodeError: UTF-8 decoding error: unexpected code byte"

if you go from unicode to 8-bit strings, you'll never get an error.

however, the result is not always a string -- if the unicode string
happened to contain any characters larger than 127, the result
is a binary buffer containing encoded data.  you cannot use string
methods on it, you cannot use regular expressions on it.  indexing
and slicing won't work.

    unlike earlier versions of Python, and unlike unicode-aware
    versions of Tcl and Perl, the fundamental assumption that
    a string is a sequence of characters no longer holds.
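
both directions can be imitated with explicit calls in present-day
Python 3.  this is only a sketch of the behaviour described above, not
the 1.6a2 implementation; the exception type and message differ from
the ones quoted, and the variable names are invented:

```python
# 8-bit -> unicode, treating Latin-1 octets as if they were UTF-8
latin1_octets = b"caf\xe9"
try:
    latin1_octets.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# unicode -> 8-bit via UTF-8 never fails, but the result is encoded
# data, not a character sequence
text = "h\u00e9llo"
encoded = text.encode("utf-8")
char_count = len(text)
octet_count = len(encoded)

# item 1 of the encoded data is only half of a character
second_item = encoded[1:2]

# slicing can cut straight through an encoded character
try:
    encoded[:2].decode("utf-8")
    slice_is_text = True
except UnicodeDecodeError:
    slice_is_text = False

print(decode_failed, char_count, octet_count, second_item, slice_is_text)
```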

in my proposal, going from 8-bit strings to unicode always works.
a character is a character, no matter what string type you're using.

however, going from unicode to an 8-bit string may give you an
OverflowError, say:

    "OverflowError: unicode character too large to fit in a byte"

the important thing here is that if you don't get an exception, the
result is *always* a string.  string methods always work.  etc.

    [8. Special cases aren't special enough to break the rules.]
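
the proposed rules could be sketched like this in present-day
Python 3.  to_unicode and to_8bit are hypothetical helpers written for
this illustration, not functions from any Python release, and bytes
stands in for the 8-bit string type:

```python
def to_unicode(octets: bytes) -> str:
    """8-bit -> unicode: every octet value is a valid character code."""
    return octets.decode("latin-1")


def to_8bit(text: str) -> bytes:
    """unicode -> 8-bit: fails loudly if any character code is above 255."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        raise OverflowError(
            "unicode character too large to fit in a byte"
        ) from None


# always works, for all 256 octet values
all_chars = to_unicode(bytes(range(256)))

# works, and the result has one item per character
small = to_8bit("caf\u00e9")

# the euro sign (code 8364) doesn't fit
try:
    to_8bit("\u20ac")
    overflowed = False
except OverflowError:
    overflowed = True

print(len(all_chars), small, overflowed)
```

either you get an exception, or you get something with exactly one
item per character.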

> The key to these problems is using explicit conversions where
> 8-bit strings meet Unicode objects.

yeah, but the flaw in the current design is the implicit conversions,
not the explicit ones.

    [2. Explicit is better than implicit.]

(of course, the 8-bit string type also needs an "encode" method
under my proposal, but that's just a detail ;-)

> Some more ideas along the convenience path:
>
> Perhaps changing just the way 8-bit strings are coerced
> to Unicode would help: strings would then be interpreted
> as Latin-1.

ok.

> str(Unicode) and "t" would still return UTF-8 to assure
> lossless conversion.

maybe.

or maybe str(Unicode) should return a unicode string?

think about it!

(after all, I'm pretty sure that ord() and chr() should do the right
thing, also for character codes above 127)

> Another way to tackle this would be to first try UTF-8
> conversion during auto-conversion and then fallback to
> Latin-1 in case it fails. Has anyone tried this ? Guido
> mentioned that TCL does something along these lines...

haven't found any traces of that in the source code.  hmm, you're
right -- it looks like it attempts to "fix" invalid UTF-8 data (on a
character by character basis), instead of choking on it.  scary.

    [12. In the face of ambiguity, refuse the temptation to guess.]
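
to see why guessing is risky, here's a small sketch of the "try UTF-8
first, then fall back to Latin-1" idea in present-day Python 3.
guess_decode is a hypothetical helper, not something taken from the
Tcl or Python sources:

```python
def guess_decode(octets: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1 if that fails."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets.decode("latin-1")


# invalid as UTF-8, so the fallback kicks in and the result is right
fallback = guess_decode(b"caf\xe9")

# two Latin-1 characters (A-tilde, copyright sign) whose octets also
# happen to be valid UTF-8 for a single e-acute
two_chars = "\u00c3\u00a9"
guessed = guess_decode(two_chars.encode("latin-1"))

print(fallback == "caf\u00e9", len(two_chars), len(guessed))
```

the second call returns one character where two went in, and nothing
signals that the guess was wrong.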

more tomorrow.

</F>