PEP 263 status check
newsgroups at jhrothjr.com
Fri Aug 6 15:33:25 CEST 2004
"Martin v. Löwis" <martin at v.loewis.de> wrote in message
news:41137799.70808 at v.loewis.de...
> John Roth wrote:
> > Or are you trying to say that the character string will
> > contain the UTF-8 encoding of these characters; that
> > is, if I do a subscript, I will get one character of the
> > multi-byte encoding?
> Michael is almost right: this is what happens. Except that
> what you get, I wouldn't call a "character". Instead, it
> is always a single byte - even if that byte is part of
> a multi-byte character.
> Unfortunately, the things that constitute a byte string
> are also called characters in the literature.
> To be more specific: In a UTF-8 source file, doing
> print "ö" == "\xc3\xb6"
> print "ö"[0] == "\xc3"
> would print "True" twice, and len("ö") is 2.
> OTOH, len(u"ö")==1.
> > The point of this is that I don't think that either behavior
> > is what one would expect. It's also an open invitation
> > for someone to make an unchecked mistake! I think this
> > may be Hallvard's underlying issue in the other thread.
> What would you expect instead? Do you think your expectation
> is implementable?
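Restated in Python 3 terms (an anachronism - this thread is about Python 2.x, where `str` was the byte string type and `u"..."` the Unicode one), the behavior Martin describes looks like this:

```python
# -*- coding: utf-8 -*-
# A sketch of the byte-vs-character distinction in Python 3 terms,
# where str is Unicode text and bytes is the old-style byte string.
s = "ö"                   # one Unicode character
b = s.encode("utf-8")     # its two-byte UTF-8 encoding

assert len(s) == 1        # like len(u"ö") == 1 in Python 2
assert len(b) == 2        # like len("ö") == 2 in a UTF-8 source file
assert b == b"\xc3\xb6"
assert b[0:1] == b"\xc3"  # subscripting yields a single byte
```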
I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ASCII subset, or else defined
with a hex escape.
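A minimal sketch of that rejection rule (the function name and its list-of-positions return value are my invention, not anything in Python): the raw source text of a byte-string literal must be pure 7-bit ASCII, and anything else has to be spelled with escapes.

```python
def check_ascii_literal(literal_text):
    """Return positions of non-ASCII characters in a literal's source text.

    A hypothetical checker for the rule proposed above: raw non-ASCII
    bytes in a byte-string literal are rejected; the same data written
    as hex escapes (e.g. \\xc3) is plain ASCII and passes.
    """
    return [i for i, ch in enumerate(literal_text) if ord(ch) > 127]

check_ascii_literal('abc')        # []  -> accepted
check_ascii_literal('\xf6')       # [0] -> rejected: a raw ö in the source
check_ascii_literal(r'\xc3\xb6')  # []  -> accepted: escapes are ASCII
```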
The reason for this is simply that putting characters
outside of the 7-bit ASCII subset into a byte string
isn't portable. It just pushes the need for a character set
(encoding) declaration down one level of recursion.
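The portability problem is concrete: the same two raw bytes mean different text depending on which encoding declaration happens to be in force.

```python
# The same raw bytes decode to different text under different
# encoding declarations - the portability problem described above.
raw = b"\xc3\xb6"
assert raw.decode("utf-8") == "ö"     # one character under UTF-8
assert raw.decode("latin-1") == "Ã¶"  # two characters under Latin-1
```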
There's already a way of doing this: use a unicode string,
so it's not like we need two ways of doing it.
Now I will grant you that there is a need for representing
the UTF-8 encoding in a byte string, but do we need
to support that in the source text when it's much more
likely to be a programming mistake?
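For what it's worth, this is essentially the rule Python 3 later adopted: a non-ASCII character in a bytes literal is a compile-time error, and the data must be spelled with escapes. A sketch (using `compile()` to demonstrate the rejection):

```python
# Python 3 rejects raw non-ASCII characters in bytes literals at
# compile time; hex escapes remain the way to spell such data.
try:
    compile('b"\u00f6"', "<demo>", "eval")   # source text: b"ö"
except SyntaxError:
    rejected = True                          # raw ö in a bytes literal
else:
    rejected = False

assert rejected
# The escaped spelling is plain ASCII source and compiles fine.
assert eval(compile('b"\\xc3\\xb6"', "<demo>", "eval")) == b"\xc3\xb6"
```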
As far as implementation goes, it should have been done
at the beginning. Prior to 2.3, there was no way of writing
a program using the UTF-8 encoding (I think - I might be
wrong on that), so there were no programs out there that
put non-ASCII characters into byte strings.
Today it's one more forward migration hurdle to jump over.
I don't think it's a particularly large one, but I don't have
any real world data at hand.