Differences between Python 2.2 and Python 2.3

Thu Dec 4 16:10:50 EST 2003

bokr at oz.net (Bengt Richter) writes:

> >Yes, and no. Yes, characters in the source code have to follow the
> >source representation. No, they will not wind up utf-8
> >internally. Instead, (byte) string objects have the same byte
> >representation that they originally had in the source code.
> Then they must have encoding info attached?

I don't understand the question. Strings don't have encoding
information attached. Why would they need it?
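
To illustrate, here is a minimal sketch. It is written for Python 3,
where bytes plays the role of the byte string discussed here; that
substitution and the codec choices are assumptions made only for the
example.

```python
# One byte, as it might have been typed into a source file.
raw = b"\xe4"

# The object stores only the byte value; no codec name travels with it.
stored = list(raw)                   # [228]

# Which character that byte "is" depends entirely on the codec the
# reader chooses afterwards.
as_latin1 = raw.decode("latin-1")    # U+00E4, a with diaeresis
as_cp437 = raw.decode("cp437")       # a different character under cp437

same_character = (as_latin1 == as_cp437)   # False
```

The byte string itself never changes; only the interpretation does.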

> Isn't that similar to promotion in 123 + 4.56 ? 

It is similar, but not the same. The answer is easy for 123+4.56.

The answer would be more difficult for (4/5)+4.56 if 4/5 were a
rational number; for 1 < 0.5+0.5j, Python decides that it just cannot
find a result in a reasonable way. For strings-with-attached encoding,
the answer would always be difficult.

In the face of ambiguity, refuse the temptation to guess.
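
The three numeric cases can be tried out in a short sketch. It assumes
Python 3 and uses fractions.Fraction (a later standard-library module)
as a stand-in for the hypothetical rational 4/5.

```python
from fractions import Fraction

# int + float: the int is promoted and the result is a float.
easy = 123 + 4.56
easy_is_float = isinstance(easy, float)

# rational + float: Fraction settles this by giving up exactness and
# returning a float, which is one possible answer among several.
mixed = Fraction(4, 5) + 4.56
mixed_is_float = isinstance(mixed, float)

# Ordering an int against a complex number has no reasonable result,
# so Python raises TypeError instead of guessing.
try:
    1 < 0.5 + 0.5j
    refused = False
except TypeError:
    refused = True
```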

> We already do that to some extent:
>  >>> 'abc' + u'def'
>  u'abcdef'

Yes, and that is only possible because the system default encoding is
ASCII. So regardless of what the actual encoding of the string is,
assuming it is ASCII will give the expected result, as ASCII is a
common subset of (nearly) all encodings.
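
A rough sketch of the subset argument, assuming Python 3 (where the
decode step must be spelled out explicitly) and an arbitrary selection
of codecs:

```python
ascii_bytes = b"abc"

# Pure-ASCII bytes decode to the same text under many codecs.
codecs_to_try = ["ascii", "latin-1", "utf-8", "cp1252", "koi8-r"]
decoded = {name: ascii_bytes.decode(name) for name in codecs_to_try}
all_agree = len(set(decoded.values())) == 1

# So treating the byte string as ASCII before concatenating is safe.
combined = ascii_bytes.decode("ascii") + "def"

# "Nearly" all: an EBCDIC codec such as cp500 maps these bytes elsewhere.
ebcdic_differs = ascii_bytes.decode("cp500") != "abc"

# A non-ASCII byte is refused by the ASCII codec rather than guessed at.
try:
    b"caf\xe9".decode("ascii")
    non_ascii_accepted = True
except UnicodeDecodeError:
    non_ascii_accepted = False
```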

> But there is another question, and that is whether a concrete
> encoding of characters really just represents characters, or whether
> the intent is actually to represent a concrete encoding as such
> (including the info as to which encoding it is).

More interestingly: Do the strings represent characters AT ALL? Some
strings don't represent characters, but bytes.
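
For instance, a packed binary value is a byte string that was never
meant to be text. A small sketch, assuming Python 3 and an arbitrarily
chosen 16-bit value:

```python
import struct

# A 16-bit unsigned integer packed big-endian: binary data, not text.
payload = struct.pack(">H", 0xFFFE)

# The bytes round-trip as a number without any notion of characters.
(value,) = struct.unpack(">H", payload)

# Forcing a character interpretation on them fails: 0xFF never occurs
# in valid UTF-8.
try:
    payload.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```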

What is the advantage of having an encoding associated with byte
strings?

> 1. This is a pure character sequence, use the source representation
> only to determine what the abstract character entities (ACEs) are,
> and represent them as necessary to preserve their unified
> identities.

In that case, you should use Unicode literals: They do precisely that.

> BTW, for convenience, will 8-bit byte encoded strings be repr'd as
> latin-1 + escapes?

Currently, they are represented as ASCII+escapes. I see no reason to
change that.
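
A quick check of that behaviour, as a sketch in Python 3 (where the
repr additionally carries a b prefix that 2.x did not have):

```python
# "cafe" with an e-acute, encoded as Latin-1: the last byte is 0xE9.
data = "caf\u00e9".encode("latin-1")

# repr shows printable ASCII as-is and everything else as \x escapes.
shown = repr(data)                              # b'caf\xe9'
only_ascii = all(ord(ch) < 128 for ch in shown)
```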

> Still, they have to express that in the encoding(s) of the program
> sources, so what will '...' mean? Must it not be normalized to a
> common internal representation?

At some point in time, '...' will mean the same as u'...'. A Unicode
object *is* a normalized representation of a character string.

There should be one-- and preferably only one --obvious way to do it.

... and Unicode strings are the one obvious way to do a normalized
representation. You should use Unicode literals today wherever
possible.
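
What "normalized" buys you can be sketched as follows, assuming
Python 3 (where str is the Unicode type) and an arbitrary choice of
encodings. "Normalized" is meant here as "independent of the source
encoding", not in the sense of the Unicode normalization forms.

```python
# Three characters: a, o and u with diaeresis.
text = "\u00e4\u00f6\u00fc"

# The concrete byte sequences differ from encoding to encoding ...
encodings = ["latin-1", "utf-8", "utf-16", "cp437"]
encoded = {name: text.encode(name) for name in encodings}
all_bytes_differ = len(set(encoded.values())) == len(encodings)

# ... but decoding each one yields the very same Unicode string.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}
all_text_equal = all(value == text for value in decoded.values())
```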

> BTW, does import see encoding cookies and do the right thing when
> there are differing ones?

In a single file? It is an error to have multiple encoding cookies in
a single file.

In multiple files? Of course, that is the entire purpose: Allow
different encodings in different modules. If only a single encoding
were used, there would be no need to declare it.
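
A sketch of the multiple-file case, assuming Python 3; the module names
are hypothetical and the modules are written to a temporary directory
only for the demonstration:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Two modules holding the same character, each stored in the encoding
# that its own cookie declares. Module names are hypothetical.
sources = {
    "cookie_demo_latin1": b"# -*- coding: iso-8859-1 -*-\ntext = u'\xe4'\n",
    "cookie_demo_utf8": b"# -*- coding: utf-8 -*-\ntext = u'\xc3\xa4'\n",
}

loaded = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, raw in sources.items():
        Path(tmp, name + ".py").write_bytes(raw)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # Each file is decoded according to its own declaration.
        for name in sources:
            loaded[name] = importlib.import_module(name).text
    finally:
        sys.path.remove(tmp)

same_text = loaded["cookie_demo_latin1"] == loaded["cookie_demo_utf8"]
```

Both imports end up with the same one-character Unicode string, even
though the bytes on disk differ.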

Regards,
Martin