Differences between Python 2.2 and Python 2.3

Bengt Richter bokr at oz.net
Fri Dec 5 01:25:05 EST 2003


On 04 Dec 2003 22:10:50 +0100, martin at v.loewis.de (Martin v. Löwis) wrote:

>bokr at oz.net (Bengt Richter) writes:
>
>> >Yes, and no. Yes, characters in the source code have to follow the
>> >source representation. No, they will not wind up utf-8
>> >internally. Instead, (byte) string objects have the same byte
>> >representation that they originally had in the source code.
>> Then they must have encoding info attached?
>
>I don't understand the question. Strings don't have encoding
>information attached. Why do they have to?
Depends on what you mean by "strings" ;-) One way to look at it: they originate
when someone chooses a sequence of glyphs on key caps, and there has to be an
encoding process between that and a str's internal representation, something
like key->scan_code->[code page]->char_code_of_particular_encoding.

If you put a sequence of those in a "string," ISTM the string should be thought of as
having the same encoding as the characters whose ord() codes are stored.

If the second line of a Python source is
# -*- coding: latin-1 -*-

Then a following line
 
    name = 'Martin Löwis'

would presumably bind name to an internally represented string. I guess right now
it becomes a plain byte string of type str, and if the source encoding were ascii,
you would have to write that statement as

    name = 'Martin L\xf6wis'

to get the same internal representation.
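
If I read PEP 263 (and your description above) right, the two spellings should come
out byte-for-byte identical; a quick check, assuming a latin-1-declared source file:

    # -*- coding: latin-1 -*-
    assert 'Martin Löwis' == 'Martin L\xf6wis'   # same byte sequence either way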

But either way, what you wanted to specify was the latin-1 glyph sequence associated
with the number sequence

 >>> map(ord, 'Martin L\xf6wis')
 [77, 97, 114, 116, 105, 110, 32, 76, 246, 119, 105, 115]

through latin-1 character interpretation. You (probably, never say never ;-) didn't
just want to specify a sequence of bytes. You put them there to be interpreted as
latin-1 at some point.
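
For instance, the byte 246 only turns into the intended character once a latin-1
interpretation is applied somewhere; a sketch of a 2.3 session:

 >>> s = 'Martin L\xf6wis'
 >>> s.decode('latin-1')
 u'Martin L\xf6wis'
 >>> import unicodedata
 >>> unicodedata.name(s.decode('latin-1')[8])
 'LATIN SMALL LETTER O WITH DIAERESIS'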

>
>> Isn't that similar to promotion in 123 + 4.56 ? 
>
>It is similar, but not the same. The answer is easy for 123+4.56.
>
>The answer would be more difficult for (4/5)+4.56 if 4/5 was a
>rational number; for 1 < 0.5+0.5j, Python decides that it just cannot
>find a result in a reasonable way. For strings-with-attached encoding,
>the answer would always be difficult.
Why, when Unicode includes them all?

>
>In the face of ambiguity, refuse the temptation to guess.
>
>> We already do that to some extent:
>>  >>> 'abc' + u'def'
>>  u'abcdef'
>
>Yes, and that is only possible because the system encoding is
>ASCII. So regardless of what the actual encoding of the string is,
Um, didn't you say, "Strings don't have encoding information attached.
Why do they have to?"? What's this about ASCII? ;-)

>assuming it is ASCII will give the expected result, as ASCII is a
 ^^^^^^^^ oh, ok, it's just an assumption.

>universal subset of (nearly) all encodings.
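Sure, though the assumption only stretches until a non-ASCII byte shows up;
a sketch of what I'd expect 2.3 to do:

 >>> 'abc' + u'def'
 u'abcdef'
 >>> 'L\xf6wis' + u'def'
 Traceback (most recent call last):
   ...
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)
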
>
>> But there is another question, and that is whether a concrete
>> encoding of characters really just represents characters, or whether
>> the intent is actually to represent a concrete encoding as such
>> (including the info as to which encoding it is).
>
>More interestingly: Do the strings represent characters AT ALL? Some
>strings don't represent characters, but bytes.
Again it comes down to defining terms ISTM.

>
>What is the advantage of having an encoding associated with byte
>strings?
If e.g. name had latin-1 encoding associated with it by virtue of source like
    ...
    # -*- coding: latin-1 -*-
    name = 'Martin Löwis'

then on my cp437 console window, I might expect to see the umlaut
just by writing

    print name	# implicit conversion from associated encoding to output device encoding

instead of having to write

    print name.decode('latin-1').encode('cp437')

or something different on idle, etc.

Instead, there is a you-know-what-you're-doing implicit reinterpret-cast of
the byte string bound to name into whatever-encoding-the-currently-attached-output-device-expects.

Definitely that is necessary functionality, but explicit might be better than implicit.
E.g., one might spell it

    print name.bytes()  # meaning expose binary data byte sequence for name's encoding.
                        # the repr format would be like current str ascii-with-escapes

and

    bytes = name.bytes()

would result in pure 8-bit data bytes with no implied 'ascii' association whatever. (The
7-bit-ascii-with-escapes repr would only be a data print format, with no other implication).

This could be followed by

    s = bytes.associate('latin-1')

to reconstitute the string-of-bytes-with-associated-encoding.
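
A rough sketch of the kind of thing I mean (hypothetical names, nothing that exists
today; associate() is a helper function here, since plain str has no such method):

    class EncodedStr(str):
        """A byte string that remembers which encoding its bytes are in."""
        def __new__(cls, data, encoding):
            self = str.__new__(cls, data)
            self.encoding = encoding
            return self
        def bytes(self):
            # pure 8-bit data bytes, no encoding association implied
            return str(self)

    def associate(data, encoding):
        # reconstitute a string-of-bytes-with-associated-encoding
        return EncodedStr(data, encoding)

    name = EncodedStr('Martin L\xf6wis', 'latin-1')
    raw = name.bytes()                    # plain str, association dropped
    s = associate(raw, 'latin-1')         # association restored
    print unicode(s, s.encoding).encode('cp437')   # umlaut on a cp437 console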

>
>> 1. This is a pure character sequence, use the source representation
>> only to determine what the abstract character entities (ACEs) are,
>> and represent them as necessary to preserve their unified
>> identities.
>
>In that case, you should use Unicode literals: They do precisely that.
Why should I have to do that if I have written # -*- coding: latin-1 -*-
in the second line? Why shouldn't s='blah blah' result in s being internally
stored as a latin-1 glyph sequence instead of an 8-bit code sequence that will
trip up ascii assumptions annoyingly ;-)

>
>> BTW, for convenience, will 8-bit byte encoded strings be repr'd as
>> latin-1 + escapes?
>
>Currently, they are represented as ASCII+escapes. I see no reason to
>change that.
Ok, that's no biggie, but even with your name? ;-)

>
>> Still, they have to express that in the encoding(s) of the program
>> sources, so what will '...' mean? Must it not be normalized to a
>> common internal representation?
>
>At some point in time, '...' will mean the same as u'...'. A Unicode
Interesting. Will u'...' mean Unicode in the abstract, reserving the
choice of utf-16(le|be)/wchar or utf-8 to the implementation?
>object *is* a normalized representation of a character string.
Sure. But it will have different possible encodings when you want to
send it to another system or store it etc.

>
>There should be one-- and preferably only one --obvious way to do it.
>
>... and Unicode strings are the one obvious way to do a normalized
>representation. You should use Unicode literals today wherever
>possible.
>
>> BTW, does import see encoding cookies and do the right thing when
>> there are differing ones?
>
>In a single file? It is an error to have multiple encoding cookies in
>a single file.
I didn't mean that ;-)
>
>In multiple files? Of course, that is the entire purpose: Allow
>different encodings in different modules. If only a single encoding
>was used, there would be no need to declare that.
Yes, that seems obvious, but I had some inkling that if two modules
m1 and m2 had different source encodings, different codes would be
allowed in '...' literals in each, and e.g.,

    import m1,m2
    print 'm1: %r, m2: %r' % (m1.s1, m2.s2)

might have ill-defined meaning, which perhaps could be resolved by strings carrying
encoding info along. Of course, if all '...' wind up equivalent to u'...' then that
pretty much goes away (though I suppose %r might not be a good short cut for getting
a plain quoted string into the output any more).
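
To make the inkling concrete (hypothetical files, and assuming 2.3 keeps the
source bytes for 8-bit literals as you describe):

    # m1.py
    # -*- coding: latin-1 -*-
    s1 = 'Löwis'        # stored as latin-1 bytes

    # m2.py
    # -*- coding: utf-8 -*-
    s2 = 'Löwis'        # stored as utf-8 bytes

 >>> import m1, m2
 >>> map(ord, m1.s1), map(ord, m2.s2)
 ([76, 246, 119, 105, 115], [76, 195, 182, 119, 105, 115])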

But if s = '...' effectively becomes s = u'...', will type('...') => <type 'unicode'>?

What will become of str? Will that still be the default pseudo-ascii-but-really-byte-string
general data container that it is now?
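
For reference, today the two are still distinct types:

 >>> type('abc'), type(u'abc')
 (<type 'str'>, <type 'unicode'>)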

Regards,
Bengt Richter



