differences between python 2.2 and python 2.3

Bengt Richter bokr at oz.net
Sat Dec 6 22:14:25 EST 2003


On 06 Dec 2003 18:20:57 +0100, martin at v.loewis.de (Martin v. Löwis) wrote:

>bokr at oz.net (Bengt Richter) writes:
>
>> >> Why, when unicode includes all?
>> >
>> >Because at the end, you would produce a byte string. Then the question
>> >is what type the byte string should have.
>> Unicode, of course, unless that coercion was not necessary, as in ascii+ascii
>> or latin-1 + latin-1, etc., where the result could retain the more specific
>> encoding attribute.
>
>I meant to write "what *encoding* the byte string should have". Unicode
>is not an encoding.
True (conceptually correct ;-). I was being sloppy, using "unicode" as
a metonym for "any unicode encoding" (which particular one would be an
implementation/optimization concern; the point is to preserve character identity
information from the original string encodings in a combined new encoding).
>
>> Why not assume latin-1, if it's just a convenience assumption for certain
>> contexts? I suspect it would be right more often than not, given that for
>> other cases explicit unicode or decode/encode calls would probably be used.
>
>This was by BDFL pronouncement, and I agree with that decision. I
>personally would have favoured UTF-8 as system encoding in Python, as
>it would support all languages, and would allow for as few mistakes
>as ASCII (e.g. you can't mistake a Latin-1 or KOI-8R string as UTF-8).
>I would consider choosing Latin-1 as euro-centric, and it would
>silently do the wrong thing if the actual encoding was something else.
We can still get silent wrongdoing, I think, but ok, never mind latin-1.
It's probably not a good idea, and a red herring for now anyway.
>
>Errors should never pass silently.
>Unless explicitly silenced.
Ok, I'm happy with that. But let's see where the errors come from.
By definition it's from associating the wrong encoding assumption
with a pure byte sequence. So how can that happen?

1. a pure byte sequence is used in a context that requires
   interpretation as a character sequence, and a wrong default is assumed

2. the program code has a bug and passes explicitly wrong information

We'll ignore #2, but how can a pure byte sequence get to situation #1 ?

1a. Available unambiguous encoding information not matching the
    default assumption was dropped. This is IMO the most likely.
1b. The byte sequence came from an unspecified source and never got explicit encoding info associated.
    This is probably a bug or application design flaw, not a python problem.

IMO a large part of the answer will be not to drop available encoding info.

>
>> >name = u'Martin Löwis'
>> >print name
>> Right, but that is a workaround w.r.t the possibility I am trying to
>> discuss.
>
>The problem is that the possibility is not a possibility. What you
>propose just cannot be implemented in a meaningful way. If you don't
>believe me, please try implementing it yourself, and I'll show you the
>problems of your implementation.
I hope an outline of what I am thinking is becoming visible.

>
>Using Unicode objects to represent characters is not a work-around, it
>is the solution.
>
>> Care to elaborate? I don't know what difficult questions nor
>> non-intuitive behavior you have in mind, but I am probably not the
>> only one who is curious ;-)
>
>As I said: What would be the meaning of concatenating strings, if both
>strings have different encodings?
If the strings have encodings, the semantics are the semantics of character
sequences with possibly heterogeneous representations. The simplest thing would probably
be to choose utf-16le (like windows wchar, UIAM) and normalize all strings that have
encodings to that, but you could postpone that and get more efficient memory use
by introducing a mutable coding property for str instances, so that e.g.,

     u'abc'.encode('latin-1').coding => 'latin-1'

i.e., when you encode a character sequence into a byte sequence according to a certain
encoding, you get a str instance with a .coding attribute that says what the encoding is.
The bytes of the str are just like now. A plain byte str would have .coding == None.
Plain string syntax in a program source would work like

     'abc'.coding => (whatever source encoding is) # not necessarily 'ascii'
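A minimal sketch of how such a .coding attribute might look, written in modern
Python class syntax (CodedStr is an assumed illustrative name, not an actual or
proposed Python type):

```python
# Hypothetical sketch of the proposed .coding attribute: a str-like
# type that remembers which byte encoding it is associated with.

class CodedStr(str):
    """A string carrying an optional .coding attribute."""

    def __new__(cls, data, coding=None):
        self = super().__new__(cls, data)
        # coding is e.g. 'latin-1' or 'ascii'; None marks a pure byte string
        self.coding = coding
        return self

# Encoding a character sequence would tag the result with its codec:
s = CodedStr('abc', coding='latin-1')   # s.coding => 'latin-1'
raw = CodedStr('abc')                   # raw.coding => None (pure bytes)
```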

This leaves the case where you explicitly want an actual pure byte string, with
no encoding. IMO '...' should _not_ produce that, because by far the most common use
is to encode characters for docstrings and printing and displaying of readable characters.
I.e., for character sequences, not data byte sequences. I don't think we should have to convert
all those string literals to u'...' for the sake of the few actual data-byte-string literals.
Instead, the latter could become explicit, e.g., by a string prefix. E.g.,

     a'...'

meaning a byte string represented by ascii+escapes syntax as in current practice, whatever
the program source encoding. I.e., latin-1 non-ascii characters would not be allowed in the
literal _source_ representation even if the program source were encoded in latin-1 (escapes,
of course, would be allowed). This would make such literals fairly portable, IWT. Also,
a'foo'.coding => None, to indicate a pure byte string.

By contrast, a plainly quoted string would get its .coding attribute value from the source encoding,
and any characters permissible in the program source would be carried through, and the string
.coding attribute would be the same as for the program source.

IWT .coding attributes/properties would permit combining character strings with different
encodings by promoting to an encoding that includes all without information loss. User custom
objects presenting string behaviour could also have a .coding attribute to feed into
the system seamlessly.
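A hedged sketch of that promotion rule, treating each tagged string as a
(bytes, coding) pair and assuming UTF-8 as the superset encoding; concat and
its signature are illustrative only, not a concrete proposal:

```python
def concat(a, a_coding, b, b_coding):
    """Concatenate two tagged byte strings, promoting encodings as needed."""
    if a_coding is None or b_coding is None:
        # combining raw byte strings with character strings is an error
        raise TypeError("cannot mix byte strings and coded strings")
    if a_coding == b_coding:
        return a + b, a_coding          # plain byte concatenation
    # Promote: decode each side with its own coding, re-encode in a
    # superset encoding so no character identity information is lost.
    text = a.decode(a_coding) + b.decode(b_coding)
    return text.encode('utf-8'), 'utf-8'

# b'L\xf6wis' is 'Löwis' in latin-1; mixing it with an ascii string
# promotes the result to utf-8 rather than corrupting the \xf6 byte.
data, coding = concat(b'L\xf6wis', 'latin-1', b'!', 'ascii')
```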

Of course you cannot arbitrarily combine byte strings b (b.coding==None)
with character strings s (s.coding!=None).

If "strings" don't have encodings, they are not character sequences, they
are byte vectors. It should probably be a TypeError to pass a byte vector
without associated explicit encoding info where a character sequence is expected,
though for backward compatibility the coding='ascii' assumption will probably have
to be made.

With .coding attributes, print could do what it does with unicode strings now, and encode to
the current output device's coding.
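Sketched as a helper (for_device is an assumed name; it just transcodes a
tagged byte string to the output device's coding before writing):

```python
def for_device(data, coding, device_coding='utf-8'):
    """Re-encode a tagged byte string for the current output device."""
    if coding == device_coding:
        return data                     # already in the device's coding
    return data.decode(coding).encode(device_coding)

# A latin-1 tagged string gets transcoded before it hits the device:
out = for_device(b'L\xf6wis', 'latin-1')   # => b'L\xc3\xb6wis'
```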

>
>I see three possible answers to this question, all non-intuitive:
>1. Choose one of the encodings, and convert the other string to
>   that encoding. This has these problems:
>   a) neither encoding might be capable of representing all characters
>      of the result string. There are several ways to deal with this
>      case; finding them is left as an exercise to the reader.
       Fairly obviously IWT, convert to a unicode encoding.
>   b) it would be incompatible with prior versions, as it would
>      not be a plain byte concatenation.
       IWT it would be plain byte concatenation when encodings were compatible.
       Otherwise plain concatenation would be an error anyway.
>2. Convert the result string to UTF-8. This is incompatible with
>   earlier Python versions.
    Or utf-16xx. I wonder how many mixed-encoding situations there are in earlier code.
    Single-encoding should not require change of encoding, so it should look like plain
    concatenation as far as the byte sequence part is concerned. It might be mostly transparent.

>3. Consider the result as having "no encoding". This would render
>   the entire feature useless, as string data would degrade to
>   "no encoding" very quickly. This, in turn, would leave to "strange"
>   errors, as sometimes, printing a string works fine, but seemingly
>   randomly, it fails.
    I agree that this would be useless ;-)
>
>Also, what would be the encoding of strings returned from file.read(),
>socket.read(), etc.?

    socket_or_file.read().coding => None

unless some encoding was specified in the opening operation.
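Sketched as a wrapper (CodedFile and the (bytes, coding) return shape are
assumptions for illustration, not a real API):

```python
class CodedFile:
    """File whose reads carry the coding given (or not) at open time."""

    def __init__(self, path, coding=None):
        self._f = open(path, 'rb')
        self.coding = coding            # None: an unspecified byte source

    def read(self, n=-1):
        # payload plus its coding tag; callers can then decode explicitly
        return self._f.read(n), self.coding

    def close(self):
        self._f.close()
```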
>
>Also, what would be the encoding of strings created as a result of
>splice operations? What if the splice hits the middle of a multi-byte
>encoding?
That can't happen (unless you convert an encoded string to its unencoded byte sequence,
and somehow get a splice operation to work at that spot, in which case you deserve the result ;-)
Remember, if there is encoding, we are semantically dealing with character sequences,
so splicing has to be implemented in terms of characters, however represented.
It means that e.g., u'%s' % slatin_1 would require promotion of latin-1 encoding to whatever
unicode encoding you are using for u'...'.

But note that if slatin_1 bytes were 'L\xf6wis' and slatin_1.coding were set to 'latin-1' one
way or another, then there wouldn't have to be an implicit 'ascii' assumption to correct by writing

    u'%s' % slatin_1.decode('latin-1')
instead of
    u'%s' % slatin_1

>
>> No, I know that ;-) But I don't know how you are going to migrate towards
>> a more pervasive use of unicode in all the '...' contexts. Whether at
>> some point unicode will be built into cpython as the C representation
>> of all internal strings
>
>Unicode is not a representation of byte strings, so this cannot
>happen.
Sorry, for internal stuff I meant strings as in character strings, not byte strings ;-)

>
>> or it will use unicode through unicode objects
>> and their interfaces, which I imagine would be the way it started.
>
>Yes, all library functions that expect strings should support Unicode
>objects. Ideally, all library functions that return strings should
>return Unicode objects, but this raises backwards compatibility
>issues. For the APIs where this matters much, transition mechanisms
>are in progress.
>
>> Memory-limited implementations might want to make different choices IWG,
>> so the cleaner the python-unicode relationship the freer those choices
>> are likely to be IWT.
>
>I'm not too concerned with memory-limited implementations. It would be
>feasible to re-implement the Unicode type to use UTF-8 as its internal
>representation, but that would be tedious to do on the C level, and it
>would lead to really bad performance, given that slicing and indexing
>become inefficient.
>
>> >>     import m1,m2
>> >>     print 'm1: %r, m2: %r' % (m1.s1, m2.s2)
>> >> 
>> >> might have ill-defined meaning
>> >
>> >That is just one of the problems you run into when associating
>>                                                    ^--not ;-)
>> >encodings with strings. Fortunately, there is no encoding associated
>> >with a byte string.
>> So assume ascii, after having stripped away better knowledge?
>
>No, in current Python, there is no doubt about the semantics: We
>assume *nothing* about the encoding. Instead, if s1 and s2 are <type
 ^^^^^^^^^^^^^^^^-- If that is so, why does str have an encode method?
That's supposed to go from character entities to bytes, I thought ;-)

>'str'>, we treat them as byte strings. This means that bytes 0..31 and
>128..256 are escaped, with special escapes applying to 10, 13, ...,
>and bytes 34 and 39.
>
>> It's fine to have a byte type with no encoding associated. But
>> unfortunately ISTM str instances seem to be playing a dual role as
>> ascii-encoded strings and byte strings. More below.
>
>No. They actually play a dual role as byte strings and somehow-encoded
>strings, depending on the application. In many applications, that
>encoding is the locale's encoding, but in internet applications, you
>often have to handle multiple encodings in a single run of the
>program.
Which is why I thought some_string.coding attributes to carry that
information explicitly would be a good idea.
>
>> How will the following look when s == '...' becomes effectively s =
>> u'...' per above?
>
>I don't know. Because this question is difficult to answer, that
>change cannot be made in the near future. It might be reasonable to
>have str() return Unicode objects - with another builtin to generate
>byte strings.
I'm not sure what that other builtin would do. Call on objects' __bytes__
methods? IWT most of the current str machinery could still operate on
byte strings, just by setting astring.coding = None to differentiate them
from actual character-oriented processing.

>
>> BTW, is '...' =(effectively)= u'...' slated for a particular future
>> python version?
>
>No. Try running your favourite application with -U, and see what
>happens. For Python 2.3, I managed to get python -U to at least enter
>interactive mode - in 2.2, importing site.py will fail, trying
>to put Unicode objects on sys.path.
Haven't tried that yet...

Regards,
Bengt Richter



