Why ascii-only symbols?

Bengt Richter bokr at oz.net
Mon Oct 17 22:36:42 EDT 2005


On Tue, 18 Oct 2005 01:34:09 +0200, "Martin v. Löwis" <martin at v.loewis.de> wrote:

>Bengt Richter wrote:
>> Well, what will be assumed about name after the lines
>> 
>> #-*- coding: latin1 -*-
>> name = 'Martin Löwis' 
>> 
>> ?
>
>Are you asking what is assumed about the identifier 'name', or the value
>bound to that identifier? Currently, the identifier must be encoded in 
>latin1 in this source code, and it must only consist of letters, digits,
>and the underscore.
>
>The value of name will be a string consisting of the bytes
>4d 61 72 74 69 6e 20 4c f6 77 69 73

Which is the latin-1 encoding. OK, so far so good. We know it's latin-1, but that knowledge
is lost to Python.
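(As an aside: in a modern Python 3, where text and bytes are distinct types, those twelve bytes are easy to verify, and the "lost knowledge" point is easy to demonstrate. This is just an illustration, not anything that existed in 2005:)

```python
# The latin-1 encoding of 'Martin Löwis' is exactly the twelve bytes quoted:
data = 'Martin Löwis'.encode('latin-1')
print(data.hex())               # 4d617274696e204cf6776973

# But a plain byte string carries no record of which codec produced it;
# decoding with a different codec silently yields different text:
print(data.decode('cp1251'))    # 'Martin Lцwis', not 'Martin Löwis'
```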

>
>> I know type(name) will be <type 'str'> and in itself contain no encoding information now,
>> but why shouldn't the default assumption for literal-generated strings be what the coding
>> cookie specified?
>
>That certainly is the assumption: string literals must be in the
>encoding specified in the source encoding, in the source code file
>on disk. If they aren't (and cannot be interpreted that way), you
>get a syntax error.
I meant the "literal-generated string" (the internal str instance representation compiled
from the latin-1-encoded source string literal).
>
>> I know the current implementation doesn't keep track of the different
>> encodings that could reasonably be inferred from the source of the strings, 
>> but we are talking about future stuff here ;-)
>
>Ah, so you want the source encoding to be preserved, say as an attribute
>of the string literal. This has been discussed many times, and was
>always rejected.
Not of the string literal per se. That is only one (constant) expression resulting
in a str instance. I want (for the sake of this discussion ;-) the str instance
to have an encoding attribute when it can reliably be inferred, as e.g. when a coding
cookie is specified and the str instance comes from a constant literal string expression.
>

>Some people reject it because it is overkill: if you want reliable,
>stable representation of characters, you should use Unicode strings.
>
>Others reject it because of semantic difficulties: how would such
>strings behave under concatenation, if the encodings are different?
I mentioned that in parts you snipped (2nd half here):
"""
Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that causes decode/encode automatically to create data in the
new encoding. A None encoding, coming or going, would not change the data bytes, but
differing explicit encodings would trigger a decode/encode.

This could also support s1+s2: generate a concatenated string that keeps the same
encoding attribute if s1.encoding==s2.encoding; otherwise, promote each operand to the
platform standard unicode encoding, concatenate those, and record the unicode
encoding chosen in the result's encoding attribute.
"""
>
>> #-*- coding: latin1 -*-
>> name = 'Martin Löwis' 
>> 
>> could be that name.encoding == 'latin-1'
>
>That is not at all intuitive. I would have expected name.encoding
>to be 'latin1'.
That's pretty dead-pan. Not even a smiley ;-)

>
>> Functions that generate strings, such as chr(), could be assumed to create
>> a string with the same encoding as the source code for the chr(...) invocation.
>
>What is the source of the chr invocation? If I do chr(param), should I 
The source file that the "chr(param)" appears in.
>use the source where param was computed, or the source where the call
No, the param is numeric, and has no reasonably inferable encoding. (I don't
propose to have ord pass an encoding on for integers to carry ;-) So ord in another
module with a different source encoding could be the source, and an encoding
conversion could happen with the integer as intermediary. But that's expected ;-)

>to chr occurs? If the latter, how should the interpreter preserve the
>encoding of where the call came from?
Not the latter, so not applicable.
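(As it turned out, Python 3 sidesteps the chr() question entirely: chr() returns a Unicode string, so there is no source encoding to attach. A quick modern illustration:)

```python
# chr() maps a code point directly to a (Unicode) str; the result is
# independent of any source-file encoding.
c = chr(0xF6)
print(c)         # ö
print(ord(c))    # 246
```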
>
>What about the many other sources of byte strings (like strings read 
>from a file, or received via a socket)?
I mentioned that in parts you snipped. See above.

>
>> This is not a fully developed idea, and there has been discussion on the topic before
>> (even between us ;-) but I thought another round might bring out your current thinking
>> on it ;-)
>
>My thinking still is the same. It cannot really work, and it wouldn't do 
>any good with what little it could do. Just use Unicode strings.
>
To hear "It cannot really work" causes me agitation, even if I know it's not worth
the effort to pursue it ;-)

Anyway, ok, I'll leave it at that, but I'm not altogether happy with having to write

    #-*- coding: latin1 -*-
    name = 'Martin Löwis' 
    print name.decode('latin1')

where I think

    #-*- coding: latin1 -*-
    name = 'Martin Löwis' 
    print name

should reasonably produce the same output. Though I grant you

    #-*- coding: latin1 -*-
    name = u'Martin Löwis' 
    print name

is not that hard to do. (Please excuse the use of your name, which has a handy non-ascii letter ;-)
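(Modern Python 3 effectively grants the wish in the second snippet: source files default to UTF-8 and string literals are Unicode, so the plain print does the right thing with no decode step. For illustration:)

```python
# In Python 3 the literal is already a Unicode str, so no .decode() is needed
# and no coding cookie is required for UTF-8 source.
name = 'Martin Löwis'
print(name)    # Martin Löwis
```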

Regards,
Bengt Richter


