[Tutor] Encoding
Dave Angel
davea at ieee.org
Fri Mar 5 20:26:29 CET 2010
Giorgio wrote:
> 2010/3/5 Dave Angel <davea at ieee.org>
>
>> In other words, you don't understand my paragraph above.
>>
>
>
> Maybe. But please don't be angry. I'm here to learn, and as I've run into a
> very difficult concept I want to fully understand it.
>
>
>
I'm not angry, and I'm sorry if I seemed angry. Tone of voice is hard
to convey in a text message.
>> Once the string is stored in t as an 8 bit string, it's irrelevant what the
>> source file encoding was.
>>
>
>
> Ok, you've said this 2 times, but, please, can you tell me why? I think
> that's the key passage to understand how encoding of strings works. The
> source file encoding affects all file lines, also strings.
Nope, not strings. It only affects string literals.
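
To make that concrete, here is a small sketch of what the source encoding decides for a non-unicode literal. It uses escapes and explicit encode() calls instead of a real accented character so that it runs under both Python 2.6+ and Python 3; the names as_utf8 and as_latin1 are just illustrative.

```python
text = u"ciao \xe8 ciao"            # \xe8 is U+00E8, the e-grave

# The bytes a plain "..." literal would hold, depending on how the
# source file containing it was saved:
as_utf8 = text.encode("utf-8")      # file saved as UTF-8
as_latin1 = text.encode("latin-1")  # file saved as Latin-1

print(repr(as_utf8))
print(repr(as_latin1))

# After this point both are just sequences of bytes. Neither one
# carries any record of the encoding that produced it.
print("%d bytes vs %d bytes" % (len(as_utf8), len(as_latin1)))
```

The encoding matters at the moment the literal is turned into bytes, and never again after that.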
> If my encoding is
> UTF8 python will read the string "ciao è ciao" as 'ciao \xc3\xa8 ciao' but
> if it's latin1 it will read 'ciao \xe8 ciao'. So, how can it be irrelevant?
>
> I think the problem is that I can't find any difference between the 2 lines
> quoted above:
>
> s = u"ciao è ciao"
>
> and
>
> t = "ciao è ciao"
> c = unicode(t)
>
> [** I took the liberty of making the variable names different so I can refer to them **]
>
I'm still not sure whether your confusion is about what the rules are, or
about why the rules were made that way. The rules are that an unqualified
conversion, such as the unicode() function with no second argument, uses
the default encoding, in strict mode. Thus the error.
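
A rough sketch of that rule, assuming the default encoding is ASCII. The call t.decode("ascii") stands in for what unicode(t) does in Python 2; it is written this way only so the snippet also runs on Python 3, where unicode() no longer exists.

```python
# 8-bit string, as a UTF-8 source file would have produced it
t = b"ciao \xc3\xa8 ciao"

try:
    c = t.decode("ascii")       # strict mode is the default error handling
except UnicodeDecodeError as err:
    c = None
    print("strict ASCII decoding failed: %s" % err)

# Naming the encoding explicitly removes the guesswork.
s = t.decode("utf-8")
print(repr(s))
```

In Python 2 the last conversion could equally be spelled unicode(t, "utf-8").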
Quoting the help:
"If no optional parameters are given, unicode() will mimic the behaviour
of str() except that it returns Unicode strings instead of 8-bit
strings. More precisely, if /object/ is a Unicode string or subclass it
will return that Unicode string without any additional decoding applied.
For objects which provide a __unicode__() method, it will call
this method without arguments to create a Unicode string. For all other
objects, the 8-bit string version or representation is requested and
then converted to a Unicode string using the codec for the default
encoding in 'strict' mode."
As for why the rules are that, I'd have to ask you what you'd prefer.
The unicode() function has no idea that t was created from a literal
(and no idea what source file that literal was in), so it has to pick
some encoding, called the default encoding. The designers decided to use a
default encoding of ASCII, because manipulating ASCII strings is always
safe, while many functions won't behave as expected when given UTF-8
encoded strings. For example, what's the 7th character of t ? That is
not necessarily the same as the 7th character of s, since one or more of
the characters in between might have taken up multiple bytes in t. That
doesn't happen for your accented character if the source file is
Latin-1, where it is a single byte, but it does in UTF-8, where it takes
two bytes. The same goes for many other European symbols, and certainly
for other languages as well.
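
The "7th character" question can be tried directly. This is only a sketch, assuming t holds the UTF-8 bytes; slices are used instead of plain indexing so the result looks the same on Python 2 and Python 3.

```python
s = u"ciao \xe8 ciao"      # unicode string: one item per character
t = s.encode("utf-8")      # 8-bit string: one item per byte

print("%d characters, %d bytes" % (len(s), len(t)))

seventh_of_s = s[6:7]      # the space after the accented letter
seventh_of_t = t[6:7]      # second byte of the encoded accented letter

print(repr(seventh_of_s))
print(repr(seventh_of_t))
```

The seventh item of t is not a character at all, just half of one.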
>> If you then (whether it's in the next line, or ten thousand calls later)
>> try to convert to unicode without specifying a decoder, it uses the default
>> encoder, which is an application-wide thing, and not a source file thing. To
>> see what it is on your system, use sys.getdefaultencoding().
>>
>>
>
> And this is ok. Spir said that it uses ASCII, you now say that it uses the
> default encoder. I think that ASCII on spir's system is the default encoder
> so.
>
>
>
I don't know, but I think it's the default in every country, at least on
version 2.6. It might make sense to get some value from the OS that
defines the locally preferred encoding, but then a program that worked
fine in one locale might fail miserably in another.
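
To compare the two values on a given machine, something like the following should do. Both functions are in the standard library; what they return depends on the Python version and the OS settings, so no particular output is implied here. On Python 2 the first is normally 'ascii', while on Python 3 it is 'utf-8'.

```python
import locale
import sys

# Application-wide default used by unqualified conversions
default_enc = sys.getdefaultencoding()

# What the OS / user locale says it prefers for text
preferred_enc = locale.getpreferredencoding()

print("interpreter default: %s" % default_enc)
print("locale preference:   %s" % preferred_enc)
```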
>> The point is that there isn't just one global value, and it's a good thing.
>> You should figure everywhere characters come into your program (eg. source
>> files, raw_input, file i/o...) and everywhere characters go out of your
>> program, and deal with each of them individually.
>>
>
>
> Ok. But it always happen this way. I hardly ever have to work with strings
> defined in the file.
>
>
Not sure what you mean by "the file." If you mean the source file,
that's what your examples are about. If you mean a data file, that's
dealt with differently.
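
For a data file, one common approach is to name the encoding at the point where text enters or leaves the program. The sketch below uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3. The file name sample.txt and the choice of UTF-8 are assumptions made for the example.

```python
import io
import os
import tempfile

# Hypothetical data file in a scratch directory
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Output boundary: unicode inside the program, UTF-8 bytes on disk.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"ciao \xe8 ciao")

# Input boundary: state the encoding, get unicode back.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Reading in binary mode shows what is really stored in the file.
with io.open(path, "rb") as f:
    raw = f.read()

print(repr(text))
print(repr(raw))
```

Inside the program only the unicode value is handled; the bytes exist only at the edges.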
>
>> Don't store text internally as 8-bit strings, and you won't create the
>> ambiguity you have with your 't' variable above.
>>
>> DaveA
>>
>>
>
> Thank you Dave
>
> Giorgio
>
>
>
>