[Tutor] Encoding

Dave Angel davea at ieee.org
Fri Mar 5 20:26:29 CET 2010


Giorgio wrote:
> 2010/3/5 Dave Angel <davea at ieee.org>
>   
>> In other words, you don't understand my paragraph above.
>>     
>
>
> Maybe. But please don't be angry. I'm here to learn, and as I've run into a
> very difficult concept I want to fully understand it.
>
>
>   
I'm not angry, and I'm sorry if I seemed angry.  Tone of voice is hard 
to convey in a text message.
>> Once the string is stored in t as an 8 bit string, it's irrelevant what the
>> source file encoding was.
>>     
>
>
> Ok, you've said this 2 times, but, please, can you tell me why? I think
> that's the key passage to understand how encoding of strings works. The
> source file encoding affects all file lines, also strings.
Nope, not strings.  It only affects string literals.
>  If my encoding is
> UTF8 python will read the string "ciao è ciao" as 'ciao \xc3\xa8 ciao' but
> if it's latin1 it will read 'ciao \xe8 ciao'. So, how can it be irrelevant?
>
> I think the problem is that I can't find any difference between the 2 lines
> quoted above:
>
> s = u"ciao è ciao"
>
> and
>
> t = "ciao è ciao"
> c = unicode(t)
>
> [**  I took the liberty of making the variable names different so I can refer to them **]
>   
I'm still not sure whether your confusion is about what the rules are, or 
why the rules were made that way.  The rules are that an unqualified 
conversion, such as the unicode() function with no second argument, uses 
the default encoding, in strict mode.  Thus the error.

Quoting the help: 
"If no optional parameters are given, unicode() will mimic the behaviour 
of str() except that it returns Unicode strings instead of 8-bit 
strings. More precisely, if /object/ is a Unicode string or subclass it 
will return that Unicode string without any additional decoding applied.

For objects which provide a __unicode__() method, it will call 
this method without arguments to create a Unicode string. For all other 
objects, the 8-bit string version or representation is requested and 
then converted to a Unicode string using the codec for the default 
encoding in 'strict' mode.
"
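
A minimal sketch of that rule. It assumes the default encoding is ASCII, 
and it uses an explicit .decode("ascii") to stand in for Python 2's 
unicode(t), so that it runs on both Python 2 and Python 3.  The helper 
try_decode and the variable names are made up for illustration.

```python
# The same text as it would sit in an 8-bit string, depending on the
# encoding of the source file that held the literal.
t_utf8 = b"ciao \xc3\xa8 ciao"
t_latin1 = b"ciao \xe8 ciao"


def try_decode(raw, encoding):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return raw.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


# Stand-in for unicode(t) with an ASCII default encoding: both fail,
# because neither byte string is pure ASCII.
ascii_from_utf8 = try_decode(t_utf8, "ascii")
ascii_from_latin1 = try_decode(t_latin1, "ascii")

# Naming the right codec recovers the same Unicode text from either one.
s_from_utf8 = try_decode(t_utf8, "utf-8")
s_from_latin1 = try_decode(t_latin1, "latin-1")

# Naming the wrong codec does not fail here, it just yields wrong text.
wrong = try_decode(t_utf8, "latin-1")
```

The byte strings themselves carry no record of which encoding produced 
them, which is why the caller has to supply it.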

As for why the rules are that way, I'd have to ask you what you'd prefer.  
The unicode() function has no idea that t was created from a literal 
(and no idea what source file that literal was in), so it has to pick 
some encoding, called the default encoding.  The designers decided to use a 
default encoding of ASCII, because manipulating ASCII strings is always 
safe, while many functions won't behave as expected when given UTF-8 
encoded strings.  For example, what's the 7th character of t ?  That is 
not necessarily the same as the 7th character of s, since one or more of 
the characters in between might have taken up multiple bytes in t.  With 
a latin1 source file your accented character is a single byte, so the 
positions happen to line up, but in UTF-8 it takes two bytes, and the 
same goes for many other European symbols and certainly for other languages.
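
To make the "7th character" point concrete, here is a small sketch.  The 
names t_utf8 and t_latin1 are only for illustration; they stand for what 
t would hold under each source file encoding.  Slices are used instead of 
plain indexing so the code behaves the same on Python 2 and Python 3.

```python
s = u"ciao \xe8 ciao"            # Unicode text: 11 characters
t_utf8 = s.encode("utf-8")       # 8-bit form if the source file was UTF-8
t_latin1 = s.encode("latin-1")   # 8-bit form if the source file was latin1

lengths = (len(s), len(t_utf8), len(t_latin1))

# The 7th item (index 6) of each.
seventh_of_s = s[6:7]             # the space after the accented letter
seventh_of_utf8 = t_utf8[6:7]     # second byte of the two-byte sequence
seventh_of_latin1 = t_latin1[6:7] # a space, since every character is one byte
```

In the UTF-8 case the 8-bit string is one item longer than the text, and 
indexing it lands in the middle of a character.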
>> If you then (whether it's in the next line, or ten thousand calls later)
>> try to convert to unicode without specifying a decoder, it uses the default
>> encoder, which is an application-wide thing, and not a source file thing.  To
>> see what it is on your system, use sys.getdefaultencoding().
>>
>>     
>
> And this is ok. Spir said that it uses ASCII, you now say that it uses the
> default encoder. So I think ASCII is the default encoder on spir's
> system.
>
>
>   
I don't know for certain, but I think ASCII is the default in every 
country, at least in version 2.6.  It might make sense to get some value 
from the OS that defines the locally preferred encoding, but then a 
program that worked fine in one locale might fail miserably in another.
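
If you want to compare the two values on your own machine, something 
like this should do.  sys.getdefaultencoding() is the application-wide 
default discussed here; locale.getpreferredencoding() is my assumption 
for "the value from the OS", and what it returns depends on your system.

```python
import locale
import sys

# The interpreter-wide default used by unqualified conversions.
default_encoding = sys.getdefaultencoding()

# What the operating system / locale settings say the user prefers.
preferred_encoding = locale.getpreferredencoding()

print("default encoding:   %s" % default_encoding)
print("locale preference:  %s" % preferred_encoding)
```

On a stock Python 2.6 the first value is 'ascii' (Python 3 reports 
'utf-8'); the second can differ from one machine to the next.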
>> The point is that there isn't just one global value, and it's a good thing.
>>  You should figure everywhere characters come into  your program (eg. source
>> files, raw_input, file i/o...) and everywhere characters go out of your
>> program, and deal with each of them individually.
>>     
>
>
> Ok. But it always happens this way. I hardly ever have to work with strings
> defined in the file.
>
>   
Not sure what you mean by "the file."  If you mean the source file, 
that's what your examples are about.   If you mean a data file, that's 
dealt with differently.
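
For the data file case, a rough sketch of "decode on the way in, encode 
on the way out" might look like this.  The file names, the helper 
functions and the choice of latin1 input and UTF-8 output are all 
assumptions made up for the example; io.open() is used because it 
accepts an encoding argument on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def read_text(path, encoding):
    """Decode at the boundary: bytes on disk become Unicode in memory."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def write_text(path, text, encoding):
    """Encode at the boundary: Unicode in memory becomes bytes on disk."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


# Hypothetical input: a data file known to be latin1-encoded.
tmpdir = tempfile.mkdtemp()
latin1_path = os.path.join(tmpdir, "in_latin1.txt")
utf8_path = os.path.join(tmpdir, "out_utf8.txt")
with open(latin1_path, "wb") as f:
    f.write(b"ciao \xe8 ciao")

# Inside the program the text is Unicode, with no encoding attached.
text = read_text(latin1_path, "latin-1")

# The output encoding is chosen independently of the input encoding.
write_text(utf8_path, text, "utf-8")

with open(utf8_path, "rb") as f:
    utf8_bytes = f.read()
```

Each file gets its own explicit encoding, and nothing in between depends 
on the default.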
>   
>> Don't store anything internally as 8-bit strings, and you won't create the
>> ambiguity you have with your 't' variable above.
>>
>> DaveA
>>
>>     
>
> Thank you, Dave
>
> Giorgio
>
>
>
>   


