[Tutor] Encoding

Dave Angel davea at ieee.org
Fri Mar 5 17:45:36 CET 2010


Giorgio wrote:
>>     
>>> Ok,so you confirm that:
>>>
>>> s = u"ciao è ciao" will use the file specified encoding, and that
>>>
>>> t = "ciao è ciao"
>>> t = unicode(t)
>>>
>>> Will use, if not specified in the function, ASCII. It will ignore the
>>> encoding I specified on the top of the file. right?
>>>
>>>
>>>
>>>       
>> A literal "u" string, and only such a (unicode) literal string, is
>> affected by the encoding specification.  Once some bytes have been stored in
>> an 8-bit string, the system does *not* keep track of where they came from,
>> and any later conversions (even if they're on an adjacent line) will use the
>> default decoder.  This is a logical example of what somebody said earlier on
>> the thread -- decode any data to unicode as early as possible, and deal only
>> with unicode strings in the program.  Then, if necessary, encode them into
>> whatever output form is needed immediately before (or while) outputting them.
>>
>>
>>
>>     
> Ok Dave, what I don't understand is why:
>
> s = u"ciao è ciao" is converting a string to unicode, decoding it from the
> specified encoding, but
>
> t = "ciao è ciao"
> t = unicode(t)
>
> which should do exactly the same, instead of using the specified encoding,
> always assumes that if I'm not telling the function what the encoding is, I'm
> using ASCII.
>
> Is this a bug?
>
> Giorgio
>   
In other words, you don't understand my paragraph above.  Once the
string is stored in t as an 8-bit string, it's irrelevant what the
source file encoding was.  If you then (whether it's in the next line,
or ten thousand calls later) try to convert to unicode without
specifying a decoder, it uses the default encoding, which is an
application-wide thing, and not a source file thing.  To see what it is
on your system, use sys.getdefaultencoding().
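
A minimal sketch of that point, not taken from the original exchange. It
starts from raw bytes and decodes them explicitly so that it runs the same
way on Python 2.6+ and Python 3; the variable names are only illustrative.
On Python 2, unicode(t) with no codec argument behaves roughly like
t.decode(sys.getdefaultencoding()), and that default is normally ASCII
there, while Python 3 reports a different default.

```python
import sys

# The UTF-8 bytes for "ciao è ciao", held as an 8-bit string that has no
# memory of which source file or encoding produced it.
raw = b"ciao \xc3\xa8 ciao"

# The application-wide default used when no codec is named.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Decoding as ASCII fails on the non-ASCII bytes, whatever the source
# file's coding line said.
try:
    raw.decode("ascii")
    ascii_worked = True
except UnicodeDecodeError:
    ascii_worked = False

# Naming the codec explicitly removes the guesswork.
text = raw.decode("utf-8")
print(ascii_worked, len(raw), len(text))
```

The byte string is one element longer than the decoded text because "è"
takes two bytes in UTF-8.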

There's an encoding specified or implied for each source file of an
application, and they need not be the same.  It affects unicode string
literals that come from that particular file.  It does not affect any
other conversions, as far as I know.  For that matter, many of those
source files may not even exist any more by the time the application is run.

There are also encodings attached to each file object, I believe, though 
I've got no experience with that.  So sys.stdout would have an encoding 
defined, and any unicode strings passed to it would be converted using 
that specification.
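
As a rough illustration of the per-stream encoding, the sketch below reads
sys.stdout.encoding and encodes a unicode string by hand with it. The
fallback to ASCII when the attribute is missing or None (as can happen when
output is redirected) and the "replace" error handler are assumptions of
this sketch, not something the interpreter is claimed to do for you.

```python
import sys

text = u"ciao \xe8 ciao"

# sys.stdout.encoding may be None or absent (e.g. piped or replaced
# streams), so fall back to ASCII; the fallback is this sketch's choice.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

# Encode explicitly at the output boundary. "replace" substitutes "?"
# for characters the target encoding cannot represent instead of raising.
encoded = text.encode(stdout_encoding, "replace")
print(stdout_encoding, repr(encoded))
```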

The point is that there isn't just one global value, and that's a good
thing.  You should figure out everywhere characters come into your program
(e.g. source files, raw_input, file i/o...) and everywhere characters go
out of your program, and deal with each of them individually.  Don't
store anything internally as 8-bit strings, and you won't create the
ambiguity you have with your 't' variable above.
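
One way to sketch that discipline for file i/o, assuming the input file is
UTF-8 and the output is wanted in Latin-1 (both encodings, the file names
and the helper functions read_text and write_text are made up for this
example). io.open is used because it takes an encoding argument on both
Python 2.6+ and Python 3.

```python
import io
import os
import shutil
import tempfile


def read_text(path, encoding):
    # Decode at the input boundary: the caller gets unicode, never bytes.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding):
    # Encode at the output boundary, with the encoding stated explicitly.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


tmp_dir = tempfile.mkdtemp()
try:
    src = os.path.join(tmp_dir, "in.txt")
    dst = os.path.join(tmp_dir, "out.txt")

    # Create a UTF-8 input file containing "ciao è ciao".
    with open(src, "wb") as handle:
        handle.write(b"ciao \xc3\xa8 ciao")

    text = read_text(src, "utf-8")            # unicode from here on
    write_text(dst, text.upper(), "latin-1")  # bytes only at the edge

    with open(dst, "rb") as handle:
        out_bytes = handle.read()
finally:
    shutil.rmtree(tmp_dir)

print(repr(out_bytes))
```

Inside the program only the unicode value exists, so there is never a
question of which encoding a given string is "in".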

DaveA

