davea at ieee.org
Fri Mar 5 17:45:36 CET 2010
>>> Ok, so you confirm that:
>>> s = u"ciao è ciao" will use the file specified encoding, and that
>>> t = "ciao è ciao"
>>> t = unicode(t)
>>> Will use, if not specified in the function, ASCII. It will ignore the
>>> encoding I specified on the top of the file. right?
>> A literal "u" string, and only such a (unicode) literal string, is
>> affected by the encoding specification. Once some bytes have been stored in
>> an 8-bit string, the system does *not* keep track of where they came from,
>> and any conversions then (even if they're on an adjacent line) will use the
>> default decoder. This is a logical example of what somebody said earlier on
>> the thread -- decode any data to unicode as early as possible, and deal only
>> with unicode strings in the program. Then, if necessary, encode them into
>> whatever output form immediately before (or while) outputting them.
> Ok Dave, what I don't understand is why:
> s = u"ciao è ciao" is converting a string to unicode, decoding it from the
> specified encoding, but
> t = "ciao è ciao"
> t = unicode(t)
> which should do exactly the same, instead of using the specified encoding
> always assumes that if I'm not telling the function what the encoding is, I'm
> using ASCII.
> Is this a bug?
In other words, you don't understand my paragraph above. Once the
string is stored in t as an 8-bit string, it's irrelevant what the
source file encoding was. If you then (whether it's in the next line,
or ten thousand calls later) try to convert to unicode without
specifying a decoder, it uses the default encoding, which is an
application-wide thing, and not a source file thing. To see what it is
on your system, use sys.getdefaultencoding().
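A minimal sketch of the difference, written so that it runs on either Python 2 or Python 3. The byte literal stands in for the 8-bit string t (assuming the source file was saved as UTF-8), and the explicit decode("ascii") stands in for what unicode(t) amounts to on Python 2 when the default encoding is 'ascii'. The name ascii_failed is only for illustration.

```python
# -*- coding: utf-8 -*-
import sys

# The bytes a UTF-8 source file would store for the literal "ciao è ciao".
# Nothing in this value records which encoding produced it.
t = b"ciao \xc3\xa8 ciao"

# The application-wide default (commonly 'ascii' on Python 2).
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Naming the decoder explicitly works regardless of the default.
s = t.decode("utf-8")

# Decoding as ASCII fails on the first non-ASCII byte.
try:
    t.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(ascii_failed)
```

So the fix for the 't' case is to name the encoding at the point of conversion, e.g. unicode(t, "utf-8") or t.decode("utf-8"), rather than relying on the default.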
There's an encoding specified or implied for each source file of an
application, and they need not be the same. It affects string literals
that come from that particular file. It does not affect any other
conversions, as far as I know. For that matter, many of those source
files may not even exist any more by the time the application is run.
There are also encodings attached to each file object, I believe, though
I've got no experience with that. So sys.stdout would have an encoding
defined, and any unicode strings passed to it would be converted using
that encoding.
The point is that there isn't just one global value, and it's a good
thing. You should figure out everywhere characters come into your program
(e.g. source files, raw_input, file i/o...) and everywhere characters go
out of your program, and deal with each of them individually. Don't
store anything internally as 8-bit strings, and you won't create the
ambiguity you have with your 't' variable above.
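One way this could look in practice, as a rough sketch rather than the only approach: decode at the point where the data comes in, work only with unicode strings in between, and encode at the point where it goes out. The temporary file, the choice of UTF-8, and the variable names are assumptions made for the example. io.open is used because it takes an encoding argument on both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical input file, created here so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"ciao \xe8 ciao\n")

# Input boundary: io.open decodes the bytes, so we get unicode right away.
with io.open(path, "r", encoding="utf-8") as f:
    line = f.read()

# Inside the program: only unicode strings, so no default encoding is involved.
shouted = line.strip().upper()

# Output boundary: encode explicitly, just before the data leaves.
data = shouted.encode("utf-8")

os.remove(path)
```

Each boundary names its own encoding, so nothing depends on the application-wide default or on the source file's encoding declaration.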