[Tutor] Encoding

Sat Mar 6 04:29:21 CET 2010

"Giorgio" <anothernetfellow at gmail.com> wrote in message 
news:23ce85921003050915p1a084c0co73d973282d8fb6ad at mail.gmail.com...
2010/3/5 Dave Angel <davea at ieee.org>
> I think the problem is that i can't find any difference between 2 lines
> quoted above:
>
> a = u"ciao è ciao"
>
> and
>
> a = "ciao è ciao"
> a = unicode(a)

Maybe this will help:

    # coding: utf-8

    a = "ciao è ciao"
    b = u"ciao è ciao".encode('latin-1')

a is a byte string holding UTF-8 bytes, because the source file is saved as 
UTF-8 (which is what the #coding line declares).
b is a byte string holding latin-1 bytes, due to the explicit encoding.

    a = unicode(a)
    b = unicode(b)

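Now what will happen?  Each call raises UnicodeDecodeError.  unicode() 
decodes with 'ascii' if no encoding is specified, because it has no idea of 
the encoding of a or b, and è is not an ASCII character.  Only the programmer 
knows the encoding, so it has to be passed explicitly: unicode(a, 'utf-8') 
and unicode(b, 'latin-1').  unicode() does not use the #coding line to decide.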
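
Here is a rough sketch of the same experiment.  It is written for Python 3 
so it can be run today, which is my assumption; the code above is Python 2. 
bytes plays the role of the Python 2 byte string and bytes.decode() plays the 
role of unicode(x, encoding).  try_decode is a made-up helper for this 
example, not a library function.

```python
# Python 3 stand-in for the Python 2 example above.
text = "ciao \u00e8 ciao"    # ciao è ciao

a = text.encode("utf-8")     # the bytes Python 2 would store from a UTF-8 file
b = text.encode("latin-1")   # same as the explicit .encode('latin-1')


def try_decode(raw, encoding):
    """Return the decoded text, or the name of the error that was raised."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# 'ascii' is what Python 2's unicode() falls back to when no encoding is given.
for name, raw in (("a", a), ("b", b)):
    for encoding in ("ascii", "utf-8", "latin-1"):
        print(name, encoding, repr(try_decode(raw, encoding)))
```

Only the matching codec gives back the original text.  Decoding a as 
latin-1 does not fail, but it silently produces two wrong characters in 
place of è, which is why guessing the encoding is not an option.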

#coding is *only* used to specify the encoding the source file is saved in. 
When Python executes the script, it reads the source, and whenever it parses 
a literal Unicode string (u'...', u"...", etc.) it decodes the bytes read 
from the file using the #coding specified.
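
To see the coding line at work, here is a small sketch, again in Python 3 
(my assumption, where every string literal behaves like a Python 2 u'...' 
literal).  It compiles raw source bytes that carry a coding line and looks at 
the string that comes out.  run_source and the variable name s are made up 
for the example.

```python
def run_source(raw_source):
    """Compile raw source bytes and return the value bound to the name s."""
    namespace = {}
    exec(compile(raw_source, "<demo>", "exec"), namespace)
    return namespace["s"]


# Two-line script: a coding line, then a literal containing è.
script = '# coding: latin-1\ns = "\u00e8"\n'

# File really saved as latin-1: the declaration matches the bytes.
declared_right = run_source(script.encode("latin-1"))

# File saved as UTF-8 but still declared latin-1: the bytes are misread.
declared_wrong = run_source(script.encode("utf-8"))

print(repr(declared_right), len(declared_right))
print(repr(declared_wrong), len(declared_wrong))
```

When the declaration matches how the file was saved, the literal is the 
single character è.  When it does not, the two UTF-8 bytes are decoded as two 
separate latin-1 characters, with no error raised.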

If Python parses a byte string literal ('...', "...", etc.), the bytes read 
from the file are stored directly in the string.  The coding line has no 
effect on the result.  The bytes will be exactly what was saved in the file 
between the quotes.
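
As a quick illustration of "exactly what was saved in the file", this sketch 
shows the bytes an editor would write for the same text in two encodings.  It 
is Python 3 (my assumption), where text.encode() stands in for saving the 
file; the variable names are made up.

```python
text = "ciao \u00e8 ciao"    # ciao è ciao

# What ends up between the quotes on disk, depending on how the file is saved.
saved_as_utf8 = text.encode("utf-8")
saved_as_latin1 = text.encode("latin-1")

print(repr(saved_as_utf8), len(saved_as_utf8))
print(repr(saved_as_latin1), len(saved_as_latin1))
```

The UTF-8 version is one byte longer because è takes two bytes there and one 
byte in latin-1.  A Python 2 byte string literal holds whichever of these 
the file contains.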

-Mark