encoding problem

digisatori at gmail.com
Fri Dec 19 17:58:16 CET 2008


On Dec 19, 9:34 PM, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Fri, 19 Dec 2008 04:05:12 -0800, digisat... at gmail.com wrote:
> > The snippet below raises a UnicodeDecodeError.
> > #!/usr/bin/env python
> > #--*-- coding: utf-8 --*--
> > s = 'äöü'
> > u = unicode(s)
>
> > It seems that the system uses the default encoding, ASCII, to decode the
> > UTF-8 encoded string literal, and thus raises the error.
>
> > The question is why the Python interpreter uses the default encoding
> > instead of "utf-8", which I explicitly declared in the source.
>
> Because the declaration is only for decoding unicode literals in that
> very source file.
>
> Ciao,
>         Marc 'BlackJack' Rintsch

Thanks for the answer.
I believe the declaration is not only for unicode literals; it applies to
all literals in the source, and even to comments. We can try running
a source file that has no encoding declaration and contains only one
comment line with non-ASCII characters. That raises a SyntaxError whose
message points me to the PEP 263 URL.
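
A rough way to check this, assuming Python 3 rather than the Python 2
interpreter discussed in this thread: the standard library's
tokenize.detect_encoding reports which encoding the tokenizer would pick
for a source file. Python 3 defaults to UTF-8 instead of ASCII, so the
sketch uses a Latin-1 byte in the comment to provoke the same kind of
failure. The helper name detect_source_encoding is only for illustration.

```python
import io
import tokenize


def detect_source_encoding(source_bytes):
    """Return the encoding the tokenizer picks, or the SyntaxError text."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
        return encoding
    except SyntaxError as exc:
        return "SyntaxError: %s" % exc


# A comment holding a Latin-1 encoded e-acute (byte 0xE9), with and
# without the declaration style used in the original snippet.
declared = b"#--*-- coding: latin-1 --*--\n# caf\xe9\n"
undeclared = b"# caf\xe9\n"

print(detect_source_encoding(declared))
print(detect_source_encoding(undeclared))
```

The first call reports the declared encoding; the second fails even though
the offending byte is only inside a comment.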

I read PEP 263; the relevant part is quoted below:

 Python's tokenizer/compiler combo will need to be updated to work as
 follows:
       1. read the file
       2. decode it into Unicode assuming a fixed per-file encoding
       3. convert it into a UTF-8 byte string
       4. tokenize the UTF-8 content
       5. compile it, creating Unicode objects from the given Unicode data
          and creating string objects from the Unicode literal data
          by first reencoding the UTF-8 data into 8-bit string data
          using the given file encoding

The process described above indicates that step 2 uses the declared
encoding to decode the entire source, including all literals, while
step 5 involves re-encoding byte strings with that same encoding.
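
The steps can be imitated by hand. This is only a toy sketch under my own
assumptions, written for Python 3: simulate_pep263 is a made-up name, and
the "tokenizer" just grabs whatever sits between the first two single
quotes. It is not how the interpreter is actually implemented.

```python
def simulate_pep263(raw_source, file_encoding):
    """Rough stand-in for PEP 263 steps 2-5 for one byte-string literal."""
    # Step 2: decode the whole file with the declared per-file encoding.
    text = raw_source.decode(file_encoding)
    # Step 3: convert the decoded text into a UTF-8 byte string.
    utf8_source = text.encode("utf-8")
    # Step 4 (toy tokenizer): take the bytes between the first two quotes.
    start = utf8_source.index(b"'") + 1
    end = utf8_source.index(b"'", start)
    literal_utf8 = utf8_source[start:end]
    # Step 5: a byte-string literal is re-encoded into the file encoding.
    return literal_utf8.decode("utf-8").encode(file_encoding)


# The assignment from the original snippet, with the umlauts escaped.
line = "s = '\u00e4\u00f6\u00fc'\n"

as_utf8 = simulate_pep263(line.encode("utf-8"), "utf-8")
as_latin1 = simulate_pep263(line.encode("latin-1"), "latin-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

In both cases the literal ends up holding the same bytes that were in the
file, with no encoding attached to it.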

That is why we have to declare an encoding explicitly whenever the
source contains non-ASCII characters.

Bruno's answer explained perfectly why we need to specify an encoding
when decoding a byte string. Thank you very much.
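
For the record, a small sketch of that point, written for Python 3, where
bytes.decode plays the role of Python 2's unicode(s, encoding). The helper
decode_or_error is a hypothetical name, and the bytes are what the literal
from my snippet holds when the file is saved as UTF-8.

```python
def decode_or_error(raw, encoding):
    """Decode raw bytes, or return a short description of the failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "UnicodeDecodeError: %s" % exc.reason


# The UTF-8 bytes of the three umlauts a, o, u from the snippet.
data = b"\xc3\xa4\xc3\xb6\xc3\xbc"

with_ascii = decode_or_error(data, "ascii")     # fails, like unicode(s)
with_utf8 = decode_or_error(data, "utf-8")      # the intended text
with_latin1 = decode_or_error(data, "latin-1")  # no error, but wrong text

print(ascii(with_ascii))
print(ascii(with_utf8))
print(ascii(with_latin1))
```

The same bytes give an error, the right text, or garbage depending on the
codec, so the byte string alone cannot tell Python which one to use.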


