unicode issue

Wed Sep 30 18:00:55 EDT 2009

>>>>> Dave Angel <davea at ieee.org> (DA) wrote:
[snip]
>DA> Thanks for the correction. What I meant by "works for me" is that the
>DA> single example in the docstring translated okay. But I do have a lot to
>DA> learn about using Unicode in sources, and I want to learn.

>DA> So tell me, how were we supposed to guess what encoding the original
>DA> message used? I originally had the mailing list message (in Thunderbird
>DA> email). When I copied (copy/paste) to Komodo IDE (text editor), it wouldn't
>DA> let me save because the file type was ASCII. So I randomly chose latin-1
>DA> for file type, and it seemed to like it.

You can see the encoding of the message in its headers. But that is not
important: what matters is the Unicode characters you see. You just copy
and paste them into your Python file. The Python file does not have to
use the same encoding as the message from which you pasted; the editor
will do the proper conversion (unless it throws the characters away
immediately). For the Python file itself you only have to choose an
encoding that can encode all the characters in the file. In this case
utf-8 is the only reasonable choice, but if there are only latin-1
characters in the file then of course latin-1 (iso-8859-1) will also be
good.

Any decent editor will only allow you to save in an encoding that can
encode all the characters in the file, otherwise you will lose some
characters. 
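
If you want to check this yourself rather than rely on the editor, here
is a minimal sketch, assuming Python 3. The helper name can_encode and
the two sample strings are mine, made up for illustration; they are not
taken from the original message.

```python
def can_encode(text, encoding):
    """Return True if every character in text can be stored in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical sample strings, written with escapes so that this snippet
# itself is pure ASCII.
latin_only = "caf\u00e9"       # e-acute exists in latin-1
mixed = "caf\u00e9 \u03c0"     # Greek small pi does not exist in latin-1

for name, sample in [("latin_only", latin_only), ("mixed", mixed)]:
    for enc in ("ascii", "latin-1", "utf-8"):
        print(name, enc, can_encode(sample, enc))
```

Only utf-8 accepts both strings, which is why it is the usual choice
when you do not know in advance which characters will end up in the
file.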

Because Python must also know which encoding you used, and this cannot
in general be deduced from the file contents, you need the coding
declaration. It must name the same encoding as the one in which the file
is saved, otherwise Python will see something different from what you
saw in your editor. Sooner or later this will give you a big headache.
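
The mismatch can be simulated without touching any file. This is only a
sketch, assuming Python 3: the encode and decode calls stand in for "the
editor saves the file" and "Python reads the file according to the
coding declaration", and the sample string is made up.

```python
# A declaration such as "# -*- coding: utf-8 -*-" on line 1 or 2 of a
# source file tells Python how to decode the bytes of that file.

text = "caf\u00e9"   # what you see in the editor (hypothetical sample)

# Case 1: the editor saves as utf-8 but the declaration says latin-1.
saved_as_utf8 = text.encode("utf-8")
misread = saved_as_utf8.decode("latin-1")
print(ascii(misread))    # no error, but two wrong characters replace one

# Case 2: the editor saves as latin-1 but the declaration says utf-8.
saved_as_latin1 = text.encode("latin-1")
try:
    saved_as_latin1.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

The first case is the nastier one because nothing fails: the text is
silently wrong. The second case at least stops with an error.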

>DA> At that point I expected and got errors from Python because I had no coding
>DA> declaration. I used latin-1, and still had problems, though I forget what
>DA> they were. Only when I changed the file encoding type again, to utf-8, did
>DA> the errors go away. I agree that they should agree, but I don't know how to
>DA> reconcile the copy/paste boundary, the file type (without BOM, which is
>DA> another variable), the coding declaration, and the stdout implicit ASCII
>DA> encoding. I understand a bunch of it, but not enough to be able to safely
>DA> walk through the choices.

>DA> Is this all written up in one place, to where an experienced programmer can
>DA> make sense of it? I've nibbled at the edges (even wrote a UTF-8 
>DA> encoder/decoder a dozen years ago).

I don't know of a place. Usually utf-8 is a safe bet, although in some
cases it can be overkill. And then in your Python input/output
(read/write) you may have to use a different encoding if the programs
that you have to communicate with expect something different.
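
For example, if the other program expects latin-1, you name that
encoding when you open the file, whatever the encoding of your source
file is. A minimal sketch, assuming Python 3 (in Python 2 you would use
codecs.open or io.open for this); the file name and the sample string
are made up.

```python
import os
import tempfile

text = "caf\u00e9"   # hypothetical sample, e-acute written as an escape

# Write the text in the encoding the other program is assumed to expect.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="latin-1") as f:
    f.write(text)

# The bytes on disk are latin-1: one byte for the e-acute.
with open(path, "rb") as f:
    raw = f.read()
print(raw)

# Reading back with the same encoding gives the original text again.
with open(path, "r", encoding="latin-1") as f:
    round_trip = f.read()
print(round_trip == text)
```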
-- 
Piet van Oostrum <piet at vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]