getting data with proper encoding to the finish

Mon Mar 14 16:24:16 EST 2005

Ksenia Marasanova wrote:
> Hi,
>
> I have a little problem with encoding. Was hoping maybe anyone can
> help me to solve it.
>
> There is some amount of data in a database (PG) that must be inserted
> into Excel sheet and emailed. Nothing special, everything works.
> Except that non-ascii characters are not displayed properly.
> The data is stored as XML into a text field.

This sentence doesn't make much sense. Explain.

> When I use pgsql it's
> displayed good in the terminal. Now I run my script and print data
> with "print" statement, still goed.

Instead of "print data", do "print repr(data)" and show us what you
get. What *you* see on the screen is not much use for diagnosis; it's
the values of the bytes in the file that matter.
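
To illustrate why repr() is more useful than print here, a minimal sketch follows. It uses Python 3 syntax, and the sample string and the two encodings are assumptions chosen for illustration, not values from your data:

```python
# Hypothetical sample text; not taken from the poster's database.
text = "M\u00fcller"

# The same visible text can be stored as different byte sequences.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")

# repr() of a bytes object shows every non-ASCII byte as a \x escape,
# so the two encodings can be told apart at a glance.
print(repr(as_latin1))  # b'M\xfcller'
print(repr(as_utf8))    # b'M\xc3\xbcller'

# ascii() does the same job for decoded text: it escapes non-ASCII characters.
print(ascii(text))      # 'M\xfcller'
```

Both byte strings may well look identical when printed to a terminal that happens to guess the right encoding, which is why the escaped form is what is needed for diagnosis.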

> Then I use pyXLWriter to write the
> sheet,

Open the spreadsheet with Microsoft Excel, copy-and-paste some data to
a Notepad window, save the Notepad file as Unicode type named (say)
"junk.u16", then at the Python interactive prompt do this:

open("junk.u16", "rb").read().decode("utf16")

and show us what you get.
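
For anyone on a newer Python, here is a rough equivalent of that check as a self-contained sketch. Since the real "junk.u16" comes from Notepad, the script writes a stand-in file first; its content (a single U+00FC) and the temporary directory are assumptions made only so the example can run on its own:

```python
import os
import tempfile

# Stand-in for the file saved from Notepad; the content is a made-up sample.
path = os.path.join(tempfile.mkdtemp(), "junk.u16")
with open(path, "w", encoding="utf-16") as f:
    f.write("\u00fc")

# Read the raw bytes and decode them explicitly.
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-16")

print(repr(raw))       # the bytes on disk, including the byte-order mark
print(ascii(decoded))  # the characters, with non-ASCII shown as escapes
```

The "utf-16" codec uses the byte-order mark at the start of the file to pick the byte order and then drops it, so only the actual characters remain in the decoded text.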

> and Python email package to email it... and the resulting sheet
> is not good:

E-mailed how? To whom? [I.e. what country / what cultural background /
on what machine / what operating system / viewed using what software]

>
> Г is displayed instead of ü (for example)

You are saying (in effect) U+0413 (Cyrillic capital letter GHE) is
displayed instead of U+00FC (Latin small letter U with diaeresis).

OK, we'd already guessed your background from your name :-)

However, what you see isn't necessarily what you've got. How do you
know it's not U+0393 (Greek capital letter GAMMA) or something else
that looks the same? Could even be from a line-drawing set (top left
corner of a box). What you need to do is find out the ordinal of the
character being displayed.
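
A minimal sketch of finding the ordinal, in Python 3 syntax. The three sample characters are the look-alikes mentioned above; picking U+250C as the box corner is an assumption, as there are several box-drawing variants, and describe() is just a hypothetical helper name:

```python
import unicodedata

# Three characters that can look alike on screen.
lookalikes = ["\u0413", "\u0393", "\u250c"]

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

for ch in lookalikes:
    print(describe(ch))
```

Run against a character copied out of the actual spreadsheet, this settles which of the candidates is really there.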

This type of problem arises when a character is written in one encoding
and viewed using another. I've had a quick look through various
likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
encodings that would reproduce anything like your "umlauted-u becomes
gamma-or-similar" problem. Please supply more than one example.
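
This kind of search through encoding pairs can be automated. The following is only a sketch under stated assumptions: find_mismatches() and CANDIDATES are hypothetical names, the 8-bit entries follow the character sets listed above, and adding "utf-8" to the list is my own extension beyond the 8-bit sets. Because a multi-byte encoding can produce more than one displayed character, the check looks only at the first one:

```python
# Candidate encodings. The 8-bit names follow the sets mentioned above;
# "utf-8" is an extra assumption added for this sketch.
CANDIDATES = ["latin-1", "koi8_r", "cp1251", "cp1252",
              "cp437", "cp850", "cp866", "utf-8"]

def find_mismatches(intended, seen, encodings=CANDIDATES):
    """Return (written_as, viewed_as) pairs that turn `intended` into
    text beginning with `seen`."""
    hits = []
    for written_as in encodings:
        try:
            raw = intended.encode(written_as)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the character at all
        for viewed_as in encodings:
            if viewed_as == written_as:
                continue
            # Undecodable bytes become U+FFFD instead of raising an error.
            shown = raw.decode(viewed_as, errors="replace")
            if shown.startswith(seen):
                hits.append((written_as, viewed_as))
    return hits

# The one example supplied so far: U+00FC intended, U+0413 displayed.
for pair in find_mismatches("\u00fc", "\u0413"):
    print(pair)
```

With several examples, the pairs reported for each can be intersected, which is one more reason a single example is not enough.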