getting data with proper encoding to the finish
John Machin
sjmachin at lexicon.net
Mon Mar 14 16:24:16 EST 2005
Ksenia Marasanova wrote:
> Hi,
>
> I have a little problem with encoding. I was hoping someone could
> help me solve it.
>
> There is some data in a database (PG) that must be inserted
> into an Excel sheet and emailed. Nothing special, everything works.
> Except that non-ascii characters are not displayed properly.
> The data is stored as XML into a text field.
This sentence doesn't make much sense. Explain.
> When I use pgsql it's
> displayed correctly in the terminal. When I run my script and print the
> data with the "print" statement, it is still good.
Instead of "print data", do "print repr(data)" and show us what you
get. What *you* see on the screen is not much use for diagnosis; it's
the values of the bytes in the file that matter.
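To illustrate why repr() is more use than a bare print, here is a minimal sketch. It is written for Python 3 rather than the Python 2 used in this thread, and the sample string is made up. The same text produces different bytes under different encodings, and repr() shows those bytes no matter what the terminal does with them.

```python
# Hypothetical sample value; the real data would come from the database.
sample = "M\u00fcller"  # contains U+00FC, u with diaeresis

# The same text yields different byte sequences under different encodings.
as_latin1 = sample.encode("latin-1")
as_utf8 = sample.encode("utf-8")

# repr() shows the byte values unambiguously, independent of the terminal.
print(repr(as_latin1))
print(repr(as_utf8))
```

If the two lines printed differ only in the bytes standing for the accented letter, that shows which encoding the data is actually in.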
> Then I use pyXLWriter to write the
> sheet,
Open the spreadsheet with Microsoft Excel, copy-and-paste some data to
a Notepad window, save the Notepad file as Unicode type named (say)
"junk.u16" then at the Python interactive prompt do this:
    file("junk.u16", "rb").read().decode("utf16")
and show us what you get.
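For anyone following along on Python 3, where file() no longer exists and the prompt shows the characters themselves, a rough equivalent might look like this. The file written here is only a stand-in for the one Notepad would save, and its content is invented for the sketch.

```python
import os
import tempfile

# Hypothetical stand-in for text pasted from Excel into Notepad.
pasted = "\u0413 \u00fc"

# Notepad's "Unicode" save type is UTF-16 with a byte-order mark; Python's
# "utf-16" codec writes a BOM on encode and honours it on decode.
path = os.path.join(tempfile.mkdtemp(), "junk.u16")
with open(path, "wb") as f:
    f.write(pasted.encode("utf-16"))

with open(path, "rb") as f:
    text = f.read().decode("utf-16")

# ascii() escapes every non-ASCII character, so the code points are visible.
print(ascii(text))
```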
> and Python email package to email it... and the resulting sheet
> is not good:
E-mailed how? To whom? [I.e. what country / what cultural background /
on what machine / what operating system / viewed using what software]
>
> Г is displayed instead of ü (for example)
You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is
displayed instead of U+00FC (Latin small letter U with diaeresis).
OK, we'd already guessed your background from your name :-)
However, what you see isn't necessarily what you've got. How do you
know it's not U+0393 (Greek capital letter GAMMA) or something else
that looks the same? Could even be from a line-drawing set (top left
corner of a box). What you need to do is find out the ordinal of the
character being displayed.
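One way to get the ordinal and the official name, sketched for Python 3 with the standard unicodedata module. The candidates are the look-alikes mentioned above; U+250C is only an assumed example of a box-drawing corner, and describe() is a helper name invented for this sketch.

```python
import unicodedata

# Look-alike candidates, plus the character that was expected.
candidates = ["\u0413", "\u0393", "\u250c", "\u00fc"]

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))

for ch in candidates:
    print(describe(ch))
```

Pasting the character actually displayed into describe() settles which of these it is.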
This type of problem arises when a character is written in one encoding
and viewed using another. I've had a quick look through various
likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
encodings that would reproduce anything like your "umlauted-u becomes
gamma-or-similar" problem. Please supply more than 1 example.
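The hunt for a write/read pair can also be automated. The following is a sketch only, for Python 3. The encoding list is an assumption: it holds the 8-bit sets named above plus UTF-8, which was not part of the quick look described here. find_pairs() is a name invented for the sketch.

```python
# Candidate encodings: the 8-bit sets mentioned above plus UTF-8, which is
# an addition made for this sketch and not part of the original search.
ENCODINGS = ["latin-1", "koi8-r", "cp1251", "cp1252", "cp437", "cp850",
             "cp866", "utf-8"]

def find_pairs(written, seen, encodings=ENCODINGS):
    """Return (write_enc, read_enc) pairs for which `written`, encoded with
    write_enc and decoded with read_enc, gives text starting with `seen`."""
    pairs = []
    for write_enc in encodings:
        try:
            raw = written.encode(write_enc)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        for read_enc in encodings:
            if read_enc == write_enc:
                continue
            try:
                shown = raw.decode(read_enc)
            except UnicodeDecodeError:
                continue  # bytes not valid in this encoding
            if shown.startswith(seen):
                pairs.append((write_enc, read_enc))
    return pairs

print(find_pairs("\u00fc", "\u0413"))
```

With this list the search does report one pair: text written as UTF-8 and read as cp1251, where the two bytes C3 BC show up as GHE followed by a second character. Whether that is what happened here can only be confirmed from the repr() output and further examples.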