Encoding troubles
Bryan
bryanjugglercryptographer at yahoo.com
Mon May 17 22:38:51 EDT 2010
Neil Hodgson wrote:
> JB:
>
> > as hyphens (–) and apostrophes (’) are in an odd encoding. When passed
> > to the database using sqlalchemy they appear as – and other
> > characters.
>
> The encoding is UTF-8. Normally the best way to handle encodings is
> to convert to Unicode strings (unicode(s, "UTF-8")) as soon as possible
> and perform most processing in Unicode.
Good advice to work in Unicode (and in Python 3.X str is unicode), but
I'd guess the encoding he's getting is "Windows-1252". The default
character set of HTTP is ISO-8859-1, but Microsoft likes to use
Windows-1252 in its place.
What to do about it? First, try specifying utf-8 in the form
containing the textarea, as in
<form action="process.cgi" accept-charset="utf-8">
Note that specifying ISO-8859-1 will not work, in that Microsoft will
still use Windows-1252. I've heard they've gotten better at supporting
utf-8, but I haven't tested.
When a request comes in, check for a Content-Type header that names
the character set, which should be:
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Then you can decode to a unicode object as Neil Hodgson explained.
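The header check above could look roughly like the following in Python 3. This is only a sketch: the helper names charset_from_content_type and parse_form are made up for illustration, and falling back to Windows-1252 when no charset is named is an assumption based on the guess earlier in this message, not something any standard requires.

```python
from email.message import Message
from urllib.parse import parse_qs


def charset_from_content_type(header_value, default="windows-1252"):
    """Return the charset named in a Content-Type header value, or a default.

    The default is an assumption for browsers that send no charset.
    """
    if not header_value:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_param("charset")
    # get_param returns None when the parameter is absent.
    return charset if isinstance(charset, str) and charset else default


def parse_form(raw_body, content_type):
    """Percent-decode a form body using the charset from the header.

    Assumes the body is ASCII with %XX escapes, as browsers normally send it.
    """
    charset = charset_from_content_type(content_type)
    return parse_qs(raw_body.decode("ascii"), encoding=charset, errors="replace")
```

With this, a body of b"text=%E2%80%93" and a utf-8 Content-Type yields an en dash, and b"text=it%92s" with no charset falls back to Windows-1252 and yields a curly apostrophe.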
In case you still have to deal with Windows-1252, Python knows how to
translate Windows-1252 to the best-fit in Unicode. In current Python
2.x:
ustring = unicode(raw_string, 'Windows-1252')
In Python 3.X, what comes from a socket is bytes, and str means
unicode:
ustring = str(raw_bytes, 'Windows-1252')
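If you cannot tell which of the two encodings a client used, one common approach (my suggestion, not something tested against JB's setup) is to try strict UTF-8 first and fall back to Windows-1252 only when that fails. Text with non-ASCII characters in Windows-1252 is rarely valid UTF-8, so the strict decode works as a cheap detector. The function name decode_form_bytes is hypothetical.

```python
def decode_form_bytes(raw_bytes):
    """Decode bytes as UTF-8 if valid, otherwise as Windows-1252."""
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values (e.g. 0x81) are undefined in Python's cp1252
        # codec, so replace them rather than raise a second time.
        return raw_bytes.decode("windows-1252", errors="replace")
```

For example, b"\xe2\x80\x93" decodes as UTF-8 to an en dash, while b"it\x92s" is not valid UTF-8 and falls through to Windows-1252, giving a curly apostrophe.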
Of course this all assumes that JB's database likes Unicode. If it
chokes, then alternatives include encoding back to utf-8 and storing
as binary, or translating characters to some best-fit in the set the
database supports.
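The two alternatives might be sketched like this in Python 3. The BEST_FIT table and both function names are made up for illustration, the table covers only a few common punctuation characters, and Latin-1 as the target set is an assumption about what the database accepts.

```python
# Hypothetical best-fit table for typographic punctuation.
BEST_FIT = {
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
}


def to_utf8_bytes(text):
    """Encode to UTF-8 for storage in a binary column."""
    return text.encode("utf-8")


def to_best_fit(text, target="latin-1"):
    """Map known characters to close equivalents, then encode to the target set.

    Anything still not representable becomes '?'.
    """
    translated = text.translate(str.maketrans(BEST_FIT))
    return translated.encode(target, errors="replace")
```

Storing UTF-8 bytes loses nothing but leaves decoding to whoever reads the column; the best-fit route keeps the column as ordinary text at the cost of flattening the punctuation.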
--
Bryan Olson