usage of <string>.encode('utf-8','xmlcharrefreplace')?
Carsten Haese
carsten at uniqsys.com
Tue Feb 19 08:12:06 EST 2008
On Mon, 18 Feb 2008 22:24:56 -0800 (PST), J Peyret wrote
> [...]
> You are right, I am confused about unicode. Guilty as charged.
You should read http://www.amk.ca/python/howto/unicode to clear up some of
your confusion.
> [...]
> Also doesn't help that I am not sure what encoding is used in the
> data file that I'm using.
That is, incidentally, the direct cause of the error message below.
> [...]
> <class 'psycopg2.ProgrammingError'>
> invalid byte sequence for encoding "UTF8": 0x92
> HINT: This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> "client_encoding".
What this error message means is that you've given the database a byte string
in an unknown encoding, but you're pretending (by default, i.e. by not telling
the database otherwise) that the string is utf-8 encoded. The database is
encountering a byte, 0x92, that is not valid at that position in a utf-8
encoded byte string, so it's raising this error, because your string is
meaningless as utf-8 encoded text.
This is not surprising, since you don't know the encoding of the string. Well,
now we know it's not utf-8.
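To see the same failure without a database, here is a minimal sketch in
Python 3 syntax (where byte strings are written b"..."). The sample bytes are
made up for illustration, and the cp1252 guess is only an assumption about
where such data often comes from, not something your post confirms.

```python
# Hypothetical sample: a byte string containing the offending byte 0x92.
raw = b"It\x92s"

# Try to interpret the bytes as utf-8, which is what the database did.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    bad_byte = raw[exc.start]  # the byte the decoder choked on

# Assumption: the data was produced on Windows. In cp1252 (Windows-1252),
# 0x92 is the right single quotation mark, U+2019.
guess = raw.decode("cp1252")

print(utf8_ok, hex(bad_byte), repr(guess))
```

If the cp1252 guess yields sensible text for your file, that is a hint about
the real encoding, but it is only a hint.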
> column is a varchar(2000) and the "guilty characters" are those used
> in my posting.
I doubt that. The error message is complaining about a byte with the value
0x92. That byte appeared nowhere in the string you posted, so the error
message must have been caused by a different string.
Now for the solution of your problem: If you don't care what the encoding of
your byte string is and you simply want to treat it as binary data, you should
use client_encoding "latin-1" or "iso8859_1" (they're different names for the
same thing). Since latin-1 simply maps the bytes 0 to 255 to unicode code
points 0 to 255, you can store any byte string in the database, and get the
same byte string back from the database. (The same is not true for utf-8 since
not every random string of bytes is a valid utf-8 encoded string.)
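The round-trip property can be checked directly. This is a small sketch in
Python 3 syntax. The helper name is_valid_utf8 is made up for the example, and
it only exercises Python's own codecs, not the database.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding never fails...
as_text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_text]

# ...and encoding gives back exactly the bytes we started with.
round_trip = as_text.encode("latin-1")


def is_valid_utf8(data):
    """Return True if data decodes cleanly as utf-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(round_trip == all_bytes)   # latin-1 round trip is lossless
print(is_valid_utf8(all_bytes))  # arbitrary bytes are not valid utf-8
```

With psycopg2, the client encoding can presumably be changed on the
connection with set_client_encoding() before running the insert.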
Hope this helps,
--
Carsten Haese
http://informixdb.sourceforge.net