unicode confusing

Paul Boddie paul at boddie.org.uk
Mon May 25 12:07:42 EDT 2009


On 25 May, 17:39, someone <petshm... at googlemail.com> wrote:
> Hi,
>
> reading content of webpage (encoded in utf-8) with urllib2, I can't
> get parsed data into DB
>
> Exception:
>
>   File "/usr/lib/python2.5/site-packages/pyPgSQL/PgSQL.py", line 3111,
> in execute
>     raise OperationalError, msg
> libpq.OperationalError: ERROR:  invalid UTF-8 byte sequence detected
> near byte 0xe4
>
> I've already checked several python unicode tutorials, but I have no
> idea how to solve my problem.

With pyPgSQL, there are a few tricks that you have to take into
account:

1. With PostgreSQL, it would appear advantageous to create databases
using the "-E unicode" option.

2. When connecting, use the client_encoding and unicode_results
arguments for the connect function call:

  connection = PgSQL.connect(client_encoding="utf-8",
                             unicode_results=1)

3. After connecting, it appears necessary to set the client encoding
explicitly:

  connection.cursor().execute("set client_encoding to unicode")

I'd appreciate any suggestions which improve on the above, but these
steps should allow you to present Unicode objects to the database and
to receive Unicode objects from queries. I can't guarantee that you can
relax this and pass UTF-8-encoded strings instead of Unicode objects.
In any case, it's usually recommended that you manipulate Unicode
objects in your program where possible, and here you should be able to
let pyPgSQL deal with the encodings preferred by the database.
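
As a rough illustration of working with Unicode objects inside the
program, the sketch below decodes the raw bytes read from the web page
once, before anything is passed to the database. The helper name
to_unicode and the latin-1 fallback are my own assumptions, not
anything from pyPgSQL or urllib2. The fallback is there because a lone
byte 0xe4 is not valid UTF-8 but is "ä" in Latin-1, so the page may not
really be UTF-8 throughout. The b"" literals need Python 2.6 or later.

```python
def to_unicode(raw, declared="utf-8", fallback="latin-1"):
    """Decode a byte string into a Unicode object.

    `declared` is the encoding the page claims to use; `fallback` is an
    assumed second guess for pages that are mislabelled.
    """
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # The bytes do not match the declared encoding. latin-1 maps
        # every byte to a code point, so this decode cannot fail, but
        # it may produce the wrong characters if the guess is wrong.
        return raw.decode(fallback)


# Hypothetical page fragments: one valid UTF-8, one actually Latin-1.
utf8_text = to_unicode(b"Gr\xc3\xbc\xc3\x9fe")
latin1_text = to_unicode(b"M\xe4rz")
```

The resulting objects would then be the values handed to
cursor.execute as query parameters, leaving the encoding for the
database to pyPgSQL.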

Paul


