[Python-3000] Bytes and unicode conversion in C extensions

Antoine Pitrou solipsis at pitrou.net
Tue Jul 29 18:55:40 CEST 2008


Hi,

> In 3.0 the string type is replaced by unicode. A new "byte" type is
> added. So, code like "db.put('key','value')" needs to be changed to
> "db.put(bytes('key', 'utf-8'), bytes('value', 'utf-8'))", or something
> similar.

Why not "db.put(b'key', b'value')"?
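
To make that concrete, here is a small sketch in Python 3 syntax. Nothing in it
is bsddb-specific, and the variable names are only for illustration. For
ASCII-only text the literal and the explicit constructor call build equal
objects:

```python
# The b'...' literal and bytes(str, encoding) give equal objects for ASCII text.
key_literal = b'key'
key_encoded = bytes('key', 'utf-8')
assert key_literal == key_encoded
assert isinstance(key_literal, bytes)

# A bytes literal may only contain ASCII characters, so non-ASCII data is
# written with escapes or encoded explicitly from a str.
accented = 'cl\u00e9'.encode('utf-8')   # 'clé' encoded as UTF-8
assert accented == b'cl\xc3\xa9'
```

So the literal covers the common case of fixed ASCII keys, and an explicit
encode is only needed when the data starts out as text.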

> This is ugly and generates incompatible code with previous python releases.

3.0 *is* meant to break compatibility with previous versions. Of course, it
would be better if databases created with 2.x could be opened using 3.0 without
any complications.

As for "ugly", the b"..." literal syntax avoids that.

> In fact the
> approach in 3.0 is the right one, and any try to hide this difference
> with proxy objects or automatic conversion is going to bite us, someday.

+1.
Or as Amaury suggests, you could add an option, when connecting to a bsddb
database, to open it in binary or text mode. But it seems to me that the bsddb
storage is inherently binary, if it doesn't care about encodings.
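
A rough sketch of what such an option could look like, assuming Python 3. The
class SketchDB, its text_mode and encoding parameters, and the dict used as
storage are all hypothetical stand-ins. This is not the bsddb API, only an
illustration of the binary/text split:

```python
class SketchDB:
    """Hypothetical stand-in for a database handle; storage is a dict of bytes."""

    def __init__(self, text_mode=False, encoding="utf-8"):
        self._store = {}              # always bytes -> bytes internally
        self._text_mode = text_mode
        self._encoding = encoding

    def _to_bytes(self, obj):
        # Text mode: accept only str and encode it explicitly.
        if self._text_mode:
            if not isinstance(obj, str):
                raise TypeError("text mode expects str")
            return obj.encode(self._encoding)
        # Binary mode: accept only bytes, no silent conversion.
        if not isinstance(obj, bytes):
            raise TypeError("binary mode expects bytes")
        return obj

    def _from_bytes(self, raw):
        # Decode on the way out only when the handle was opened in text mode.
        return raw.decode(self._encoding) if self._text_mode else raw

    def put(self, key, value):
        self._store[self._to_bytes(key)] = self._to_bytes(value)

    def get(self, key):
        return self._from_bytes(self._store[self._to_bytes(key)])


binary_db = SketchDB()
binary_db.put(b'key', b'value')
assert binary_db.get(b'key') == b'value'

text_db = SketchDB(text_mode=True)
text_db.put('key', 'value')
assert text_db.get('key') == 'value'
```

Either way the stored data is bytes. The mode only decides whether the
conversion happens in the caller or at the API boundary, and with which
encoding.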

> So, I'm thinking seriously in accepting *ONLY* "bytes" in the bsddb API
> (when working under Python 3.0), and do the proxy thing *ONLY* in the
> testsuite, to be able to reuse it.

You needn't do any proxy thing in the testsuite. Just use b"..." literals; they
also work in 2.6.
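
For instance, a test fragment along these lines should run unchanged on both
2.6 and 3.0. It is only a sketch; the names are illustrative and no bsddb call
is involved:

```python
import sys

# On 2.6, b"..." is accepted and "bytes" is an alias for str.
# On 3.0, the same literal produces a real bytes object.
key = b"key"
value = b"value"
assert isinstance(key, bytes)
assert isinstance(value, bytes)

if sys.version_info[0] >= 3:
    # In 3.x, bytes and str never compare equal.
    assert key != "key"
else:
    # In 2.6, the literal is just a str.
    assert key == "key"
```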

> PPS: In dbm (gdbm) I'm seeing automatic unicode->byte conversion, but NO
> byte->unicode. See the problem when storing non ASCII data:

You should file a bug if there isn't already one.

cheers and good luck with this,

Antoine.



