[Python-3000] Bytes and unicode conversion in C extensions

Tue Jul 29 18:13:02 CEST 2008

Jesus Cea wrote:
>
> Working on the 3.0 version of bsddb, I have the following issue.
>
> Until 3.0, keys and values were strings. For bsddb, they are opaque, and
> stored unchanged.
>
> In 3.0 the string type is replaced by unicode. A new "bytes" type is
> added. So, code like "db.put('key','value')" needs to be changed to
> "db.put(bytes('key', 'utf-8'), bytes('value', 'utf-8'))", or something
> similar.
>
> This is ugly and produces code incompatible with previous Python releases.
>
> I was wondering what to do. The obvious path would be to put a proxy
> object between application code and bsddb, doing the bytes<->unicode
> translation on the fly. This could be problematic when dealing with
> legacy data, since it might not be a validly encoded bytestring. Data
> misrepresentation would be dangerous and could go undetected for a long
> time, slowly corrupting the database data.
>
> Moreover, the data is application specific, so automatic conversion can
> introduce incompatibilities and bugs.
>
> Another approach would be to add a new bsddb method to specify the
> default encoding to use to convert unicode->bytes, and to do the
> conversion internally when getting unicode data as a parameter. The
> issue here is that "u'hi' != b'hi'", so the translation must be done
> both when storing and when retrieving data.
>
> These problems arise because now string != bytes. In fact the
> approach in 3.0 is the right one, and any attempt to hide this difference
> with proxy objects or automatic conversion is going to bite us someday.
>
> So, I'm seriously thinking of accepting *ONLY* "bytes" in the bsddb API
> (when working under Python 3.0), and doing the proxy thing *ONLY* in the
> testsuite, to be able to reuse it.
>
> What do you think?
>
> PS: Since most of the time keys/values are 7-bit, a direct "ascii"
> encoding would be fine... until we are required to store an 8-bit value.

I propose to do something similar to the io.open() function:
add two parameters, 'encoding' and 'errors', that default to "ascii"
and "strict".
Then do the conversions, and raise exceptions on every failure...
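
A minimal sketch of the idea, not actual bsddb code: the class name
EncodingDB and its put()/get() methods are hypothetical, and a plain dict
stands in for the underlying bytes-only database handle. It only shows how
'encoding' and 'errors' could be applied on the way in and on the way out,
with the strict default raising on anything that does not fit.

```python
class EncodingDB:
    """Hypothetical wrapper; a dict stands in for the bytes-only bsddb handle."""

    def __init__(self, db=None, encoding="ascii", errors="strict"):
        self._db = {} if db is None else db
        self.encoding = encoding
        self.errors = errors

    def _to_bytes(self, obj):
        # bytes pass through untouched; str is encoded with the configured codec.
        if isinstance(obj, bytes):
            return obj
        if isinstance(obj, str):
            return obj.encode(self.encoding, self.errors)
        raise TypeError("expected str or bytes, got %s" % type(obj).__name__)

    def put(self, key, value):
        self._db[self._to_bytes(key)] = self._to_bytes(value)

    def get(self, key):
        # Decode on the way out, so that what was stored as str comes back as str.
        raw = self._db[self._to_bytes(key)]
        return raw.decode(self.encoding, self.errors)


db = EncodingDB()
db.put("key", "value")             # stored as b"key" -> b"value"
print(db.get("key"))

try:
    db.put("cl\u00e9", "value")    # non-ASCII key with the default codec
except UnicodeEncodeError as exc:
    print("rejected:", exc)
```

With encoding="utf-8" the second put() would succeed. Legacy values that
are not valid in the chosen encoding would raise UnicodeDecodeError in
get() instead of being silently misread.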

-- 
Amaury Forgeot d'Arc