On 17/02/13 11:43, Armin Rigo wrote:
Hi,
On Tue, Feb 12, 2013 at 7:14 PM, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
We are also looking into adding a special ffi.string_decode_UTF8 to CFFI's backend, to reduce the number of calls needed to go from a utf8 char* to PyPy's unicode.
A first note: I'm wondering why you need to convert from utf-8-that-contains-only-ascii, to unicode, and back. What is the point of having unicode strings in the first place? Can't you just pass around your complete program plain non-unicode strings?
The problem is that SQLite internally uses UTF-8, so you cannot know in advance whether the char* you get back from it is plain ASCII or UTF-8-encoded Unicode. We therefore always convert the char* that SQLite returns to Unicode. When sending data to it, we have separate code paths for Python's str() and unicode() string representations. Unfortunately, due to the nature of our data (it's multilingual), and to make our lives easier when writing our relational operators (in Python), we always convert to Unicode inside our operators, so the str() path inside the MSPW SQLite wrapper mostly sits unused.
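To make the overhead concrete, here is a minimal sketch of the conversion path described above, using only the existing CFFI API (ffi.string followed by a Python-level decode). The byte string here merely stands in for what sqlite3_column_text() would hand back; no actual SQLite call is made.

```python
# Sketch of today's two-step char* -> unicode path in CFFI.
from cffi import FFI

ffi = FFI()

# Simulate a UTF-8 encoded "char *" as SQLite would return it.
utf8_bytes = u"\u03b3\u03b5\u03b9\u03ac abc".encode("utf-8")
c_str = ffi.new("char[]", utf8_bytes)

# Step 1: char* -> byte string (one copy).
as_bytes = ffi.string(c_str)
# Step 2: byte string -> unicode (a second pass and a second copy).
as_unicode = as_bytes.decode("utf-8")

assert as_unicode == u"\u03b3\u03b5\u03b9\u03ac abc"
```

A fused ffi.string_decode_UTF8, as proposed above, would collapse these two copies into one call.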
If not, then indeed, it would make (a bit of) sense to have ways to convert directly between "char *" and unicode strings, in both directions, assuming utf-8. This could be done with an API like:
ffi.encode_utf8(unicode_string) -> new_char*_cdata
ffi.encode_utf8(unicode_string, target_char*_cdata, maximum_length)
ffi.decode_utf8(char*_cdata, [length]) -> unicode_string
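For illustration, the proposed functions can be emulated today on top of the existing CFFI API; note that ffi.encode_utf8 and ffi.decode_utf8 themselves do not exist in CFFI, the names come from the proposal above, and each emulated direction pays the extra intermediate byte-string copy the proposal would avoid.

```python
# Hedged sketch: emulating the proposed API with existing CFFI calls
# (ffi.new, ffi.string, ffi.buffer are real; the function names below
# mirror the proposal and are not part of CFFI).
from cffi import FFI

ffi = FFI()

def decode_utf8(cdata, length=None):
    # Proposed ffi.decode_utf8(char*_cdata, [length]) -> unicode_string.
    if length is None:
        # NUL-terminated: ffi.string stops at the first zero byte.
        return ffi.string(cdata).decode("utf-8")
    # Explicit length: read exactly `length` bytes, then decode.
    return ffi.buffer(cdata, length)[:].decode("utf-8")

def encode_utf8(unicode_string):
    # Proposed ffi.encode_utf8(unicode_string) -> new_char*_cdata.
    return ffi.new("char[]", unicode_string.encode("utf-8"))

c_str = encode_utf8(u"caf\xe9")
assert decode_utf8(c_str) == u"caf\xe9"
assert decode_utf8(c_str, 5) == u"caf\xe9"  # "é" is 2 bytes in UTF-8
```

The native versions would do the same round trip without materialising the intermediate byte string.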
Alternatively, we could accept unicode strings whenever a "char*" is expected and encode it to utf-8, but that sounds a bit too magical.
An API like the one you propose would be very nice and, IMHO, would give a substantial speedup. May I suggest that, for generality, the same API functions also be added for UTF-16 and UTF-32? Thanks Armin and Maciej for looking into this, l.