On 17/02/13 11:43, Armin Rigo wrote:
Hi,
On Tue, Feb 12, 2013 at 7:14 PM, Eleytherios Stamatogiannakis <estama@gmail.com> wrote:
We are also looking into adding a special ffi.string_decode_UTF8 to CFFI's backend, to reduce the number of calls needed to go from a utf8 char* to PyPy's unicode.
A first note: I'm wondering why you need to convert from utf-8-that-contains-only-ascii, to unicode, and back. What is the point of having unicode strings in the first place? Can't you just pass around your complete program plain non-unicode strings?
The problem is that SQLite internally uses UTF-8, so you cannot know in advance whether the char* you get back from it is plain ASCII or UTF-8-encoded Unicode. We therefore always convert the char* that SQLite returns to Unicode. When sending data to it, we have separate code paths for Python's str() and unicode() string representations. Unfortunately, due to the nature of our data (it's multilingual), and to make our lives easier when writing our relational operators (in Python), we always convert to Unicode inside our operators, so the str() path inside the MSPW SQLite wrapper mostly sits unused.
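To make the overhead concrete, here is a minimal sketch of the conversion path described above, using only the existing CFFI API (ffi.string followed by a Python-level decode). The byte string here merely stands in for what sqlite3_column_text() would hand back; no actual SQLite call is made.

```python
# Sketch of today's two-step char* -> unicode path in CFFI.
from cffi import FFI

ffi = FFI()

# Simulate a UTF-8 encoded "char *" as SQLite would return it.
utf8_bytes = u"\u03b3\u03b5\u03b9\u03ac abc".encode("utf-8")
c_str = ffi.new("char[]", utf8_bytes)

# Step 1: char* -> byte string (one copy).
as_bytes = ffi.string(c_str)
# Step 2: byte string -> unicode (a second pass and a second copy).
as_unicode = as_bytes.decode("utf-8")

assert as_unicode == u"\u03b3\u03b5\u03b9\u03ac abc"
```

A fused ffi.string_decode_UTF8, as proposed above, would collapse these two copies into one call.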
If not, then indeed, it would make (a bit of) sense to have ways to convert directly between "char *" and unicode strings, in both directions, assuming utf-8. This could be done with an API like:
ffi.encode_utf8(unicode_string) -> new_char*_cdata
ffi.encode_utf8(unicode_string, target_char*_cdata, maximum_length)
ffi.decode_utf8(char*_cdata, [length]) -> unicode_string
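For illustration, the proposed functions can be emulated today on top of the existing CFFI API; note that ffi.encode_utf8 and ffi.decode_utf8 themselves do not exist in CFFI, the names come from the proposal above, and each emulated direction pays the extra intermediate byte-string copy the proposal would avoid.

```python
# Hedged sketch: emulating the proposed API with existing CFFI calls
# (ffi.new, ffi.string, ffi.buffer are real; the function names below
# mirror the proposal and are not part of CFFI).
from cffi import FFI

ffi = FFI()

def decode_utf8(cdata, length=None):
    # Proposed ffi.decode_utf8(char*_cdata, [length]) -> unicode_string.
    if length is None:
        # NUL-terminated: ffi.string stops at the first zero byte.
        return ffi.string(cdata).decode("utf-8")
    # Explicit length: read exactly `length` bytes, then decode.
    return ffi.buffer(cdata, length)[:].decode("utf-8")

def encode_utf8(unicode_string):
    # Proposed ffi.encode_utf8(unicode_string) -> new_char*_cdata.
    return ffi.new("char[]", unicode_string.encode("utf-8"))

c_str = encode_utf8(u"caf\xe9")
assert decode_utf8(c_str) == u"caf\xe9"
assert decode_utf8(c_str, 5) == u"caf\xe9"  # "é" is 2 bytes in UTF-8
```

The native versions would do the same round trip without materialising the intermediate byte string.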
Alternatively, we could accept unicode strings whenever a "char*" is expected and encode it to utf-8, but that sounds a bit too magical.
An API like the one you propose would be very nice and, IMHO, would give a substantial speedup. May I suggest that, for generality, the same API functions also be added for UTF-16 and UTF-32? Thanks Armin and Maciej for looking into this, l.