[pypy-dev] Unicode encode/decode speed

Sun Feb 17 10:43:45 CET 2013

Hi,

On Tue, Feb 12, 2013 at 7:14 PM, Eleytherios Stamatogiannakis
<estama at gmail.com> wrote:
> Also we are looking into adding a special ffi.string_decode_UTF8 in CFFI's
> backend to reduce the number of calls that are needed to go from utf8_char*
> to PyPy's unicode.

A first note: I'm wondering why you need to convert from
utf-8-that-contains-only-ascii to unicode, and back.  What is the
point of having unicode strings in the first place?  Can't you just
pass plain non-unicode strings around your whole program?

If not, then indeed, it would make (a bit of) sense to have ways to
convert directly between "char *" and unicode strings, in both
directions, assuming utf-8.  This could be done with an API like:

ffi.encode_utf8(unicode_string) -> new_char*_cdata
ffi.encode_utf8(unicode_string, target_char*_cdata, maximum_length)
ffi.decode_utf8(char*_cdata, [length]) -> unicode_string
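
To make the intended semantics concrete, here is a rough sketch of the
three calls in plain Python.  None of these functions exist in CFFI;
the names simply mirror the proposal above.  The sketch uses buffers
from the standard-library ctypes module as a stand-in for "char *"
cdata so that it runs without CFFI, and it assumes Python 3, where
str is the unicode type.  The "buffer too small" behaviour is also an
assumption, since the proposal does not say what should happen then.

```python
import ctypes


def encode_utf8(text, target=None, maximum_length=None):
    """Hypothetical: encode text as UTF-8 into a NUL-terminated C buffer."""
    data = text.encode("utf-8")
    if target is None:
        # First form: allocate a new buffer; ctypes appends the NUL byte.
        return ctypes.create_string_buffer(data)
    # Second form: fill a caller-supplied buffer of maximum_length bytes.
    # Raising on overflow is an assumption made for this sketch.
    if len(data) + 1 > maximum_length:
        raise ValueError("target buffer too small for encoded string")
    ctypes.memmove(target, data + b"\0", len(data) + 1)
    return target


def decode_utf8(cdata, length=None):
    """Hypothetical: decode UTF-8 bytes from a C buffer into a str."""
    if length is None:
        raw = ctypes.string_at(cdata)          # read up to the first NUL
    else:
        raw = ctypes.string_at(cdata, length)  # read exactly length bytes
    return raw.decode("utf-8")


buf = encode_utf8("caf\u00e9")       # new buffer holding 5 bytes plus NUL
text = decode_utf8(buf)              # back to a 4-character str

target = ctypes.create_string_buffer(16)
encode_utf8("bient\u00f4t", target, 16)
```

Note that the optional length of decode_utf8 counts bytes, not
characters, so a length that cuts a multi-byte sequence in half would
make the decode fail.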

Alternatively, we could accept unicode strings whenever a "char*" is
expected and encode them to utf-8, but that sounds a bit too magical.
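
For illustration only, the implicit variant can be imitated with a
small wrapper.  Both auto_utf8 and byte_length are hypothetical names
invented for this sketch; byte_length stands in for a C function that
takes a "char *".  Again this assumes Python 3.

```python
import functools


def auto_utf8(func):
    """Hypothetical wrapper: silently encode every str argument to UTF-8."""
    @functools.wraps(func)
    def wrapper(*args):
        converted = [a.encode("utf-8") if isinstance(a, str) else a
                     for a in args]
        return func(*converted)
    return wrapper


@auto_utf8
def byte_length(c_string):
    # Stand-in for a C function taking "char *"; it only ever sees bytes.
    return len(c_string)


n = byte_length("caf\u00e9")   # caller passes 4 characters, callee sees 5 bytes
```

The caller never sees the conversion, which is exactly the part that
feels magical: the length the C side observes no longer matches the
length of the string that was passed in.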

A bientôt,

Armin.