[pypy-dev] CFFI speed results

Sat Dec 15 12:14:04 CET 2012

On 12/15/2012 11:27 AM Armin Rigo wrote:
> Hi,
>
> On Sat, Dec 15, 2012 at 7:51 AM, Maciej Fijalkowski <fijall at gmail.com> wrote:
>> And APSW does the same, right? I understand the general need for UTF8,
>> I just didn't find it in this particular query.
>
> Fwiw, I wonder again if we couldn't have all our unicode strings
> internally be UTF8 instead of 2- or 4-byte strings.  This would mean
> a W_UTF8UnicodeObject class that has both a reference to the RPython
> string and some optional extra data to make it faster to locate the
> n'th character or the total unicode length.  (We discussed it on IRC
> some time ago.)
>
>
> A bientôt,
>
> Armin.
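
The structure Armin describes can be pictured roughly as below. This is only a sketch under my own assumptions, not PyPy code: the class name Utf8Str, the STEP constant, and the lazy index layout are all hypothetical, and it is written for Python 3 so that it runs on a current interpreter. It keeps the UTF-8 bytes plus an optional table of byte offsets for every STEP-th character, so finding the n'th character scans at most STEP - 1 characters instead of the whole string.

```python
class Utf8Str:
    """Hypothetical stand-in for the W_UTF8UnicodeObject idea (not PyPy code)."""

    STEP = 4  # index every 4th character; deliberately tiny for demonstration

    def __init__(self, text):
        self.data = text.encode('utf-8')
        self._length = None  # number of characters, computed on first use
        self._index = None   # byte offsets of characters 0, STEP, 2*STEP, ...

    @staticmethod
    def _is_start(byte):
        # UTF-8 continuation bytes look like 0b10xxxxxx; any other byte
        # begins a new character.
        return (byte & 0xC0) != 0x80

    def _build(self):
        # One pass over the bytes fills in both the length and the index.
        offsets, count = [], 0
        for pos, byte in enumerate(self.data):
            if self._is_start(byte):
                if count % self.STEP == 0:
                    offsets.append(pos)
                count += 1
        self._index, self._length = offsets, count

    def __len__(self):
        if self._length is None:
            self._build()
        return self._length

    def __getitem__(self, n):
        if not 0 <= n < len(self):
            raise IndexError(n)
        # Jump to the nearest indexed character at or before n ...
        pos = self._index[n // self.STEP]
        # ... then step forward over the remaining characters.
        for _ in range(n % self.STEP):
            pos += 1
            while not self._is_start(self.data[pos]):
                pos += 1
        # Find where this character ends and decode just that slice.
        end = pos + 1
        while end < len(self.data) and not self._is_start(self.data[end]):
            end += 1
        return self.data[pos:end].decode('utf-8')
```

The index is built only when a length or an index lookup is first requested, so strings that are merely passed around or written out never pay for it.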

Since the following holds (Python 2)

     >>> for i in range(256): assert chr(i).decode('latin1') == unichr(i)

I wonder whether something could be gained by having an alternative
internal unicode representation in the form of latin1 8-bit byte strings.
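
To make the idea concrete, here is a minimal sketch in Python 3 (where chr plays the role of Python 2's unichr). The helper names fits_latin1 and to_compact are made up for illustration and do not correspond to anything in PyPy. It repeats the identity check and then stores a string as one byte per character whenever every code point is below 256.

```python
# Python 3 version of the identity check: decoding a single Latin-1 byte
# yields the code point with the same number.
for i in range(256):
    assert bytes([i]).decode('latin1') == chr(i)


def fits_latin1(text):
    """Return True if every code point in text is below 256."""
    return all(ord(ch) < 256 for ch in text)


def to_compact(text):
    """Hypothetical helper: one byte per character when possible, else None."""
    if fits_latin1(text):
        return text.encode('latin1')
    return None  # caller would fall back to a wider representation
```

With this, to_compact("café") gives four bytes, while a string containing a euro sign (U+20AC) is rejected and would need a wider form.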

ISTM a lot of English-speaking and western European locales would hardly
ever need anything else, and generating code to tag and use/transform
alternative representations would be an internal optimization matter.

I suppose some apps could well result in 8-, 16-, and 32-bit unicode strings
and UTF8 all coexisting under the hood, but only when actually needed.
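
As a rough illustration of tagged representations coexisting, the sketch below (Python 3) picks the narrowest fixed-width form per string and keeps the width as a tag next to the bytes. All names here (narrowest_width, pack, unpack, char_at) are hypothetical, and lone surrogates are not handled.

```python
# Codec used for each fixed width, in bytes per character.
_CODECS = {1: 'latin1', 2: 'utf-16-le', 4: 'utf-32-le'}


def narrowest_width(text):
    """Return 1, 2 or 4: the bytes per character a fixed-width store needs."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def pack(text):
    """Return (width_tag, raw_bytes) using the narrowest representation."""
    width = narrowest_width(text)
    return width, text.encode(_CODECS[width])


def unpack(width, data):
    """Inverse of pack."""
    return data.decode(_CODECS[width])


def char_at(width, data, n):
    """Constant-time indexing: every character occupies exactly width bytes."""
    return data[n * width:(n + 1) * width].decode(_CODECS[width])
```

Because the tag travels with the data, code that only sees Latin-1 text never touches the wider forms, which is the "only when actually needed" part.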

Regards,
Bengt Richter