[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

Carl Friedrich Bolz cfbolz at gmx.de
Mon Mar 7 03:48:53 EST 2016


Hi,

On 07/03/16 08:58, hubo wrote:
> I think it is not reasonable to use UTF-8 to represent the unicode
> string type.
> 1. Less storage - this is not always true. It only holds for strings
> with a lot of ASCII characters. In Asia, most strings in local languages
> (Japanese, Chinese, Korean) consist of non-ASCII characters, so they may
> consume more storage than in UTF-16. To make things worse, while an
> N-character string always consumes 2*N bytes in UTF-16, it is difficult
> to estimate the size of an N-character string in UTF-8 (it may be
> anywhere from N to 3*N bytes).
> (UTF-16 also has two-word characters, but len() reports 2 for these, so
> I think it is not harmful to treat them as two characters.)

Note that in PyPy unicode strings use UTF-32 as the internal
representation on all platforms, so the space savings would be larger.
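
To get a feeling for the sizes involved, here is a quick comparison in
plain CPython (nothing PyPy-specific, and the exact numbers of course
depend on the text):

# Compare the encoded size of ASCII-heavy vs. CJK-heavy text.
ascii_text = u"hello world" * 100
cjk_text = u"\u4f60\u597d\u4e16\u754c" * 100

for name, text in [("ascii", ascii_text), ("cjk", cjk_text)]:
    print("%s: utf-8=%d utf-16=%d utf-32=%d" % (
        name,
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
        len(text.encode("utf-32-le"))))
# ASCII text: UTF-8 uses 1 byte per character vs. 4 in UTF-32.
# CJK text:   UTF-8 uses 3 bytes per character vs. 2 in UTF-16 and 4 in UTF-32.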

Note also that currently almost all I/O operations on many platforms do 
a conversion from UTF-8 to UTF-32 and back, which involves a copy and is 
costly.

> 2. There would be very complicated logic for size calculation and
> slicing. In UTF-16, every character is represented with a 16-bit
> integer, so size calculation and slicing are straightforward. But a
> character in UTF-8 occupies a variable number of bytes, so either we
> call mb_* string functions instead (which are slow by nature) or we use
> special logic like storing the byte offsets of characters in another
> array (which introduces the cost of extra addressing).

This is true; some engineering would have to go into this part of the
representation.
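
One possible direction (purely a sketch of the "extra array" idea from
the quoted mail, written as plain Python with made-up names, not how it
would actually be done inside PyPy): store the UTF-8 bytes plus a table
mapping character index to byte offset, which keeps indexing O(1) at the
price of extra memory.

# Hypothetical sketch: UTF-8 storage with an index -> byte-offset table.
class Utf8String(object):
    def __init__(self, u):
        self._bytes = u.encode("utf-8")
        # offsets[i] = byte offset where character i starts; the final
        # entry is the total byte length, so character i lives in
        # bytes[offsets[i]:offsets[i + 1]].
        offsets = []
        pos = 0
        for ch in u:
            offsets.append(pos)
            pos += len(ch.encode("utf-8"))
        offsets.append(pos)
        self._offsets = offsets

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):   # single non-negative index only, no slices
        start, end = self._offsets[i], self._offsets[i + 1]
        return self._bytes[start:end].decode("utf-8")

s = Utf8String(u"a\u4f60b")
assert len(s) == 3
assert s[1] == u"\u4f60"

The offsets table could of course be made much smaller in practice (for
example one entry every k characters, or no table at all for ASCII-only
strings), which is part of the engineering mentioned above.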

> 3. When displaying with repr(), non-ASCII characters are displayed in
> \uXXXX format. If the internal storage for unicode is UTF-8, the only
> way to be compatible with this format is to convert it back to UTF-16.
> It may be wiser to let programmers decide which encoding they would like
> to use. If they want to process UTF-8 strings without the performance
> cost of converting, they should use "bytes". When correct size
> calculation and slicing of non-ASCII characters matter, it may be better
> to use "unicode".

I think repr is allowed to be a somewhat slow operation.
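
For what it's worth, the \uXXXX escapes can also be produced directly
from the code points, without materialising a UTF-16 copy of the whole
string. A rough sketch (simplified, ignoring quoting, control characters
and so on; a UTF-8 representation would decode code points from its
bytes on the fly, here we cheat and decode the buffer first for brevity):

def escape_for_repr(utf8_bytes):
    out = []
    for ch in utf8_bytes.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)          # printable-ASCII handling elided
        elif cp <= 0xFFFF:
            out.append(u"\\u%04x" % cp)
        else:
            out.append(u"\\U%08x" % cp)
    return u"".join(out)

assert escape_for_repr(u"a\u4f60".encode("utf-8")) == u"a\\u4f60"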

Cheers,

Carl Friedrich

