[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

Tue Mar 8 10:16:21 EST 2016

Hi hubo,

On 7 March 2016 at 13:49, hubo <hubo at jiedaibao.com> wrote:
> I think in Python 3.x, u'\ud805\udc09' is not another format of
> u'\U00011409', it is just an illegal unicode string. It also raises
> UnicodeEncodeError if you try to encode it into UTF-8. The problem is that
> it is legal to define and use these strings. If PyPy uses UTF-8 or UTF-16 as
> the internal storage format, I don't think it is possible to keep these
> details same as CPython, but it should be acceptable.

We're good at keeping obscure details the same as CPython.  It's only
a matter of adding the correct checks on top of the encode() and
decode() methods, independently of the underlying representation.

In this case, because the length-1 unicode string u'\ud805' is
perfectly legal, we have to internally represent it somehow, and the
natural way would be to represent it as the 3 bytes '\xed\xa0\x85'.
So for u'\ud805\udc09' we use 6 bytes.  Strictly speaking, we're thus
not using utf-8 internally, but
"utf-8-without-extra-consistency-checks".  In Python 2,
u'\ud805\udc09'.encode('utf-8') returns '\xf0\x91\x90\x89', i.e. a
single code point of 4 bytes.  This means that calling
``encode('utf-8')`` has to check for surrogates, and do something more
complicated on Python 2.x (or complain on Python 3.x).  In other
words, neither ``decode('utf-8')`` nor ``encode('utf-8')`` can be
no-ops.  Decoding and encoding need to check the data, and might
actually need to make a copy in corner cases, but not in the vast
majority of cases.
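
To make the above concrete, here is a rough sketch in plain Python 3.
It is not PyPy code: all the function names are made up for
illustration, and I'm assuming the internal storage is exactly what
``encode('utf-8', 'surrogatepass')`` produces.  It shows the cheap check
for surrogates, the no-copy fast path, and the two different slow
paths.

```python
# Illustrative sketch only; every name here is hypothetical, not a PyPy API.

def internal_bytes(s):
    """Assumed internal storage: UTF-8 where lone surrogates are allowed."""
    return s.encode('utf-8', 'surrogatepass')

def has_surrogates(raw):
    """A surrogate (U+D800..U+DFFF) is stored as 0xED followed by 0xA0..0xBF."""
    pos = raw.find(b'\xed')
    while pos != -1:
        if pos + 1 < len(raw) and raw[pos + 1] >= 0xA0:
            return True
        pos = raw.find(b'\xed', pos + 1)
    return False

def merge_surrogate_pairs(s):
    """Combine each high+low surrogate pair into one code point."""
    out = []
    i = 0
    while i < len(s):
        hi = ord(s[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(s):
            lo = ord(s[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(s[i])
        i += 1
    return ''.join(out)

def encode_utf8_py3_style(raw):
    """Python 3 behaviour: no copy in the common case, error on surrogates."""
    if not has_surrogates(raw):
        return raw
    # The strict encoder raises UnicodeEncodeError for us here.
    return raw.decode('utf-8', 'surrogatepass').encode('utf-8')

def encode_utf8_py2_style(raw):
    """Python 2 behaviour: no copy in the common case, merge pairs otherwise."""
    if not has_surrogates(raw):
        return raw
    s = merge_surrogate_pairs(raw.decode('utf-8', 'surrogatepass'))
    return s.encode('utf-8', 'surrogatepass')
```

With this, ``internal_bytes('\ud805\udc09')`` is the 6 bytes
b'\xed\xa0\x85\xed\xb0\x89', and ``encode_utf8_py2_style()`` on it gives
the 4 bytes b'\xf0\x91\x90\x89'.  A string without surrogates is
returned as the very same bytes object.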

This is all focused on the web and generally Linux approach of "utf-8
everywhere".  For Windows, the story is more complicated.  CPython 2.x
uses UTF-16, like the Windows API.  However, recent CPython 3.x
versions have moved anyway to a variable-width representation (1, 2 or
4 bytes per character, the last being UCS-4 == UTF-32).  If you are on
a recent CPython 3.x and build a unicode object with a large
codepoint, and then call the Windows API with it, it still needs to be
converted to UTF-16 dynamically, as far as I can tell, i.e. converted
from UCS-4 to UTF-16.  In the proposal that is
discussed here, it would instead have to convert from
utf-8-without-extra-consistency-checks to UTF-16 in that situation.
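
As a rough illustration of that last conversion, again in plain Python
3 and with a made-up function name (this is not how PyPy or CPython
actually call the Windows API), assuming the internal bytes are what
``encode('utf-8', 'surrogatepass')`` produces:

```python
# Illustrative sketch only; to_windows_utf16() is a hypothetical helper.

def to_windows_utf16(raw):
    """Convert the assumed internal bytes to little-endian UTF-16.

    Lone surrogates are passed through as single 16-bit units, and
    code points above U+FFFF become a surrogate pair.
    """
    s = raw.decode('utf-8', 'surrogatepass')
    return s.encode('utf-16-le', 'surrogatepass')
```

Note that in this sketch the 4-byte form of u'\U00011409' and the 6-byte
form of u'\ud805\udc09' both end up as the same two UTF-16 units, which
is what a Python 2 narrow build would hand to Windows too.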

There are definitely trade-offs to explore, but I doubt that we can
fully explore these trade-offs without actually trying it out.

A bientôt,

Armin.