[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
Armin Rigo
arigo at tunes.org
Tue Mar 8 10:16:21 EST 2016
Hi hubo,
On 7 March 2016 at 13:49, hubo <hubo at jiedaibao.com> wrote:
> I think in Python 3.x, u'\ud805\udc09' is not another format of
> u'\U00011409', it is just an illegal unicode string. It also raises
> UnicodeEncodeError if you try to encode it into UTF-8. The problem is that
> it is legal to define and use these strings. If PyPy uses UTF-8 or UTF-16 as
> the internal storage format, I don't think it is possible to keep these
> details same as CPython, but it should be acceptable.
We're good at keeping obscure details the same as CPython. It's only
a matter of adding the correct checks on top of the encode() and
decode() methods, independently of the underlying representation.
In this case, because the length-1 unicode string u'\ud805' is legal,
we have to represent it internally somehow, and the natural way would
be to represent it as the 3 bytes '\xed\xa0\x85'.
So for u'\ud805\udc09' we use 6 bytes. Strictly speaking, we're thus
not using utf-8 internally, but
"utf-8-without-extra-consistency-checks". In Python 2,
u'\ud805\udc09'.encode('utf-8') returns '\xf0\x91\x90\x89', i.e. a
single code point of 4 bytes. This means that calling
``encode('utf-8')`` has to check for surrogates, and do something more
complicated on Python 2.x (or complain on Python 3.x). In other
words, neither ``decode('utf-8')`` nor ``encode('utf-8')`` can be
no-ops. Decoding and encoding need to check the data, and might
actually need to make a copy in corner cases, but not in the vast
majority of cases.
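
To make the above concrete, here is a rough Python 3 sketch of what such a
check could look like. It is not PyPy code: the names ``to_internal``,
``has_surrogates``, ``encode_utf8_py3`` and ``encode_utf8_py2`` are made up
for this illustration, and the internal storage is only simulated with the
'surrogatepass' error handler.

```python
import re

def to_internal(s):
    """Hypothetical internal form: UTF-8, but surrogates are allowed."""
    return s.encode('utf-8', 'surrogatepass')

def has_surrogates(internal):
    """U+D800..U+DFFF are stored as 0xED followed by a byte in 0xA0..0xBF."""
    pos = internal.find(b'\xed')
    while pos != -1:
        if pos + 1 < len(internal) and 0xA0 <= internal[pos + 1] <= 0xBF:
            return True
        pos = internal.find(b'\xed', pos + 1)
    return False

def encode_utf8_py3(internal):
    """Python 3 behaviour: complain on surrogates, else return as-is."""
    if has_surrogates(internal):
        raise ValueError("surrogates not allowed")
    return internal  # common case: same object, no copy

_PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')

def _join(match):
    # Combine a high/low surrogate pair into one code point.
    hi, lo = match.group()
    return chr(0x10000 + ((ord(hi) - 0xD800) << 10) + (ord(lo) - 0xDC00))

def encode_utf8_py2(internal):
    """Python 2 behaviour: surrogate pairs become one 4-byte sequence."""
    if not has_surrogates(internal):
        return internal  # common case: same object, no copy
    text = internal.decode('utf-8', 'surrogatepass')
    return _PAIR.sub(_join, text).encode('utf-8', 'surrogatepass')
```

With this sketch, ``to_internal('\ud805\udc09')`` is 6 bytes long, and
``encode_utf8_py2()`` turns it into the 4 bytes b'\xf0\x91\x90\x89'; a
string without surrogates is returned unchanged by both encode functions.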
This is all focused on the web and generally Linux approach of "utf-8
everywhere". For Windows, the story is more complicated. CPython 2.x
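uses UTF-16, like the Windows API. However, the recent CPython 3.x
(3.3 and later) moved away from that anyway, towards a variable-width
model that stores Latin-1, UCS-2 or UCS-4 (==UTF-32) depending on the
largest codepoint in the string.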
If you are on a recent CPython 3.x and build a unicode object with a
large codepoint, and then call the Windows API with it, it will need
anyway to convert it to UTF-16 dynamically, as far as I can
tell---i.e. convert from UCS-4 to UTF-16. In the proposal that is
discussed here, it would instead have to convert from
utf-8-without-extra-consistency-checks to UTF-16 in that situation.
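
As an illustration only, the conversion needed before a Windows API call
could be sketched in Python 3 like this. The function name
``internal_to_utf16`` is invented for the example, and a real implementation
would work on the raw bytes instead of going through a str object.

```python
def internal_to_utf16(internal):
    """Convert the simulated internal bytes to UTF-16 (little-endian)."""
    # 'surrogatepass' lets surrogates through in both directions.
    text = internal.decode('utf-8', 'surrogatepass')
    return text.encode('utf-16-le', 'surrogatepass')

# A large codepoint stored as 4 bytes of UTF-8 ...
big = '\U00011409'.encode('utf-8')
# ... and the same pair of surrogates stored as 3 + 3 bytes.
pair = b'\xed\xa0\x85\xed\xb0\x89'

utf16_big = internal_to_utf16(big)
utf16_pair = internal_to_utf16(pair)
```

Both inputs end up as the same two UTF-16 code units, 0xD805 0xDC09, so the
distinction between the two internal forms is lost at that point.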
There are definitely trade-offs to explore, but I doubt that we can
fully explore these trade-offs without actually trying it out.
A bientôt,
Armin.