[Python-Dev] Internal representation of strings and Micropython

Nick Coghlan ncoghlan at gmail.com
Sat Jun 7 02:37:28 CEST 2014


On 7 Jun 2014 00:53, "Paul Sokolovsky" <pmiscml at gmail.com> wrote:
>
> Yes. Except for one small detail - Python3 specifies these code points
> to be Unicode code points. And Unicode is a very bloated thing.

I rather suspect users of East Asian & African scripts might have a
different notion of what constitutes "bloated" vs "can actually represent
this language properly, unlike 8-bit code spaces".

> But if we drop that "Unicode" stipulation, then it's also exactly what
> MicroPython implements. Its "str" type consists of codepoints, we don't
> have pet names for them yet, like Unicode does, but their numeric
> values are 0-255. Note that it in no way limits encodings, characters,
> or scripts which can be used with MicroPython, because just like
> Unicode, it supports the concept of "surrogate pairs" (but we don't call it
> like that) - specifically, smaller code points may comprise bigger
> groupings. But unlike Unicode, we don't stipulate format, value or
> other constraints on how these "surrogate pairs"-alikes are formed,
> leaving that to users.

This is effectively what the Python 2 str type does, and it's a recipe for
data-driven latent defects. You inevitably end up concatenating strings
that use different code spaces, or else splitting strings in the middle of
a surrogate pair rather than on the proper boundaries, etc.
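
As a rough illustration of both failure modes (a sketch only, not anything
taken from MicroPython): Python 3's bytes type stands in here for a str made
of 0-255 units, and the helper name try_decode is made up for this example.

```python
# Python 3 bytes behave much like the Python 2 str described above:
# a sequence of 0-255 values with no record of which encoding produced them.
latin1_part = "café".encode("latin-1")   # 4 bytes, one per character
utf8_part = "日本".encode("utf-8")        # 6 bytes, three per character


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Failure mode 1: concatenating data from different code spaces succeeds
# silently, and the defect only surfaces later, when someone decodes it.
mixed = latin1_part + utf8_part
mixed_as_utf8 = try_decode(mixed, "utf-8")      # not valid UTF-8
mixed_as_latin1 = try_decode(mixed, "latin-1")  # decodes, but to mojibake

# Failure mode 2: slicing at an arbitrary unit offset can cut a multi-byte
# sequence in half.
truncated = try_decode(utf8_part[:4], "utf-8")

# With code points as the atomic unit, the same slice is always well formed.
first_char = "日本"[:1]
```

Whether either defect ever triggers depends entirely on the data that
happens to flow through the program, which is why they tend to stay latent.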

The abstraction presented to users by the str type *must* be the full range
of Unicode code points as atomic units. Storing those internally as UTF-8
rather than as fixed-width code points, as CPython does, is an experiment
worth trying, since you don't have the same C-level backwards compatibility
constraints we do. But limiting the str type to a single code page per
process is not an acceptable constraint in a Python 3 implementation.
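
To make the UTF-8 idea a little more concrete, here is a minimal sketch of a
type that stores UTF-8 bytes internally but only ever hands out whole code
points. The class name Utf8Str is hypothetical; this is not how CPython or
MicroPython implement str, it only supports integer indexing, and it assumes
the text contains no lone surrogates.

```python
class Utf8Str:
    """Hypothetical sketch: UTF-8 storage, code point abstraction."""

    def __init__(self, text):
        # Internal representation: variable-width UTF-8 bytes.
        self._data = text.encode("utf-8")

    def __len__(self):
        # Every code point has exactly one byte that is NOT a
        # continuation byte (continuation bytes look like 10xxxxxx).
        return sum(1 for b in self._data if (b & 0xC0) != 0x80)

    def __getitem__(self, index):
        length = len(self)
        if index < 0:
            index += length
        if not 0 <= index < length:
            raise IndexError("string index out of range")
        # Linear scan for the start of the requested code point and
        # the start of the one after it.
        seen = -1
        start = None
        for pos, b in enumerate(self._data):
            if (b & 0xC0) != 0x80:
                seen += 1
                if seen == index:
                    start = pos
                elif seen == index + 1:
                    return self._data[start:pos].decode("utf-8")
        # The requested code point was the last one in the buffer.
        return self._data[start:].decode("utf-8")


s = Utf8Str("aé日😀")    # 1-, 2-, 3- and 4-byte sequences
n_code_points = len(s)   # counts code points, not the 10 stored bytes
third = s[2]
last = s[-1]
```

The trade-off is visible in __getitem__: indexing becomes a linear scan
rather than a constant-time lookup, in exchange for compact storage. Users
of the type still never see a partial character.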

Regards,
Nick.
