On 7 Jun 2014 00:53, "Paul Sokolovsky" <pmiscml@gmail.com> wrote:
> Yes. Except for one small detail - Python3 specifies these code points to be Unicode code points. And Unicode is a very bloated thing.
I rather suspect users of East Asian & African scripts might have a different notion of what constitutes "bloated" vs "can actually represent this language properly, unlike 8-bit code spaces".
> But if we drop that "Unicode" stipulation, then it's also exactly what MicroPython implements. Its "str" type consists of codepoints; we don't have pet names for them yet, like Unicode does, but their numeric values are 0-255. Note that it in no way limits encodings, characters, or scripts which can be used with MicroPython, because just like Unicode, it supports the concept of "surrogate pairs" (but we don't call them that) - specifically, smaller code points may comprise bigger groupings. But unlike Unicode, we don't stipulate format, value or other constraints on how these "surrogate pairs"-alikes are formed, leaving that to users.
This is effectively what the Python 2 str type does, and it's a recipe for data-driven latent defects. You inevitably end up concatenating strings using different code spaces, or else splitting strings between surrogate pairs rather than on the proper boundaries, etc.

The abstraction presented to users by the str type *must* be the full range of Unicode code points as atomic units. Storing those internally as UTF-8 rather than as fixed-width code points as CPython does is an experiment worth trying, since you don't have the same C-level backwards compatibility constraints we do. But limiting the str type to a single code page per process is not an acceptable constraint in a Python 3 implementation.

Regards,
Nick.
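
The two failure modes named above (splitting inside a multi-byte group, and concatenating data from different code spaces) can be reproduced in a few lines. This is only a minimal sketch: it uses CPython 3 `bytes` as a stand-in for a str type whose units are values 0-255, and the helper name `split_is_valid` and the sample strings are made up for illustration. It says nothing about how MicroPython actually behaves.

```python
# Python 3 bytes stand in for a byte-oriented "str" (units are 0-255).
text = "naïve 日本語"
raw = text.encode("utf-8")


def split_is_valid(data: bytes, index: int) -> bool:
    """Hypothetical helper: True if both halves of data are valid UTF-8."""
    try:
        data[:index].decode("utf-8")
        data[index:].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 1. Splitting on a unit index can land inside a multi-byte group.
#    Index 3 falls between the two bytes that encode "ï".
cut_inside_group = split_is_valid(raw, 3)   # False
cut_on_boundary = split_is_valid(raw, 4)    # True

# 2. Concatenating data from different code spaces goes wrong later,
#    when someone finally tries to interpret the result.
mixed = "café".encode("latin-1") + "日本".encode("utf-8")
try:
    mixed.decode("utf-8")
    mixed_is_utf8 = True
except UnicodeDecodeError:
    mixed_is_utf8 = False
# Decoding as Latin-1 never fails, it just yields the wrong text.
mojibake = mixed.decode("latin-1")

# 3. With code points as the atomic unit, slicing cannot split a character,
#    whatever the internal storage is.
prefix = text[:3]                           # "naï"
units = (len(text), len(raw))               # (9, 16)
```

Both defects are data driven: pure ASCII input passes every check, and the problem only appears once non-ASCII data reaches the same code path.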