On 7 June 2014 00:52, Paul Sokolovsky <pmiscml@gmail.com> wrote:

> > At heart, this is exactly what the Python 3 "str" type is. The
> > universal convention is "code points".
>
> Yes. Except for one small detail - Python3 specifies these code points
> to be Unicode code points. And Unicode is a very bloated thing.
>
> But if we drop that "Unicode" stipulation, then it's also exactly what
> MicroPython implements. Its "str" type consists of codepoints, we don't
> have pet names for them yet, like Unicode does, but their numeric
> values are 0-255. Note that it in no way limits encodings, characters,
> or scripts which can be used with MicroPython, because just like
> Unicode, it support concept of "surrogate pairs" (but we don't call it
> like that) - specifically, smaller code points may comprise bigger
> groupings. But unlike Unicode, we don't stipulate format, value or
> other constraints on how these "surrogate pairs"-alikes are formed,
> leaving that to users.

I think you've missed my point.

There is absolutely nothing conceptually bloaty about what a Python 3 string is. It's just like a 7-bit ASCII string, except each entry can be from a larger table. When you index into a Python 3 string, you get back exactly *one valid entry* from the Unicode code point table. That plus the length of the string, plus the guarantee of immutability gives everything needed to layer the rest of the string functionality on top.

There are no surrogate pairs - each code point is standalone (unlike code *units*). It is conceptually very simple. The implementation may be difficult (if you're trying to do better than 4 bytes per code point) but the concept is dead simple.
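
To make the code point / code unit distinction concrete, here is a minimal sketch, assuming CPython 3.3 or later (where str is indexed by code point regardless of internal storage). The sample string and the variable names are my own, chosen only for illustration:

```python
# 'a', EURO SIGN (U+20AC), GRINNING FACE (U+1F600, outside the BMP)
s = "a\u20ac\U0001F600"

# Indexing a Python 3 str yields exactly one code point per position.
code_points = [ord(ch) for ch in s]

# The same text as UTF-16 code units: the non-BMP character needs two units.
utf16 = s.encode("utf-16-le")
units = [int.from_bytes(utf16[i:i + 2], "little")
         for i in range(0, len(utf16), 2)]

# The last two units are a surrogate pair; neither is a character by itself.
high, low = units[2], units[3]
is_surrogate_pair = 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF

print(len(s), code_points)   # three code points
print(len(units), units)     # four code units
```

Surrogates only show up once you look at the encoded form. At the str level there are three entries and each one stands alone.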

If the MicroPython string type requires people *using* it to deal with surrogates (i.e. indexing could return a value that is not a valid Unicode code point) then it will have broken the conceptual simplicity of the Python 3 string type (and most certainly can't be considered in any way compatible).
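
As a rough illustration of that concern, the sketch below uses a CPython bytes object as a stand-in for a string whose elements are limited to 0-255. This is an assumption for the sake of the example, not a description of how MicroPython actually behaves, and the helper decodes_alone is hypothetical:

```python
text = "na\u00efve"          # 'naive' with U+00EF (i with diaeresis) at index 2
raw = text.encode("utf-8")   # stand-in for a string of 0-255 valued elements

def decodes_alone(byte_value):
    """Return True if this single byte is a complete UTF-8 character."""
    try:
        bytes([byte_value]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Code point view: index 2 is the whole character the user typed.
print(len(text), hex(ord(text[2])))

# Byte view: the length differs, and index 2 is only the first half of a
# two-byte sequence, so the caller has to know about the encoding.
print(len(raw), hex(raw[2]), decodes_alone(raw[2]))
```

With the code point view, text[2] is always something you can hand to any other string operation. With the byte view, the user has to reassemble the pieces before the value means anything.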

Tim Delaney