[Python-Dev] Internal representation of strings and Micropython
Paul Sokolovsky
pmiscml at gmail.com
Fri Jun 6 16:52:17 CEST 2014
Hello,
On Fri, 6 Jun 2014 21:48:41 +1000
Tim Delaney <timothy.c.delaney at gmail.com> wrote:
> On 6 June 2014 21:34, Paul Sokolovsky <pmiscml at gmail.com> wrote:
>
> >
> > On Fri, 06 Jun 2014 20:11:27 +0900
> > "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> >
> > > Paul Sokolovsky writes:
> > >
> > > > That kinda means "string is atomic", instead of your
> > > > "characters are atomic".
> > >
> > > I would be very surprised if a language that behaved that way was
> > > called a "Python subset". No indexing, no slicing, no regexps, no
> > > .split(), no .startswith(), no sorted() or .sort(), ...!?
> > >
> > > If that's not what you mean by "string is atomic", I think you're
> > > using very confusing terminology.
> >
> > I'm sorry if I didn't mention it, or didn't make it clear enough -
> > it's all about layering.
> >
> > On level 0, you treat strings verbatim, and can write some subset of
> > apps (my point is that even this level allows writing quite a lot of
> > apps). Let's call this set A0.
> >
> > On level 1, you accept that there are some sufficiently universal
> > conventions for some chars, like space or newline. And you can
> > write a set of apps A1 > A0.
> >
>
> At heart, this is exactly what the Python 3 "str" type is. The
> universal convention is "code points".
Yes. Except for one small detail - Python 3 specifies these code points
to be Unicode code points. And Unicode is a very bloated thing.
But if we drop that "Unicode" stipulation, then it's also exactly what
MicroPython implements. Its "str" type consists of code points; we don't
have pet names for them yet, like Unicode does, but their numeric
values are 0-255. Note that this in no way limits the encodings,
characters, or scripts which can be used with MicroPython, because, just
like Unicode, it supports the concept of "surrogate pairs" (though we
don't call them that) - specifically, smaller code points may be
combined into bigger groupings. But unlike Unicode, we don't stipulate
the format, values, or other constraints on how these "surrogate
pair"-alikes are formed, leaving that to users.
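To make the idea concrete, here is a minimal sketch, not MicroPython's
actual implementation. It assumes CPython's bytes type is a fair stand-in
for a "str" whose code points are 0-255, and the helper names code_units
and split_on_space are made up for illustration. It shows a level-1
operation (splitting on space) working without knowing how the bigger
groupings are formed:

```python
# Hypothetical illustration: model a "str" whose code points are 0-255,
# using CPython's bytes as a stand-in (an assumption, not MicroPython code).

def code_units(text, encoding="utf-8"):
    """Return the 0-255 values that would back `text` under `encoding`."""
    return list(text.encode(encoding))


def split_on_space(units):
    """Split a list of 0-255 values on 0x20, ignorant of the encoding."""
    parts, current = [], []
    for u in units:
        if u == 0x20:
            parts.append(current)
            current = []
        else:
            current.append(u)
    parts.append(current)
    return parts


sample = "na\u00efve caf\u00e9"   # two non-ASCII characters
units = code_units(sample)         # user chose UTF-8 for the groupings
words = split_on_space(units)

# Only the user, who knows the grouping convention, turns units back
# into characters.
decoded = [bytes(w).decode("utf-8") for w in words]
```

With UTF-8 as the user's convention, each of the two accented characters
occupies two values in the 0-255 range; with Latin-1 each would occupy
one. The split itself is the same in both cases, since it only relies on
the shared convention that 0x20 means space.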
--
Best regards,
Paul mailto:pmiscml at gmail.com