On 6 June 2014 21:34, Paul Sokolovsky <pmiscml@gmail.com> wrote:

On Fri, 06 Jun 2014 20:11:27 +0900
"Stephen J. Turnbull" <stephen@xemacs.org> wrote:

> Paul Sokolovsky writes:
>
> > That kinda means "string is atomic", instead of your "characters
> > are atomic".
>
> I would be very surprised if a language that behaved that way was
> called a "Python subset". No indexing, no slicing, no regexps, no
> .split(), no .startswith(), no sorted() or .sort(), ...!?
>
> If that's not what you mean by "string is atomic", I think you're
> using very confusing terminology.

I'm sorry if I didn't mention it, or didn't make it clear enough - it's
all about layering.

On level 0, you treat strings verbatim, and can write some subset of
apps (my point is that even this level allows to write lot enough
apps). Let's call this set A0.

On level 1, you accept that there's some universal enough conventions
for some chars, like space or newline. And you can write set of
apps A1 > A0.

At heart, this is exactly what the Python 3 "str" type is. The universal convention is "code points". It has nothing to do with encodings or bytes. A Python string is simply a finite sequence of atomic code points: it is indexable, and it has a length. Once you have that, everything else is layered on top of it. How the code points themselves are stored is opaque and irrelevant, apart from the memory and performance consequences of the implementation decisions (for example, an implementation could index a string by iterating from the start until it finds the nth code point).
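To make that concrete, here is a rough sketch. The first part uses the real str type. The second part is a toy class, Utf8Str, which is purely hypothetical (it is not how CPython or any other implementation stores strings) and only supports integer indexing. It keeps UTF-8 bytes internally and finds the nth code point by scanning from the start, yet from the outside it still looks like a sequence of code points.

```python
s = "na\u00efve \U0001F600"      # 'naive' with a diaeresis, a space, an emoji

n = len(s)                       # counts code points, not bytes
third = s[2]                     # the single code point U+00EF
code_points = [ord(c) for c in s]


class Utf8Str:
    """Hypothetical toy type: UTF-8 inside, code points outside."""

    def __init__(self, text):
        self._data = text.encode("utf-8")

    def _starts(self):
        # Yield the byte offset at which each code point begins.
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        for offset, byte in enumerate(self._data):
            if byte & 0xC0 != 0x80:
                yield offset

    def __len__(self):
        return sum(1 for _ in self._starts())

    def __getitem__(self, index):
        # Linear scan: slow, but the observable behaviour matches str
        # for integer indices (slices are not handled in this sketch).
        starts = list(self._starts())
        if index < 0:
            index += len(starts)
        if not 0 <= index < len(starts):
            raise IndexError("index out of range")
        begin = starts[index]
        end = starts[index + 1] if index + 1 < len(starts) else len(self._data)
        return self._data[begin:end].decode("utf-8")


toy = Utf8Str(s)
same_length = len(toy) == len(s)
same_items = all(toy[i] == s[i] for i in range(len(s)))
```

The only difference a caller could notice between the two is speed and memory use, which is the point: the storage is an implementation detail.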

Similarly, the "bytes" type is a sequence of 8-bit bytes.

Encodings are simply a way to carry code points over a byte-oriented transport.
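A small illustration of that, using only the built-in str.encode and bytes.decode. The sample text is my own choice, nothing more.

```python
text = "na\u00efve"                  # five code points

utf8 = text.encode("utf-8")          # one possible byte representation
utf16 = text.encode("utf-16-le")     # another one, with different bytes

# Same code points, different numbers of bytes on the wire.
lengths = (len(text), len(utf8), len(utf16))

# Indexing bytes gives an integer in range(256), not a character.
first_byte = utf8[0]

# Decoding with the matching encoding gets the same string back.
round_trip_utf8 = utf8.decode("utf-8")
round_trip_utf16 = utf16.decode("utf-16-le")
```

The string is the same object of thought in both cases; only the byte-level transport differs.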

Tim Delaney