[Python-Dev] Internal representation of strings and Micropython

Thu Jun 5 14:21:30 CEST 2014

On 5 June 2014 22:01, Paul Sokolovsky <pmiscml at gmail.com> wrote:

>
> All these changes are what let me dream on and speculate on
> possibility that Python4 could offer an encoding-neutral string type
> (which means based on bytes)
>

To me, an "encoding neutral string type" means roughly "characters are
atomic", and the best representation we have for a "character" is a Unicode
code point. Through any interface that provides "characters" each
individual "character" (code point) is indivisible.

To me, Python 3 has exactly an "encoding-neutral string type". It also has
a bytes type that is is just that - bytes which can represent anything at
all.It might be the UTF-8 representation of a string, but you have the
freedom to manipulate it however you like - including making it no longer
valid UTF-8.

Whilst I think O(1) indexing of strings is important, I don't think it's as
important as the property that "characters" are indivisible and would be
quite happy for MicroPython to use UTF-8 as the underlying string
representation (or some more clever thing, several ideas in this thread) so
long as:

1. It maintains a string type that presents code points as indivisible
elements;

2. The performance consequences of using UTF-8 are documented, as well as
any optimisations, tricks, etc that are used to overcome those consequences
(and what impact if any they would have if code written for MicroPython was
run in CPython).

Cheers,

Tim Delaney
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20140605/4ca05eb2/attachment.html>