[Python-Dev] Internal representation of strings and Micropython
Steven D'Aprano
steve at pearwood.info
Wed Jun 4 16:12:45 CEST 2014
On Wed, Jun 04, 2014 at 01:14:04PM +0000, Steve Dower wrote:
> I agree with Daniel. Directly indexing into text suggests an
> attempted optimization that is likely to be incorrect for a set of
> strings.
I'm afraid I don't understand this argument. The language semantics say
that a string is an array of code points. Every index relates to a
single code point; no code point extends over two or more indexes.
There's a 1:1 relationship between code points and indexes. How is
direct indexing "likely to be incorrect"?
e.g.

    s = "---ÿ---"
    offset = s.index('ÿ')
    assert s[offset] == 'ÿ'

That cannot fail with Python's semantics.
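[Aside: it does fail in Python 2 when the string literal is UTF-8
encoded: s is then a byte string in which 'ÿ' occupies two bytes, so
s[offset] is only the first of them. That shows that the idea that
"strings are bytes" is fatally broken. Fortunately Python has moved
beyond that.]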
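
To make the contrast concrete, here is a minimal Python 3 sketch. It first repeats the check on a str (code point indexing), then encodes the same text as UTF-8 and indexes the resulting bytes, which roughly imitates what a Python 2 byte string would do. The names b, boffset and first_byte are mine, chosen only for illustration.

```python
s = "---ÿ---"

# str indexing is by code point: one index, one code point.
offset = s.index("ÿ")
assert s[offset] == "ÿ"

# The same text as UTF-8 bytes: 'ÿ' (U+00FF) is encoded as two bytes, C3 BF.
b = s.encode("utf-8")
boffset = b.index("ÿ".encode("utf-8"))

# Taking a single element at that offset yields only the lead byte,
# not the whole character.
first_byte = b[boffset:boffset + 1]
assert first_byte == b"\xc3"
assert first_byte != "ÿ".encode("utf-8")

# Seven code points, but eight bytes.
assert len(s) == 7
assert len(b) == 8
```

The byte offset and the code point offset happen to agree here only because everything before 'ÿ' is ASCII.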
> Splitting, regex, concatenation and formatting are really the
> main operations that matter, and MicroPython can optimize their
> implementation of these easily enough for O(N) indexing.
Really? Well, it will be a nice experiment. Fortunately MicroPython runs
under Linux as well as on embedded systems (a clever decision, by the
way) so I look forward to seeing how their internal-utf8 implementation
stacks up against CPython's FSR implementation.
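
For anyone wondering where the O(N) comes from: with a UTF-8 buffer, the byte position of the i-th code point is not known in advance, so a straightforward implementation has to scan from the start and count lead bytes. The following is only a sketch of that idea in pure Python, not MicroPython's actual code; the function name utf8_char_at is hypothetical and negative indexes are deliberately left out.

```python
def utf8_char_at(buf: bytes, index: int) -> str:
    """Return the code point at position `index` of UTF-8 encoded `buf`.

    Illustrative only: a linear scan that counts code point boundaries.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1      # index of the most recent code point seen
    start = None    # byte position where the wanted code point begins
    for pos, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                # The next code point begins here, so the wanted one ends.
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    # The wanted code point was the last one in the buffer.
    return buf[start:].decode("utf-8")


text = "---ÿ---"
data = text.encode("utf-8")
# The scan agrees with str indexing at every position.
assert all(utf8_char_at(data, i) == text[i] for i in range(len(text)))
```

By contrast, the FSR (PEP 393) stores every code point of a given string at the same width, so the position of the i-th code point can be computed directly.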
Out of curiosity, when the FSR was proposed, did anyone consider an
internal UTF-8 representation? If so, why was it rejected?
--
Steven