[Python-Dev] Internal representation of strings and Micropython

Wed Jun 4 17:38:31 CEST 2014

Hello,

On Wed, 04 Jun 2014 17:40:14 +0300
Serhiy Storchaka <storchaka at gmail.com> wrote:

> 04.06.14 17:02, Paul Moore написав(ла):
> > On 4 June 2014 14:39, Serhiy Storchaka <storchaka at gmail.com> wrote:
> >> I think than breaking O(1) expectation for indexing makes the
> >> implementation significant incompatible with Python. Virtually all
> >> string operations in Python operates with indices.
> >
> > I don't use indexing on strings except in rare situations. Sure I
> > use lots of operations that may well use indexing *internally* but
> > that's the point. MicroPython can optimise those operations without
> > needing to guarantee O(1) indexing, and I'd be fine with that.
> 
> Any non-trivial text parsing uses indices or regular expressions (and 
> regular expressions themself use indices internally).

I keep hearing this stuff, and unfortunately so far don't have enough
time to collect all that stuff and provide detailed response. So,
here's spur of the moment response - hopefully we're in the same
context so it is easy to understand.

So, gentlemen, you keep mixing up character-by-character random access
to string and taking substrings of a string.

Character-by-character random access imply that you would need to scan
thru (possibly almost) all chars in a string. That's O(N) (N-length of
string). With varlength encoding (taking O(N) to index arbitrary char),
there's thus concern that this would be O(N^2) op.

But show me real-world case for that. Common usecase is scanning string
left-to-right, that should be done using iterator and thus O(N).
Right-to-left scanning would be order(s) of magnitude less frequent, as
and also handled by iterator.

What's next? You're doing some funky anagrams and need to swap each 2
adjacent chars? Sorry, naive implementation will be slow. If you're in
serious anagram business, you'll need to code C extension. No, wait!
Instead you should learn Python better. You should run a string
windowing iterator which will return adjacent pair and swap those
constant-len strings.

More cases anyone? Implementing DES and doing arbitrary permutations?
Kindly drop doing that on strings, do it on bytes or lists.

Hopefully, the idea is clear - if you *scan* thru string using indexes
in *random* order, you're doing weird thing and *want* weird
performance. Doing stuff is s[0] ot s[-1] - there's finite (and small)
number of such operation per strings.

Now about taking substrings of strings (which in Python often expressed
by slice indexing). Well, this is quite different from scanning each
character of a strings. Just like s[0]/s[-1] this usually happens
finite number of times for a particular string, independent of its
length, i.e. O(1) times (ex, you take a string and split it in 3
parts), or maybe number of substrings is not bound-fixed, but has
different growth order, O(M) (for example, you split string in tokens,
tokens can be long, but there're usually external limits on how many
it's sensible to have on one line).

So, again, you're not going to get quadric time unless you're unlucky
or sloppy. And just again, you should brush up your Python skills and
use regex functions shich return iterators to get your parsed tokens,
etc.

(To clarify the obvious - "you" here is abstract pronoun, not referring
to respectable Python developers who actually made it possible to write
efficient Python programs).

So, hopefully the point is conveyed - you can write inefficient Python
programs. CPython goes out of the way to hide many inefficiencies (using
unbelievably bloated heap usage - from uPy's point of view, which
starts up in 2K heap). You just shouldn't write inefficient programs,
voila. But if you want, you can keep writing inefficient programs, they
just will be inefficient. Peace.

> It would be interesting to collect a statistic about how many
> indexing operations happened during the life of a string in typical
> (Micro)Python program.

Yup.

-- 
Best regards,
 Paul                          mailto:pmiscml at gmail.com