Add a line_offsets() method to str

Hi everyone,

Today was the 3rd time I came across a situation where I needed to retrieve all the positions of the line endings (or beginnings) in a very long Python string as efficiently as possible. The first time, it was needed in prompt_toolkit, where I spent a crazy amount of time looking for the most performant solution. The second time was in a commercial project where performance was very critical too. The third time is for the Rich/Textual project from Will McGugan. (See: https://twitter.com/willmcgugan/status/1537782771137011715 )

The problem is that the `str` type doesn't expose any API to efficiently find all \n positions. Every Python implementation is either calling `.index()` in a loop and collecting the results, or running a regex over the string and collecting all positions. For long strings, depending on the implementation, this results in a lot of overhead due to either:

- calling Python functions (or any other Python instruction) for every \n character in the input. The number of executed Python instructions is O(n) here.
- copying string data into new strings.

The fastest solution I've been using for some time does this (simplified): `accumulate(chain([0], map(len, text.splitlines(True))))`. The performance is great here, because the number of Python instructions is O(1). Everything is chained in C code thanks to itertools. Because of that, it can outperform the regex solution by a factor of ~2.5. (Regex isn't slow, but iterating over the results is.)

The bad things about this solution are, however:

- Very cumbersome syntax.
- We call `splitlines()`, which internally allocates a huge number of strings, only to use their lengths. That is still much more overhead than a simple for-loop in C would be.

Performance matters here, because for these kinds of problems, the list of integers that gets produced is typically used as an index to quickly find character offsets in the original string, depending on which line is displayed/processed. The bisect library helps too, to quickly convert any index position in that string into a line number.

The point is that for big inputs, the number of Python instructions executed is not O(n), but O(1). Of course, some of the C code remains O(n).

So, my ask here: would it make sense to add a `line_offsets()` method to `str`? Or even `character_offsets(character)` if we want to do that for any character? Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary lengths?

Thanks,
Jonathan
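Spelled out, with the bisect lookup mentioned above, the pattern looks like this (a minimal stdlib-only sketch):

    from bisect import bisect_right
    from itertools import accumulate, chain

    text = "first\nsecond\nthird\n"

    # Start offset of every line, plus the final end offset.
    # Only O(1) Python-level instructions; the loops run in C.
    offsets = list(accumulate(chain([0], map(len, text.splitlines(True)))))
    # -> [0, 6, 13, 19]

    def line_number(offsets, index):
        # Map a character index in `text` back to its line number.
        return bisect_right(offsets, index) - 1

    line_number(offsets, 8)  # -> 1, i.e. inside "second"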

Hi,

This is a nice problem, well presented. Here are four comments / questions:

1. How does the introduction of faster CPython in Python 3.11 affect the benchmarks?
2. Is there an across-the-board change that would speed up this line-offsets task?
3. To limit splitlines memory use (at a small performance cost), chunk the input string into, say, 4 kB blocks.
4. Perhaps anything done here for strings should also be done for bytes.

-- Jonathan

As for the universal newlines: it seems that either converting when the file is read (the default behavior) or a simple replace of "\r\n" first is a simple solution.

I'm still confused about the use case, though. It seems it involves large amounts of text, where you need to access individual lines, but not where a list-of-lines makes sense. I can see that. But I'm having trouble imagining a use case where you need that, and performance is critical for building the data structure.

But if all those are requirements, I'd think a custom data structure would be ideal: one that was both a large single string, and a sequence (and iterable) of lines. Which reminds me of a ragged array, which I have implemented as an extension to numpy (both in pure Python and Cython). Perhaps a C-implemented (or accelerated) class that does all this would be a nice third-party package. And if proven useful, stdlib in the future.

I imagine you wouldn't want the dependency, but it would be interesting to benchmark a numpy solution. Numpy isn't very memory efficient for strings (UCS-4), but it should be fast.

Final note: it seems the regex solution for a single char is performant, but overly complicated. I've been very happy that I can do most anything with string methods, and rarely need to reach for regex. For something this simple, it would be nice to have a string method.

-CHB

On Sun, Jun 19, 2022 at 3:34 PM Jonathan Fine <jfine2358@gmail.com> wrote:
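For the record, the numpy benchmark suggested here can be a near one-liner for bytes input. A sketch (`newline_offsets` is a hypothetical name, and it finds plain b"\n" only, not universal line endings):

    import numpy as np

    def newline_offsets(data: bytes):
        # View the buffer as uint8 so numpy scans for b"\n" in C.
        arr = np.frombuffer(data, dtype=np.uint8)
        return np.flatnonzero(arr == ord("\n"))

    newline_offsets(b"one\ntwo\nthree")  # -> array([3, 7])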

Thanks all for all the responses! That's quite a bit to think about. A couple of thoughts:

1. First, I do support a transition to UTF-8, so I understand we don't want to add more methods that deal with character offsets. (I'm familiar with how strings work in Rust.) However, does that mean we won't be using/exposing any offsets at all, or will it become possible to slice using byte offsets?

2. The commercial application I mentioned where this is critical is actually using bytes instead of str. Sorry for not mentioning that earlier. We were doing the following, where text is a bytes object:

list(accumulate(chain([0], map(len, text.splitlines(True)))))

This is significantly faster than a binary regex for finding all universal line endings. This application is an asyncio web app that streams Cisco show-tech files (often several gigabytes) from a file server over HTTP; stores them chunk by chunk into a local cache file on disk; and builds an index of byte offsets in the meantime by running the above expression over every chunk. That way the client web app can quickly load the lines from disk as the user scrolls through the file. A very niche application indeed, so use of Cython would be acceptable in this particular case. I published the relevant snippet here to be studied: https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868 It handles an interesting edge case regarding UTF-16.

3. The code in prompt_toolkit can be found here: https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prom... (It's not yet using 'accumulate' there, but for the rest it's the same.) Here too, universal line-ending support is important, because the editing buffer can in theory contain a mix of line endings. It has to be performant, because it executes on every keystroke. In this case, a more complex data structure could probably solve the performance issues, but it's really not worth the complexity it introduces in every text manipulation (like every key binding). Also, try using the "re" library to search over a list of lines or anything that's not a simple string.

4. I tested on 3.11.0b3. Using the splitlines() approach is still 2.5 times faster than re. Imagine if splitlines() didn't have to do the work of actually creating the substrings, but only had to return the offsets: that should be much faster still, and not require so much memory. (I have a benchmark that does it one chunk at a time, to prevent using too much memory: https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf )

So, talking about bytes: would it be acceptable to have a `bytes.line_offsets()` method instead? Or `bytes.splitlines(return_offsets=True)`? Because byte offsets are okay, or not? `str.splitlines(return_offsets=True)` would be very nice, but I understand the concerns. It's somewhat frustrating knowing that for `splitlines()`, the information is there, already computed, just not immediately accessible (without having Python do lots of unnecessary work).

Jonathan

Le dim. 19 juin 2022 à 15:34, Jonathan Fine <jfine2358@gmail.com> a écrit :
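A minimal sketch of that chunk-by-chunk indexing, assuming b"\n"-terminated lines (the gist linked above additionally handles universal line endings and the UTF-16 edge case):

    from itertools import accumulate

    def build_line_index(chunks):
        # Byte offsets of every line start across a stream of chunks,
        # matching the whole-string accumulate/splitlines trick.
        index = [0]
        base = 0
        for chunk in chunks:
            lines = chunk.splitlines(True)
            if lines:
                ends = list(accumulate(map(len, lines), initial=base))[1:]
                if not lines[-1].endswith(b"\n"):
                    ends.pop()  # chunk ended mid-line: not a line start
                index.extend(ends)
            base += len(chunk)
        return index

    build_line_index([b"abc\nde", b"f\nghi\n"])  # -> [0, 4, 8, 12]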

If you are working with bytes, then numpy could be perfect: not a small dependency of course, but it should work, and work fast. And a Cython method would be quite easy to write, but of course substantially harder to distribute :-(

-CHB

On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders <jonathan@slenders.be> wrote:

Hmm, I'm a bit confused about how you handle mixed / multiple line endings. If you use splitlines(), then it will remove the line endings, so if there are two-char line endings, then you'll get off-by-one errors, yes? I would think you could look for "\n" and get the correct answer (with extraneous "\r"s in the substrings)...

-CHB

On Mon, Jun 20, 2022 at 5:04 PM Christopher Barker <pythonchb@gmail.com> wrote:
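(For reference, `splitlines(keepends=True)` does keep the separators, which is why the accumulate trick stays exact even with mixed line endings. A quick check:)

    >>> from itertools import accumulate, chain
    >>> text = "one\r\ntwo\rthree\nfour"
    >>> text.splitlines(True)
    ['one\r\n', 'two\r', 'three\n', 'four']
    >>> list(accumulate(chain([0], map(len, text.splitlines(True)))))
    [0, 5, 9, 15, 19]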

My first thought is that you are using the wrong data structure here: perhaps a list of lines would make more sense than one big ol' string. That being said, a way to get all the indexes of a character could be useful. I'd support that idea.

-CHB

On Fri, Jun 17, 2022 at 10:13 PM Jonathan Slenders <jonathan@slenders.be> wrote:
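A pure-Python model of such a `str.indexes(char)` method might look like the sketch below; a real C implementation would avoid the per-occurrence Python overhead described at the top of the thread:

    def char_indexes(s, ch):
        # One Python-level iteration per occurrence; the scan
        # between hits runs in C via str.find.
        out = []
        i = s.find(ch)
        while i != -1:
            out.append(i)
            i = s.find(ch, i + 1)
        return out

    char_indexes("a\nb\nc", "\n")  # -> [1, 3]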

I'm a little confused by the benchmark. Using re looks pretty competitive in terms of speed, and should be much more memory efficient.

    # https://www.gutenberg.org/cache/epub/100/pg100.txt (5.7 MB; ~170K lines)
    with open('/tmp/shakespeare.txt', 'r') as f:
        text = f.read()

    import re
    from itertools import *
    line_re = re.compile(r"\n")

Then when I run it:

    In [25]: %timeit _ = list(accumulate(chain([0], map(len, text.splitlines(True)))))
    30.4 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [26]: %timeit _ = [m.start() for m in line_re.finditer(text)]
    29 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is on 3.10.3 on an Intel 2.3 GHz i9 MacBook. (Note that the regex is off by one from the splitlines implementation.)

What benchmark shows the regex to be significantly slower? That said, str.indexes(char) sounds like a reasonable addition.

Best wishes,
Lucas Wiman

On Fri, Jun 17, 2022 at 1:12 PM Jonathan Slenders <jonathan@slenders.be> wrote:

Good catch! One correction here: I somewhat mixed up the benchmarks. I forgot that both projects of mine required support for universal line endings, exactly like splitlines() handles out of the box. That requires a more complex regex pattern. I was actually using:

    re.compile(r"\n|\r(?!\n)")

And then the regex becomes significantly slower than the splitlines() solution, which is itself still much slower than it has to be.

This makes me realize that `str.indexes(char)` is actually not what I need, but really a `str.line_offsets()` which returns exactly the positions that `str.splitlines()` would use. Does that make sense? If this is reasonable, I wouldn't mind working on the implementation.

(@Christopher: In Python, a single string as a data structure is often much easier to deal with and overall extremely performant. Try searching over a list of lines.)

Thanks,
Jonathan

Le sam. 18 juin 2022 à 21:09, Lucas Wiman <lucas.wiman@gmail.com> a écrit :
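One plausible pure-Python model of that `str.line_offsets()` (a sketch: the pattern below covers \r, \n and \r\n, whereas splitlines() also recognizes \v, \f, \x1c-\x1e, \x85, \u2028 and \u2029):

    import re

    _BREAK = re.compile(r"\r\n|\r|\n")

    def line_offsets(text):
        # Start offset of each line, as splitlines() would split them.
        # (The accumulate trick additionally appends the final end offset.)
        return [0] + [m.end() for m in _BREAK.finditer(text)]

    line_offsets("one\r\ntwo\rthree\nfour")  # -> [0, 5, 9, 15]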

On Sat, 18 Jun 2022 at 21:57, Jonathan Slenders <jonathan@slenders.be> wrote:
At this point, I'm inclined to say that this is too specialised to be a string method, and it might be better as a 3rd-party utility. I imagine there are problems with that, not least the issue of having to depend on a compiled extension that may not be available on all platforms. But as an *optional* dependency (with a fallback to the current splitlines approach), is it really such a problem?

Sorry to be the one who's always suggesting this, but I'm genuinely interested in why a 3rd-party solution isn't sufficient. After all, it has the advantage of working on older versions of Python (and given that one of your use cases is Textual, I can't imagine anyone would be happy if that required Python 3.12+...).

If the proposal had still been for a general str.indexes(char), I might have thought differently (although given that re is fast when searching for a single character, that may not be worthwhile either).

Paul

This makes me realize that `str.indexes(char)` is actually not what I need, but really a `str.line_offsets()` which returns exactly the positions that `str.splitlines()` would use. Does that make sense?
I'm also thinking of a generic `str.split_indices(char)` that handles all characters like `str.split(char)`, though I can't find any good use cases for it, so I won't recommend it for now. `str.line_offsets()` does seem like a good idea to me though.

Jonathan Slenders writes:
I can't remember ever seeing an application where such a method is not only useful but needs to be extremely efficient, so I'm a little suspicious of the generality of the need. On the other hand, for this particular case you should be able to just subtract the string allocations from .splitlines.

How about generalizing it into a maplines method, whose basic internal function is to generate an iterator over (line_start, line_end) pairs? (line_end is useful because the separator is not necessarily a single character.) Then it could optionally take a function to map over those, as well as an optional separator string. I suppose it would usually be a single character, except for universal newlines, which would be the default case, both because it's so common and because it's not a constant string.

This would then be the implementation of splitlines, as well as a mapping tool for records (usually lines, but could be paragraphs) in a string.
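A rough sketch of that idea as a generator of (line_start, line_end) pairs (hypothetical, and for brevity it recognizes only \r, \n and \r\n rather than the full splitlines() separator set):

    import re

    _BREAK = re.compile(r"\r\n|\r|\n")

    def iter_line_spans(text):
        # Yield (line_start, line_end) per line, separator excluded,
        # matching what splitlines() would return as slices.
        pos = 0
        for m in _BREAK.finditer(text):
            yield (pos, m.start())
            pos = m.end()
        if pos != len(text):
            yield (pos, len(text))

    [text[s:e] for s, e in iter_line_spans("a\r\nb\nc")]
    # -> ['a', 'b', 'c'], same as "a\r\nb\nc".splitlines()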

On Sat, Jun 18, 2022 at 5:13 AM Jonathan Slenders <jonathan@slenders.be> wrote:
First time, it was needed in prompt_toolkit, where I spent a crazy amount of time looking for the most performant solution. Third time is for the Rich/Textual project from Will McGugan. (See: https://twitter.com/willmcgugan/status/1537782771137011715 )
Would you give me a pointer to the code? I want to know its use cases.
FWIW, I had proposed str.iterlines() to fix an incompatibility between IO.readlines() and str.splitlines(). That would be much more efficient than splitlines because it doesn't allocate a huge number of strings at once; it allocates one line string at a time. https://discuss.python.org/t/changing-str-splitlines-to-match-file-readlines... Of course, it would still be slower than your line_offsets() idea because it still needs to allocate line strings many times.
I don't like string offsets, so I don't like adding more methods returning offsets. Currently, string offsets in Python are counted in code points. This is not efficient for Python implementations using UTF-8 (PyPy and MicroPython) or UTF-16 (maybe Jython and IronPython, but I don't know) internally. CPython uses PEP 393 for now, so offsets are efficient. But I want to change it to UTF-8 in the future. A UTF-8 internal encoding is much more efficient for many Python use cases, like:

* Read a UTF-8 string from a text file and write it to a UTF-8 console.
* Read UTF-8 from a database and write it to UTF-8 JSON.

Additionally, there are many very fast string algorithms that work with UTF-8, written in C or Rust. Python is a glue language; reducing overhead with such libraries is good for Python.

For now, my recommendation is to use some library written in Cython if it is performance critical.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
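A tiny illustration of why code-point offsets are awkward for a UTF-8 internal representation: the same position has different numeric offsets in code points and in bytes.

    s = "naïve"
    s.index("v")            # -> 3, counted in code points
    s.encode().index(b"v")  # -> 4, counted in UTF-8 bytes ("ï" is 2 bytes)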

participants (12)
- Christopher Barker
- Eric V. Smith
- Inada Naoki
- Jeremiah Gabriel Pascual
- Jonathan Fine
- Jonathan Slenders
- Lucas Wiman
- MRAB
- n.musolino@gmail.com
- Paul Moore
- Stephen J. Turnbull
- Steve Jorgensen