Add a line_offsets() method to str

Hi everyone,

Today was the 3rd time I came across a situation where I needed to retrieve all the positions of the line endings (or beginnings) in a very long Python string as efficiently as possible. The first time, it was needed in prompt_toolkit, where I spent a crazy amount of time looking for the most performant solution. The second time was in a commercial project where performance was very critical too. The third time is for the Rich/Textual project from Will McGugan. (See: https://twitter.com/willmcgugan/status/1537782771137011715 )

The problem is that the `str` type doesn't expose any API to efficiently find all \n positions. Every Python implementation is either calling `.index()` in a loop and collecting the results, or running a regex over the string and collecting all positions. For long strings, depending on the implementation, this results in a lot of overhead due to either:

- calling Python functions (or any other Python instruction) for every \n character in the input. The number of executed Python instructions is O(n) here.
- copying string data into new strings.

The fastest solution I've been using for some time does this (simplified): `accumulate(chain([0], map(len, text.splitlines(True))))`. The performance is great here, because the number of Python instructions is O(1). Everything is chained in C code thanks to itertools. Because of that, it can outperform the regex solution by a factor of ~2.5. (Regex isn't slow, but iterating over the results is.)

The bad things about this solution are, however:

- Very cumbersome syntax.
- We call `splitlines()`, which internally allocates a huge number of strings, only to use their lengths. That is still much more overhead than a simple for-loop in C would be.

Performance matters here, because for these kinds of problems, the list of integers that gets produced is typically used as an index to quickly find character offsets in the original string, depending on which line is displayed/processed. The bisect library helps too, to quickly convert any index position in that string into a line number.

The point is that for big inputs, the number of Python instructions executed is not O(n), but O(1). Of course, some of the C code remains O(n).

So, my ask here: would it make sense to add a `line_offsets()` method to `str`? Or even `character_offsets(character)` if we want to do that for any character? Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary lengths?

Thanks,
Jonathan
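Spelled out, with the bisect lookup mentioned above, the pattern looks like this (a minimal stdlib-only sketch):

    from bisect import bisect_right
    from itertools import accumulate, chain

    text = "first\nsecond\nthird\n"

    # Start offset of every line, plus the final end offset.
    # Only O(1) Python-level instructions; the loops run in C.
    offsets = list(accumulate(chain([0], map(len, text.splitlines(True)))))
    # -> [0, 6, 13, 19]

    def line_number(offsets, index):
        # Map a character index in `text` back to its line number.
        return bisect_right(offsets, index) - 1

    line_number(offsets, 8)  # -> 1, i.e. inside "second"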

Hi,

This is a nice problem, well presented. Here are four comments / questions:

1. How does the introduction of faster CPython in Python 3.11 affect the benchmarks?
2. Is there an across-the-board change that would speed up this line-offsets task?
3. To limit splitlines memory use (at a small performance cost), chunk the input string into, say, 4 kB blocks.
4. Perhaps anything done here for strings should also be done for bytes.

-- Jonathan

As for the universal newlines: it seems that either converting when the file is read (the default behavior) or a simple replace of "\r\n" first is a simple solution.

I'm still confused about the use case, though. It seems it involves large amounts of text, where you need to access individual lines, but not where a list-of-lines makes sense. I can see that. But I'm having trouble imagining a use case where you need that, and performance is critical for building the data structure.

But if all those are requirements, I'd think a custom data structure would be ideal: one that was both a large single string, and a sequence (and iterable) of lines. Which reminds me of a ragged array, which I have implemented as an extension to numpy (both in pure Python and Cython). Perhaps a C-implemented (or accelerated) class that does all this would be a nice third-party package. And if proven useful, stdlib in the future.

I imagine you wouldn't want the dependency, but it would be interesting to benchmark a numpy solution. Numpy isn't very memory efficient for strings (UCS-4), but it should be fast.

Final note: it seems the regex solution for a single char is performant, but overly complicated. I've been very happy that I can do most anything with string methods, and rarely need to reach for regex. For something this simple, it would be nice to have a string method.

-CHB

On Sun, Jun 19, 2022 at 3:34 PM Jonathan Fine <jfine2358@gmail.com> wrote:
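For the record, the numpy benchmark suggested here can be a near one-liner for bytes input. A sketch (`newline_offsets` is a hypothetical name, and it finds plain b"\n" only, not universal line endings):

    import numpy as np

    def newline_offsets(data: bytes):
        # View the buffer as uint8 so numpy scans for b"\n" in C.
        arr = np.frombuffer(data, dtype=np.uint8)
        return np.flatnonzero(arr == ord("\n"))

    newline_offsets(b"one\ntwo\nthree")  # -> array([3, 7])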

Thanks all for all the responses! That's quite a bit to think about. A couple of thoughts:

1. First, I do support a transition to UTF-8, so I understand we don't want to add more methods that deal with character offsets. (I'm familiar with how strings work in Rust.) However, does that mean we won't be using/exposing any offsets at all, or will it become possible to slice using byte offsets?

2. The commercial application I mentioned where this is critical is actually using bytes instead of str. Sorry for not mentioning that earlier. We were doing the following, where text is a bytes object:

list(accumulate(chain([0], map(len, text.splitlines(True)))))

This is significantly faster than a binary regex for finding all universal line endings. This application is an asyncio web app that streams Cisco show-tech files (often several gigabytes) from a file server over HTTP; stores them chunk by chunk into a local cache file on disk; and builds an index of byte offsets in the meantime by running the above expression over every chunk. That way the client web app can quickly load the lines from disk as the user scrolls through the file. A very niche application indeed, so use of Cython would be acceptable in this particular case. I published the relevant snippet here to be studied: https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868 It handles an interesting edge case regarding UTF-16.

3. The code in prompt_toolkit can be found here: https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prom... (It's not yet using 'accumulate' there, but for the rest it's the same.) Here too, universal line-ending support is important, because the editing buffer can in theory contain a mix of line endings. It has to be performant, because it executes on every keystroke. In this case, a more complex data structure could probably solve the performance issues, but it's really not worth the complexity it introduces in every text manipulation (like every key binding). Also, try using the "re" library to search over a list of lines or anything that's not a simple string.

4. I tested on 3.11.0b3. Using the splitlines() approach is still 2.5 times faster than re. Imagine if splitlines() didn't have to do the work of actually creating the substrings, but only had to return the offsets: that should be much faster still, and not require so much memory. (I have a benchmark that does it one chunk at a time, to prevent using too much memory: https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf )

So, talking about bytes: would it be acceptable to have a `bytes.line_offsets()` method instead? Or `bytes.splitlines(return_offsets=True)`? Because byte offsets are okay, or not? `str.splitlines(return_offsets=True)` would be very nice, but I understand the concerns. It's somewhat frustrating knowing that for `splitlines()`, the information is there, already computed, just not immediately accessible (without having Python do lots of unnecessary work).

Jonathan

Le dim. 19 juin 2022 à 15:34, Jonathan Fine <jfine2358@gmail.com> a écrit :
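A minimal sketch of that chunk-by-chunk indexing, assuming b"\n"-terminated lines (the gist linked above additionally handles universal line endings and the UTF-16 edge case):

    from itertools import accumulate

    def build_line_index(chunks):
        # Byte offsets of every line start across a stream of chunks,
        # matching the whole-string accumulate/splitlines trick.
        index = [0]
        base = 0
        for chunk in chunks:
            lines = chunk.splitlines(True)
            if lines:
                ends = list(accumulate(map(len, lines), initial=base))[1:]
                if not lines[-1].endswith(b"\n"):
                    ends.pop()  # chunk ended mid-line: not a line start
                index.extend(ends)
            base += len(chunk)
        return index

    build_line_index([b"abc\nde", b"f\nghi\n"])  # -> [0, 4, 8, 12]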

If you are working with bytes, then numpy could be perfect: not a small dependency of course, but it should work, and work fast. And a Cython method would be quite easy to write, but of course substantially harder to distribute :-(

-CHB

On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders <jonathan@slenders.be> wrote:

Hmm, I'm a bit confused about how you handle mixed / multiple line endings. If you use splitlines(), then it will remove the line endings, so if there are two-char line endings, then you'll get off-by-one errors, yes? I would think you could look for "\n" and get the correct answer (with extraneous "\r"s in the substrings)...

-CHB

On Mon, Jun 20, 2022 at 5:04 PM Christopher Barker <pythonchb@gmail.com> wrote:
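(For reference, `splitlines(keepends=True)` does keep the separators, which is why the accumulate trick stays exact even with mixed line endings. A quick check:)

    >>> from itertools import accumulate, chain
    >>> text = "one\r\ntwo\rthree\nfour"
    >>> text.splitlines(True)
    ['one\r\n', 'two\r', 'three\n', 'four']
    >>> list(accumulate(chain([0], map(len, text.splitlines(True)))))
    [0, 5, 9, 15, 19]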

My first thought is that you are using the wrong data structure here: perhaps a list of lines would make more sense than one big ol' string. That being said, a way to get all the indexes of a character could be useful. I'd support that idea.

-CHB

On Fri, Jun 17, 2022 at 10:13 PM Jonathan Slenders <jonathan@slenders.be> wrote:
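A pure-Python model of such a `str.indexes(char)` method might look like the sketch below; a real C implementation would avoid the per-occurrence Python overhead described at the top of the thread:

    def char_indexes(s, ch):
        # One Python-level iteration per occurrence; the scan
        # between hits runs in C via str.find.
        out = []
        i = s.find(ch)
        while i != -1:
            out.append(i)
            i = s.find(ch, i + 1)
        return out

    char_indexes("a\nb\nc", "\n")  # -> [1, 3]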

I'm a little confused by the benchmark. Using re looks pretty competitive in terms of speed, and should be much more memory efficient.

    # https://www.gutenberg.org/cache/epub/100/pg100.txt (5.7 MB; ~170K lines)
    with open('/tmp/shakespeare.txt', 'r') as f:
        text = f.read()

    import re
    from itertools import *
    line_re = re.compile(r"\n")

Then when I run it:

    In [25]: %timeit _ = list(accumulate(chain([0], map(len, text.splitlines(True)))))
    30.4 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [26]: %timeit _ = [m.start() for m in line_re.finditer(text)]
    29 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is on 3.10.3 on an Intel 2.3 GHz i9 MacBook. (Note that the regex is off by one from the splitlines implementation.)

What benchmark shows the regex to be significantly slower? That said, str.indexes(char) sounds like a reasonable addition.

Best wishes,
Lucas Wiman

On Fri, Jun 17, 2022 at 1:12 PM Jonathan Slenders <jonathan@slenders.be> wrote:

Good catch! One correction here: I somewhat mixed up the benchmarks. I forgot that both projects of mine required support for universal line endings, exactly like splitlines() handles out of the box. That requires a more complex regex pattern. I was actually using:

    re.compile(r"\n|\r(?!\n)")

And then the regex becomes significantly slower than the splitlines() solution, which is itself still much slower than it has to be.

This makes me realize that `str.indexes(char)` is actually not what I need, but really a `str.line_offsets()` which returns exactly the positions that `str.splitlines()` would use. Does that make sense? If this is reasonable, I wouldn't mind working on the implementation.

(@Christopher: In Python, a single string as a data structure is often much easier to deal with and overall extremely performant. Try searching over a list of lines.)

Thanks,
Jonathan

Le sam. 18 juin 2022 à 21:09, Lucas Wiman <lucas.wiman@gmail.com> a écrit :
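One plausible pure-Python model of that `str.line_offsets()` (a sketch: the pattern below covers \r, \n and \r\n, whereas splitlines() also recognizes \v, \f, \x1c-\x1e, \x85, \u2028 and \u2029):

    import re

    _BREAK = re.compile(r"\r\n|\r|\n")

    def line_offsets(text):
        # Start offset of each line, as splitlines() would split them.
        # (The accumulate trick additionally appends the final end offset.)
        return [0] + [m.end() for m in _BREAK.finditer(text)]

    line_offsets("one\r\ntwo\rthree\nfour")  # -> [0, 5, 9, 15]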

On Sat, 18 Jun 2022 at 21:57, Jonathan Slenders <jonathan@slenders.be> wrote:
At this point, I'm inclined to say that this is too specialised to be a string method, and it might be better as a 3rd-party utility. I imagine there are problems with that, not least the issue of having to depend on a compiled extension that may not be available on all platforms. But as an *optional* dependency (with a fallback to the current splitlines approach), is it really such a problem?

Sorry to be the one who's always suggesting this, but I'm genuinely interested in why a 3rd-party solution isn't sufficient. After all, it has the advantage of working on older versions of Python (and given that one of your use cases is Textual, I can't imagine anyone would be happy if that required Python 3.12+...).

If the proposal had still been for a general str.indexes(char), I might have thought differently (although given that re is fast when searching for a single character, that may not be worthwhile either).

Paul

This makes me realize that `str.indexes(char)` is actually not what I need, but really a `str.line_offsets()` which returns exactly the positions that `str.splitlines()` would use. Does that make sense?
I'm also thinking of a generic `str.split_indices(char)` that handles all characters like `str.split(char)`, though I can't find any good use cases for it, so I won't recommend it for now. `str.line_offsets()` does seem like a good idea to me though.

Jonathan Slenders writes:
I can't remember ever seeing an application where such a method is not only useful but needs to be extremely efficient, so I'm a little suspicious of the generality of the need. On the other hand, for this particular case you should be able to just subtract the string allocations from .splitlines.

How about generalizing it into a maplines method, whose basic internal function is to generate an iterator over (line_start, line_end) pairs? (line_end is useful because the separator is not necessarily a single character.) Then it could optionally take a function to map over those, as well as an optional separator string. I suppose it would usually be a single character, except for universal newlines, which would be the default case, both because it's so common and because it's not a constant string.

This would then be the implementation of splitlines, as well as a mapping tool for records (usually lines, but could be paragraphs) in a string.
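A rough sketch of that idea as a generator of (line_start, line_end) pairs (hypothetical, and for brevity it recognizes only \r, \n and \r\n rather than the full splitlines() separator set):

    import re

    _BREAK = re.compile(r"\r\n|\r|\n")

    def iter_line_spans(text):
        # Yield (line_start, line_end) per line, separator excluded,
        # matching what splitlines() would return as slices.
        pos = 0
        for m in _BREAK.finditer(text):
            yield (pos, m.start())
            pos = m.end()
        if pos != len(text):
            yield (pos, len(text))

    [text[s:e] for s, e in iter_line_spans("a\r\nb\nc")]
    # -> ['a', 'b', 'c'], same as "a\r\nb\nc".splitlines()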

On Sat, Jun 18, 2022 at 5:13 AM Jonathan Slenders <jonathan@slenders.be> wrote:
First time, it was needed in prompt_toolkit, where I spent a crazy amount of time looking for the most performant solution. Third time is for the Rich/Textual project from Will McGugan. (See: https://twitter.com/willmcgugan/status/1537782771137011715 )
Would you give me a pointer to the code? I want to know its use cases.
FWIW, I had proposed str.iterlines() to fix an incompatibility between IO.readlines() and str.splitlines(). That would be much more efficient than splitlines because it doesn't allocate a huge number of strings at once; it allocates one line string at a time. https://discuss.python.org/t/changing-str-splitlines-to-match-file-readlines... Of course, it would still be slower than your line_offsets() idea because it still needs to allocate line strings many times.
I don't like string offsets, so I don't like adding more methods returning offsets. Currently, string offsets in Python are counted in code points. This is not efficient for Python implementations using UTF-8 (PyPy and MicroPython) or UTF-16 (maybe Jython and IronPython, but I don't know) internally. CPython uses PEP 393 for now, so offsets are efficient. But I want to change it to UTF-8 in the future. A UTF-8 internal encoding is much more efficient for many Python use cases, like:

* Read a UTF-8 string from a text file and write it to a UTF-8 console.
* Read UTF-8 from a database and write it to UTF-8 JSON.

Additionally, there are many very fast string algorithms that work with UTF-8, written in C or Rust. Python is a glue language; reducing overhead with such libraries is good for Python.

For now, my recommendation is to use some library written in Cython if it is performance critical.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
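A tiny illustration of why code-point offsets are awkward for a UTF-8 internal representation: the same position has different numeric offsets in code points and in bytes.

    s = "naïve"
    s.index("v")            # -> 3, counted in code points
    s.encode().index(b"v")  # -> 4, counted in UTF-8 bytes ("ï" is 2 bytes)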

participants (12)
- Christopher Barker
- Eric V. Smith
- Inada Naoki
- Jeremiah Gabriel Pascual
- Jonathan Fine
- Jonathan Slenders
- Lucas Wiman
- MRAB
- n.musolino@gmail.com
- Paul Moore
- Stephen J. Turnbull
- Steve Jorgensen