On 2022-06-20 16:12, Christopher Barker wrote:
Hmm - I’m a bit confused about how you handle mixed / multiple line endings. If you use splitlines(), it will remove the line endings, so if there are two-char line endings, you’ll get off-by-one errors, yes?
I would think you could look for “\n” and get the correct answer (with extraneous “\r”s left in the substrings).
-CHB
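A minimal sketch of the “look for \n” idea above (the function name is mine): scanning for "\n" alone yields correct line-start offsets even with "\r\n" endings, at the cost of stray "\r"s left at the ends of the substrings.

```python
# Sketch of the "\n"-search idea: collect the offset of each line start
# by scanning for "\n" only. With "\r\n" endings the "\r" stays attached
# to the preceding line, but the offsets themselves are correct.
# (Caveat: a lone "\r" ending, which str.splitlines() does treat as a
# line break, is missed by this approach.)
def line_starts(text: str) -> list[int]:
    starts = [0]
    pos = text.find("\n")
    while pos != -1:
        starts.append(pos + 1)
        pos = text.find("\n", pos + 1)
    return starts

print(line_starts("a\r\nbb\nc"))  # [0, 3, 6]
```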
How about something like .split, but returning the spans instead of the strings?
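One way to sketch that “.split, but returning spans” idea today is to walk the separator matches with re.finditer and record the (start, end) of each field between them (the function name and the literal-string separator are my assumptions):

```python
import re

# "split, but returning spans": record the (start, end) indices of each
# field between consecutive separator matches instead of the substrings.
def split_spans(text: str, sep: str = ",") -> list[tuple[int, int]]:
    spans = []
    start = 0
    for m in re.finditer(re.escape(sep), text):
        spans.append((start, m.start()))
        start = m.end()
    spans.append((start, len(text)))
    return spans

# Slicing with the spans reproduces str.split:
s = "a,bb,,c"
print(split_spans(s))  # [(0, 1), (2, 4), (5, 5), (6, 7)]
assert [s[a:b] for a, b in split_spans(s)] == s.split(",")
```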
On Mon, Jun 20, 2022 at 5:04 PM Christopher Barker <pythonchb@gmail.com> wrote:
If you are working with bytes, then numpy could be perfect— not a small dependency of course, but it should work, and work fast.
And a cython method would be quite easy to write, but of course substantially harder to distribute :-(
-CHB
On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders <jonathan@slenders.be> wrote:
Thanks all for all the responses! That's quite a bit to think about.
A couple of thoughts:
1. First, I do support a transition to UTF-8, so I understand we don't want to add more methods that deal with character offsets. (I'm familiar with how strings work in Rust.) However, does that mean we won't be using/exposing any offset at all, or will it become possible to slice using byte offsets?
2. The commercial application I mentioned where this is critical actually uses bytes instead of str; sorry for not mentioning that earlier. We were doing the following, where text is a bytes object: list(accumulate(chain([0], map(len, text.splitlines(True))))) This is significantly faster than a binary regex for finding all universal line endings. The application is an asyncio web app that streams Cisco show-tech files (often several gigabytes) from a file server over HTTP; stores them chunk by chunk in a local cache file on disk; and in the meantime builds an index of byte offsets by running the above expression over every chunk. That way the client web app can quickly load lines from disk as the user scrolls through the file. A very niche application indeed, so use of Cython would be acceptable in this particular case. I published the relevant snippet here to be studied: https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868 It also handles an interesting edge case regarding UTF-16.
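For reference, here is that expression made runnable on a small invented bytes sample. splitlines(True) keeps the line endings, so the running sum of line lengths is the byte offset of each line start (plus the total length as the final entry):

```python
from itertools import accumulate, chain

# The expression from the post, on a tiny bytes chunk (sample data is
# invented). For bytes, splitlines() splits on \r, \n, and \r\n.
text = b"GET /a HTTP/1.1\r\nHost: x\nbody"
offsets = list(accumulate(chain([0], map(len, text.splitlines(True)))))
print(offsets)  # [0, 17, 25, 29]
```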
3. The code in prompt_toolkit can be found here: https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209 (It's not yet using 'accumulate' there, but otherwise it's the same.) Universal line-ending support is important here too, because the editing buffer can in theory contain a mix of line endings. It has to be performant, because it executes on every keystroke. A more complex data structure could probably solve the performance issues in this case, but it's really not worth the complexity it would introduce into every text manipulation (like every key binding). Also, try using the "re" library to search over a list of lines, or over anything that's not a simple string.
4. I tested on 3.11.0b3. The splitlines() approach is still 2.5 times faster than re. Imagine if splitlines() didn't have to do the work of actually creating the substrings, but only had to return the offsets: that should be much faster still, and wouldn't require so much memory. (I have a benchmark that does it one chunk at a time, to avoid using too much memory: https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf )
So, talking about bytes: would it be acceptable to have a `bytes.line_offsets()` method instead? Or `bytes.splitlines(return_offsets=True)`? Because byte offsets are okay, or not? `str.splitlines(return_offsets=True)` would be very nice, but I understand the concerns.
It's somewhat frustrating to know that for `splitlines()` the information is already there, already computed, just not accessible without having Python do lots of unnecessary work.
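A pure-Python sketch of the semantics such a method might have (the name line_offsets and the exact return convention are assumptions, not an agreed API): the start offset of each line, honoring the splitlines() line-boundary rules.

```python
from itertools import accumulate, chain

# Hypothetical reference semantics for a line_offsets() method: the
# offset of each line start. Works for both str and bytes, since both
# have splitlines(); keepends=True makes the running sum of lengths
# equal the start offset of each subsequent line. The final cumulative
# value (the total length) is dropped, matching splitlines(), which
# yields no empty line after a trailing terminator.
def line_offsets(data):
    lengths = map(len, data.splitlines(True))
    return list(accumulate(chain([0], lengths)))[:-1]

print(line_offsets("one\r\ntwo\rthree\nfour"))  # [0, 5, 9, 15]
```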
Jonathan
On Sun, Jun 19, 2022 at 15:34, Jonathan Fine <jfine2358@gmail.com> wrote:
Hi
This is a nice problem, well presented. Here are four comments / questions.
1. How does the introduction of the faster CPython in Python 3.11 affect the benchmarks?
2. Is there an across-the-board change that would speed up this line-offsets task?
3. To limit splitlines memory use (at a small performance cost), chunk the input string into, say, 4 kB blocks.
4. Perhaps anything done here for strings should also be done for bytes.
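A sketch of the chunking idea in point 3 (the function name and block size are my choices): process the input one block at a time so that only one block's worth of substrings is alive at once, holding back the last, possibly unterminated piece of each block so that a "\r\n" pair split across a block boundary is not miscounted.

```python
def chunked_line_offsets(data: bytes, block_size: int = 4096) -> list[int]:
    """Offsets of each line start, computed one block at a time."""
    offsets = [0] if data else []
    base = 0   # absolute offset of the last recorded line start
    buf = b""  # tail of the previous block: an unfinished line, possibly
               # ending in a bare b"\r" that may pair with a coming b"\n"
    for i in range(0, len(data), block_size):
        buf += data[i:i + block_size]
        lines = buf.splitlines(True)
        # Hold the final piece back; it may continue into the next block.
        buf = lines.pop()
        for line in lines:
            base += len(line)
            offsets.append(base)
    return offsets

print(chunked_line_offsets(b"one\r\ntwo\rthree\nfour", block_size=4))
# [0, 5, 9, 15]
```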