On 2022-06-20 16:12, Christopher Barker wrote:
Hmm - I’m a bit confused about how you handle mixed / multiple line endings. If you use splitlines(), it will remove the line endings, so if there are two-char line endings, you’ll get off-by-one errors, yes?
I would think you could look for “\n” and get the correct answer (with extraneous “\r”s left in the substrings).
-CHB
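A minimal sketch of the “look for \n” idea above (the function name is mine): scanning for "\n" alone yields correct line-start offsets even with "\r\n" endings, at the cost of stray "\r"s left at the ends of the substrings.

```python
# Sketch of the "\n"-search idea: collect the offset of each line start
# by scanning for "\n" only. With "\r\n" endings the "\r" stays attached
# to the preceding line, but the offsets themselves are correct.
# (Caveat: a lone "\r" ending, which str.splitlines() does treat as a
# line break, is missed by this approach.)
def line_starts(text: str) -> list[int]:
    starts = [0]
    pos = text.find("\n")
    while pos != -1:
        starts.append(pos + 1)
        pos = text.find("\n", pos + 1)
    return starts

print(line_starts("a\r\nbb\nc"))  # [0, 3, 6]
```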
How about something like .split, but returning the spans instead of the strings?
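One way to sketch that “.split, but returning spans” idea today is to walk the separator matches with re.finditer and record the (start, end) of each field between them (the function name and the literal-string separator are my assumptions):

```python
import re

# "split, but returning spans": record the (start, end) indices of each
# field between consecutive separator matches instead of the substrings.
def split_spans(text: str, sep: str = ",") -> list[tuple[int, int]]:
    spans = []
    start = 0
    for m in re.finditer(re.escape(sep), text):
        spans.append((start, m.start()))
        start = m.end()
    spans.append((start, len(text)))
    return spans

# Slicing with the spans reproduces str.split:
s = "a,bb,,c"
print(split_spans(s))  # [(0, 1), (2, 4), (5, 5), (6, 7)]
assert [s[a:b] for a, b in split_spans(s)] == s.split(",")
```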
On Mon, Jun 20, 2022 at 5:04 PM Christopher Barker <pythonchb@gmail.com> wrote:
If you are working with bytes, then numpy could be perfect— not a small dependency of course, but it should work, and work fast.
And a cython method would be quite easy to write, but of course substantially harder to distribute :-(
-CHB
On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders <jonathan@slenders.be> wrote:
Thanks all for all the responses! That's quite a bit to think about.
A couple of thoughts:
1. First, I do support a transition to UTF-8, so I understand we don't want to add more methods that deal with character offsets. (I'm familiar with how strings work in Rust.) However, does that mean we won't be using/exposing any offset at all, or will it become possible to slice using byte offsets?
2. The commercial application I mentioned where this is critical actually uses bytes instead of str; sorry for not mentioning that earlier. We were doing the following, where text is a bytes object: list(accumulate(chain([0], map(len, text.splitlines(True))))) This is significantly faster than a binary regex for finding all universal line endings. The application is an asyncio web app that streams Cisco show-tech files (often several gigabytes) from a file server over HTTP; stores them chunk by chunk in a local cache file on disk; and in the meantime builds an index of byte offsets by running the above expression over every chunk. That way the client web app can quickly load lines from disk as the user scrolls through the file. A very niche application indeed, so use of Cython would be acceptable in this particular case. I published the relevant snippet here to be studied: https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868 It also handles an interesting edge case regarding UTF-16.
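For reference, here is that expression made runnable on a small invented bytes sample. splitlines(True) keeps the line endings, so the running sum of line lengths is the byte offset of each line start (plus the total length as the final entry):

```python
from itertools import accumulate, chain

# The expression from the post, on a tiny bytes chunk (sample data is
# invented). For bytes, splitlines() splits on \r, \n, and \r\n.
text = b"GET /a HTTP/1.1\r\nHost: x\nbody"
offsets = list(accumulate(chain([0], map(len, text.splitlines(True)))))
print(offsets)  # [0, 17, 25, 29]
```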
3. The code in prompt_toolkit can be found here: https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209 (It's not yet using 'accumulate' there, but otherwise it's the same.) Universal line-ending support is important here too, because the editing buffer can in theory contain a mix of line endings. It has to be performant, because it executes on every keystroke. A more complex data structure could probably solve the performance issues in this case, but it's really not worth the complexity it would introduce into every text manipulation (like every key binding). Also, try using the "re" library to search over a list of lines, or over anything that's not a simple string.
4. I tested on 3.11.0b3. The splitlines() approach is still 2.5 times faster than re. Imagine if splitlines() didn't have to do the work of actually creating the substrings, but only had to return the offsets: that should be much faster still, and wouldn't require so much memory. (I have a benchmark that does it one chunk at a time, to avoid using too much memory: https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf )
So, talking about bytes: would it be acceptable to have a `bytes.line_offsets()` method instead? Or `bytes.splitlines(return_offsets=True)`? Because byte offsets are okay, or not? `str.splitlines(return_offsets=True)` would be very nice, but I understand the concerns.
It's somewhat frustrating to know that for `splitlines()` the information is already there, already computed, just not accessible without having Python do lots of unnecessary work.
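A pure-Python sketch of the semantics such a method might have (the name line_offsets and the exact return convention are assumptions, not an agreed API): the start offset of each line, honoring the splitlines() line-boundary rules.

```python
from itertools import accumulate, chain

# Hypothetical reference semantics for a line_offsets() method: the
# offset of each line start. Works for both str and bytes, since both
# have splitlines(); keepends=True makes the running sum of lengths
# equal the start offset of each subsequent line. The final cumulative
# value (the total length) is dropped, matching splitlines(), which
# yields no empty line after a trailing terminator.
def line_offsets(data):
    lengths = map(len, data.splitlines(True))
    return list(accumulate(chain([0], lengths)))[:-1]

print(line_offsets("one\r\ntwo\rthree\nfour"))  # [0, 5, 9, 15]
```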
Jonathan
On Sun, Jun 19, 2022 at 15:34, Jonathan Fine <jfine2358@gmail.com> wrote:
Hi
This is a nice problem, well presented. Here are four comments / questions.
1. How does the introduction of the faster CPython in Python 3.11 affect the benchmarks?
2. Is there an across-the-board change that would speed up this line-offsets task?
3. To limit splitlines memory use (at a small performance cost), chunk the input string into, say, 4 kB blocks.
4. Perhaps anything done here for strings should also be done for bytes.
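A sketch of the chunking idea in point 3 (the function name and block size are my choices): process the input one block at a time so that only one block's worth of substrings is alive at once, holding back the last, possibly unterminated piece of each block so that a "\r\n" pair split across a block boundary is not miscounted.

```python
def chunked_line_offsets(data: bytes, block_size: int = 4096) -> list[int]:
    """Offsets of each line start, computed one block at a time."""
    offsets = [0] if data else []
    base = 0   # absolute offset of the last recorded line start
    buf = b""  # tail of the previous block: an unfinished line, possibly
               # ending in a bare b"\r" that may pair with a coming b"\n"
    for i in range(0, len(data), block_size):
        buf += data[i:i + block_size]
        lines = buf.splitlines(True)
        # Hold the final piece back; it may continue into the next block.
        buf = lines.pop()
        for line in lines:
            base += len(line)
            offsets.append(base)
    return offsets

print(chunked_line_offsets(b"one\r\ntwo\rthree\nfour", block_size=4))
# [0, 5, 9, 15]
```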