Re: [Python-ideas] Support parsing stream with `re`

Oct. 8, 2018

      On Sun, Oct 7, 2018 at 5:09 PM, Terry Reedy <tjreedy@udel.edu> wrote:
...
On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
...
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram@rachum.com> wrote:
...
I'd like to use the re module to parse a long text file, 1GB in size. I
wish
that the re module could parse a stream, so I wouldn't have to load the
whole thing into memory. I'd like to iterate over matches from the stream
without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data
coming in incrementally over the network, and we want to find the end
of the headers. To do this, we're looking for the first occurrence of
b"\r\n\r\n" OR b"\n\n".
So our requirements are:
1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
I believe that re is both overkill and slow for this particular problem.
For O(n), search forward for \n with str.index('\n') (or .find)
[I assume that this searches forward faster than
for i, c in enumerate(s):
   if c == '\n': break
and leave you to test this.]
If not found, continue with next chunk of data.
If found, look back for \r to determine whether to look forward for \n or
\r\n *whenever there is enough data to do so.
Are you imagining something roughly like this? (Ignoring chunk
boundary handling for the moment.)

def find_double_line_end(buf):
    start = 0
    while True:
        next_idx = buf.index(b"\n", start)
        if buf[next_idx - 1:next_idx + 1] == b"\n" or buf[next_idx -
3:next_idx] == b"\r\n\r":
            return next_idx
        start = next_idx + 1

That's much more complicated than using re.search, and on some random
HTTP headers I have lying around it benchmarks ~70% slower too. Which
makes sense, since we're basically trying to replicate re engine's
work by hand in a slower language.

BTW, if we only want to find a fixed string like b"\r\n\r\n", then
re.search and bytearray.index are almost identical in speed. If you
have a problem that can be expressed as a regular expression, then
regular expression engines are actually pretty good at solving those
:-)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org