[Python-ideas] Support parsing stream with `re`

Sun Oct 7 20:54:16 EDT 2018

On Sun, Oct 7, 2018 at 5:09 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
>>
>> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram at rachum.com> wrote:
>>>
>>> I'd like to use the re module to parse a long text file, 1GB in size. I
>>> wish
>>> that the re module could parse a stream, so I wouldn't have to load the
>>> whole thing into memory. I'd like to iterate over matches from the stream
>>> without keeping the old matches and input in RAM.
>>>
>>> What do you think?
>>
>>
>> This has frustrated me too.
>>
>> The case where I've encountered this is parsing HTTP/1.1. We have data
>> coming in incrementally over the network, and we want to find the end
>> of the headers. To do this, we're looking for the first occurrence of
>> b"\r\n\r\n" OR b"\n\n".
>>
>> So our requirements are:
>>
>> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
>
>
> I believe that re is both overkill and slow for this particular problem.
> For O(n), search forward for \n with str.index('\n') (or .find)
> [I assume that this searches forward faster than
> for i, c in enumerate(s):
>    if c == '\n': break
> and leave you to test this.]
>
> If not found, continue with next chunk of data.
> If found, look back for \r to determine whether to look forward for \n or
> \r\n *whenever there is enough data to do so.

Are you imagining something roughly like this? (Ignoring chunk
boundary handling for the moment.)

def find_double_line_end(buf):
    start = 0
    while True:
        next_idx = buf.index(b"\n", start)
        if buf[next_idx - 1:next_idx + 1] == b"\n" or buf[next_idx -
3:next_idx] == b"\r\n\r":
            return next_idx
        start = next_idx + 1

That's much more complicated than using re.search, and on some random
HTTP headers I have lying around it benchmarks ~70% slower too. Which
makes sense, since we're basically trying to replicate re engine's
work by hand in a slower language.

BTW, if we only want to find a fixed string like b"\r\n\r\n", then
re.search and bytearray.index are almost identical in speed. If you
have a problem that can be expressed as a regular expression, then
regular expression engines are actually pretty good at solving those
:-)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org