On Sun, Oct 7, 2018 at 5:09 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram@rachum.com> wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".
So our requirements are:
1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
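As a rough illustration of that requirement, here is a minimal sketch of searching a growing bytearray with a compiled pattern. The names HEADER_END and feed are hypothetical, not from any library. It assumes the only delimiters are the two listed above, so resuming three bytes before the previous end of the buffer is enough to catch a delimiter split across chunks.

```python
import re

# Blank line that ends an HTTP/1.1 header block (longest form is 4 bytes).
HEADER_END = re.compile(b"\r\n\r\n|\n\n")


def feed(buf, chunk):
    """Append chunk to buf; return the offset just past the headers, or None.

    Hypothetical helper: assumes every earlier call on this buf returned None,
    so any match must end inside the newly appended chunk.
    """
    # A delimiter is at most 4 bytes, so at most 3 of them can already be
    # sitting at the tail of the old data.
    resume = max(0, len(buf) - 3)
    buf.extend(chunk)
    match = HEADER_END.search(buf, resume)
    return None if match is None else match.end()
```

This avoids rescanning old data on every chunk, but the whole buffer still has to stay in memory until the match is found.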
I believe that re is both overkill and slow for this particular problem. For O(n), search forward for \n with str.index('\n') (or .find). [I assume that this searches forward faster than for i, c in enumerate(s): if c == '\n': break, and leave you to test this.]
If not found, continue with the next chunk of data. If found, look back for \r to determine whether to look forward for \n or \r\n *whenever there is enough data to do so*.
Are you imagining something roughly like this? (Ignoring chunk boundary handling for the moment.)

    def find_double_line_end(buf):
        start = 0
        while True:
            # bytearray.index raises ValueError once no b"\n" is left
            next_idx = buf.index(b"\n", start)
            if (buf[next_idx - 1:next_idx + 1] == b"\n\n"
                    or buf[next_idx - 3:next_idx] == b"\r\n\r"):
                return next_idx
            start = next_idx + 1

That's much more complicated than using re.search, and on some random HTTP headers I have lying around it benchmarks ~70% slower too. Which makes sense, since we're basically trying to replicate the re engine's work by hand in a slower language.

BTW, if we only want to find a fixed string like b"\r\n\r\n", then re.search and bytearray.index are almost identical in speed. If you have a problem that can be expressed as a regular expression, then regular expression engines are actually pretty good at solving those :-)

-n

--
Nathaniel J. Smith -- https://vorpus.org
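For comparison, the re.search counterpart of the hand-written loop might look roughly like the sketch below. The names _DOUBLE_LINE_END and find_double_line_end_re are hypothetical, and the sample headers in the timing harness are made up for illustration; they are not the headers used for the benchmark mentioned above. The function returns the index of the final newline byte so that it matches the return value of the loop version.

```python
import re
import timeit

_DOUBLE_LINE_END = re.compile(b"\r\n\r\n|\n\n")


def find_double_line_end_re(buf):
    """Return the index of the last newline byte of the first blank line.

    Raises ValueError when no blank line is present, mirroring the
    behaviour of bytearray.index in the hand-written version.
    """
    match = _DOUBLE_LINE_END.search(buf)
    if match is None:
        raise ValueError("no blank line found")
    return match.end() - 1


if __name__ == "__main__":
    # Made-up sample input; real timings depend on the data and the machine.
    sample = bytearray(
        b"GET / HTTP/1.1\r\nHost: example.com\r\nAccept: */*\r\n\r\n"
    )
    elapsed = timeit.timeit(lambda: find_double_line_end_re(sample), number=100_000)
    print(f"re.search: {elapsed:.3f}s for 100,000 calls")
```

Timing both versions on the same input with timeit is one way to check the speed comparison on your own data.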