On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum firstname.lastname@example.org wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".
So our requirements are:
- Search a bytearray for the regex b"\r\n\r\n|\n\n"
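For concreteness, here is a minimal sketch of that requirement as stated — the regex approach applied to an in-memory buffer (the sample `data` and variable names are mine, not from the thread):

```python
import re

# The pattern from the requirement above: end of HTTP/1.1 headers is the
# first occurrence of b"\r\n\r\n" OR b"\n\n".
HEADER_END = re.compile(b"\r\n\r\n|\n\n")

# Hypothetical buffer of bytes received so far.
data = bytearray(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\nbody")

m = HEADER_END.search(data)
if m:
    headers = bytes(data[:m.start()])  # everything before the blank line
    body_start = m.end()               # the body begins right after it
```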
I believe that re is both overkill and slow for this particular problem. For an O(n) scan, search forward for \n with bytes.index(b'\n') (or .find). [I assume that this searches forward faster than a Python-level loop like "for i, c in enumerate(s): if c == ord('\n'): break", and leave you to test this.]
If not found, continue with the next chunk of data. If found, look back one byte for \r to determine whether to look forward for \n or \r\n, *whenever there is enough data to do so*.
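A sketch of that \n-first strategy (the function name and details are my own): find each b"\n", peek back one byte for b"\r", then peek forward for the rest of the terminator. It returns -1 when more data is needed, mirroring the regex b"\r\n\r\n|\n\n".

```python
def find_header_end(buf: bytes) -> int:
    """Return the index just past the header terminator, or -1 if absent."""
    start = 0
    while True:
        i = buf.find(b"\n", start)
        if i == -1:
            return -1  # no newline yet; wait for more data
        # Preceded by \r and followed by \r\n -> full b"\r\n\r\n"
        if buf[i-1:i] == b"\r" and buf[i+1:i+3] == b"\r\n":
            return i + 3
        # Followed by another bare \n -> b"\n\n"
        if buf[i+1:i+2] == b"\n":
            return i + 2
        start = i + 1  # keep scanning from just past this newline
```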
- If there's no match yet, wait for more data to arrive and try again
- When more data arrives, start searching again *where the last
search left off*
s.index (and s.find) has an optional start parameter, so you can resume where the last search left off. And keep the chunks in a list until you have a match, then join them all at once.
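A minimal sketch of that incremental search (class and attribute names are my own; for simplicity it appends into one bytearray rather than keeping a separate chunk list):

```python
class StreamSearcher:
    """Incrementally search arriving chunks for a fixed byte pattern."""

    def __init__(self, needle: bytes):
        self.needle = needle
        self.buf = bytearray()
        self.start = 0  # where the last search left off

    def feed(self, chunk: bytes) -> int:
        """Add a chunk; return the match index, or -1 if none yet."""
        self.buf += chunk
        i = self.buf.find(self.needle, self.start)
        if i == -1:
            # Resume past the region that cannot contain a match, but back
            # up len(needle) - 1 bytes in case a match straddles chunks.
            self.start = max(0, len(self.buf) - len(self.needle) + 1)
            return -1
        return i
```

Note the back-up by len(needle) - 1 on a miss: without it, a terminator split across two chunks (e.g. b"\r\n" at the end of one and b"\r\n" at the start of the next) would never be found.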