On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith firstname.lastname@example.org wrote:
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum email@example.com wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".
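For concreteness, here's that terminator pattern in action (the buffer contents below are made up for illustration):

```python
import re

# The two accepted header terminators: CRLF CRLF, or a bare LF LF.
end_of_headers = re.compile(rb"\r\n\r\n|\n\n")

buf = bytearray(b"GET / HTTP/1.1\r\nHost: example.com\r\n")
assert end_of_headers.search(buf) is None  # headers aren't finished yet

buf += b"\r\n"  # the blank line arrives, terminating the headers
m = end_of_headers.search(buf)
assert m is not None and m.group() == b"\r\n\r\n"
```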
So our requirements are:
- Search a bytearray for the regex b"\r\n\r\n|\n\n"
- If there's no match yet, wait for more data to arrive and try again
- When more data arrives, start searching again *where the last search left off*
The last requirement is subtle, but important. The naive approach would be to rescan your whole receive buffer after each new packet arrives:
    end_of_headers = re.compile(b"\r\n\r\n|\n\n")
    while True:
        m = end_of_headers.search(receive_buffer)
        if m is None:
            receive_buffer += await get_more_data_from_network()
            # loop around and try again
        else:
            break
But the code above is quadratic! If the headers are N bytes long, then on each pass through the loop we perform an O(N) regex search, and we do O(N) passes through the loop, so the whole thing is O(N**2). That means your HTTP client or server can be trivially DoSed by a peer who sends their headers broken into lots of small fragments.
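One way to get back to linear time, as a rough sketch (the helper below is mine, not from any real HTTP library): the longest terminator is 4 bytes, so any match not already found can overlap at most the last 3 bytes of the previously scanned buffer, and each new search can resume from just before there:

```python
import re

end_of_headers = re.compile(rb"\r\n\r\n|\n\n")
MAX_TERMINATOR = 4  # length of the longest alternative, b"\r\n\r\n"

def find_end_of_headers(chunks):
    """Feed chunks one at a time; return the offset just past the
    header terminator, or None if the chunks run out first."""
    buf = bytearray()
    search_from = 0
    for chunk in chunks:
        buf += chunk
        m = end_of_headers.search(buf, search_from)
        if m is not None:
            return m.end()
        # A future match can only overlap the last few bytes we've
        # already scanned, so resume just before the buffer's end.
        search_from = max(0, len(buf) - (MAX_TERMINATOR - 1))
    return None

# Even headers arriving one byte at a time only cost O(1) per byte.
data = b"Host: example.com\r\n\r\nbody"
offset = find_end_of_headers(bytes([b]) for b in data)
assert data[:offset].endswith(b"\r\n\r\n")
```

Each byte of input is examined a bounded number of times, so the total work is O(N) instead of O(N**2).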
It's quadratic only in the size of the headers, though, so you could just cap it: if the receive buffer grows too large, reject the connection. Sometimes the simplest approach is the best.
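A sketch of that cap (the limit and the helper's name are illustrative, not from any spec or library):

```python
import re

end_of_headers = re.compile(rb"\r\n\r\n|\n\n")
MAX_HEADERS = 64 * 1024  # illustrative cap; tune per application

def feed(receive_buffer, new_data):
    """Append new_data and re-search; reject peers whose headers
    exceed the cap, which bounds the quadratic rescanning."""
    receive_buffer += new_data
    if len(receive_buffer) > MAX_HEADERS:
        raise ValueError("header section too large; rejecting peer")
    return end_of_headers.search(receive_buffer)
```

With the cap in place, the worst-case total work is bounded by the cap, not by peer-controlled input size.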