[Python-ideas] Support parsing stream with `re`
rosuav at gmail.com
Sat Oct 6 17:04:20 EDT 2018
On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <njs at pobox.com> wrote:
> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram at rachum.com> wrote:
> > I'd like to use the re module to parse a long text file, 1GB in size. I wish
> > that the re module could parse a stream, so I wouldn't have to load the
> > whole thing into memory. I'd like to iterate over matches from the stream
> > without keeping the old matches and input in RAM.
> > What do you think?
> This has frustrated me too.
> The case where I've encountered this is parsing HTTP/1.1. We have data
> coming in incrementally over the network, and we want to find the end
> of the headers. To do this, we're looking for the first occurrence of
> b"\r\n\r\n" OR b"\n\n".
> So our requirements are:
> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
> 2. If there's no match yet, wait for more data to arrive and try again
> 3. When more data arrives, start searching again *where the last
> search left off*
> The last requirement is subtle, but important. The naive approach
> would be to rescan your whole receive buffer after each new packet:
> end_of_headers = re.compile(b"\r\n\r\n|\n\n")
> while True:
>     m = end_of_headers.search(receive_buffer)
>     if m is None:
>         receive_buffer += await get_more_data_from_network()
>         # loop around and try again
> But the code above is quadratic! If the headers are N bytes long, then
> on each pass through the loop we perform an O(N) regex search, and we
> do O(N) passes through the loop, so the whole thing is O(N**2). That
> means your HTTP client-or-server can be trivially DoSed by a peer who
> sends their headers broken into lots of small fragments.
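The incremental search described above (requirement 3) can be sketched as follows. This is not code from the original mail; the helper name and the fragment-list interface are illustrative. The key point is that `Pattern.search` accepts a `pos` argument, so each new fragment only has to be scanned once, plus a few bytes of overlap in case the delimiter straddles a fragment boundary:

```python
import re

END_OF_HEADERS = re.compile(rb"\r\n\r\n|\n\n")

def find_end_of_headers(fragments):
    """Feed byte fragments one at a time; return the offset just past
    the first b"\r\n\r\n" or b"\n\n", or None if it never appears."""
    buffer = bytearray()
    search_start = 0
    for fragment in fragments:
        buffer += fragment
        # Resume from where the last search left off instead of
        # rescanning the whole buffer -- linear total work.
        m = END_OF_HEADERS.search(buffer, search_start)
        if m is not None:
            return m.end()
        # The longest delimiter is 4 bytes, so a match that straddles
        # the fragment boundary can begin at most 3 bytes before the
        # current end of the buffer.
        search_start = max(0, len(buffer) - 3)
    return None
```

With this approach each byte of the header block is examined only a constant number of times, so a peer who drip-feeds headers one byte at a time no longer forces O(N**2) work.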
Quadratic in the size of the headers only, so you could just cap it -
if the receive buffer is too large, just reject it. Sometimes, the
simplest approach is the best.
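The cap suggested above can be sketched like this (again illustrative, not from the mail; the limit value is a hypothetical choice, though real servers use caps of a similar order):

```python
import re

END_OF_HEADERS = re.compile(rb"\r\n\r\n|\n\n")
MAX_HEADER_SIZE = 65536  # hypothetical limit on total header bytes

def find_headers_capped(fragments):
    """Naive rescan made safe by rejecting oversized header blocks:
    the rescan is still quadratic, but only up to a fixed bound."""
    buffer = bytearray()
    for fragment in fragments:
        buffer += fragment
        if len(buffer) > MAX_HEADER_SIZE:
            raise ValueError("headers too large")
        m = END_OF_HEADERS.search(buffer)
        if m is not None:
            return m.end()
    return None
```

Since the buffer never exceeds the cap, the worst-case total work is bounded by a constant, which is enough to defuse the DoS concern without changing the search strategy.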