[Python-ideas] Support parsing stream with `re`

Chris Angelico rosuav at gmail.com
Sat Oct 6 17:04:20 EDT 2018


On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <njs at pobox.com> wrote:
>
> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram at rachum.com> wrote:
> > I'd like to use the re module to parse a long text file, 1GB in size. I wish
> > that the re module could parse a stream, so I wouldn't have to load the
> > whole thing into memory. I'd like to iterate over matches from the stream
> > without keeping the old matches and input in RAM.
> >
> > What do you think?
>
> This has frustrated me too.
>
> The case where I've encountered this is parsing HTTP/1.1. We have data
> coming in incrementally over the network, and we want to find the end
> of the headers. To do this, we're looking for the first occurrence of
> b"\r\n\r\n" OR b"\n\n".
>
> So our requirements are:
>
> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
> 2. If there's no match yet, wait for more data to arrive and try again
> 3. When more data arrives, start searching again *where the last
> search left off*
>
> The last requirement is subtle, but important. The naive approach
> would be to rescan your whole receive buffer after each new packet
> arrives:
>
> end_of_headers = re.compile(b"\r\n\r\n|\n\n")
> while True:
>     m = end_of_headers.search(receive_buffer)
>     if m is None:
>         receive_buffer += await get_more_data_from_network()
>         # loop around and try again
>     else:
>         break
>
> But the code above is quadratic! If the headers are N bytes long, then
> on each pass through the loop we perform an O(N) regex search, and we
> do O(N) passes through the loop, so the whole thing is O(N**2). That
> means your HTTP client-or-server can be trivially DoSed by a peer who
> sends their headers broken into lots of small fragments.
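[A minimal sketch, not from the thread, of the linear-time approach requirement 3 describes: resume each search where the last one left off instead of rescanning the whole buffer. The `HeaderScanner` class and its `feed` method are illustrative names, not an existing API; the key trick is `Pattern.search(buf, pos)` plus backing up a few bytes so a delimiter split across two chunks is still found.]

```python
import re

END_OF_HEADERS = re.compile(b"\r\n\r\n|\n\n")
MAX_MATCH_LEN = 4  # longest possible match for this pattern, b"\r\n\r\n"

class HeaderScanner:
    """Accumulate network data and search incrementally for the end of
    the HTTP headers, in O(N) total work instead of O(N**2)."""

    def __init__(self):
        self.buffer = bytearray()
        self.search_pos = 0  # everything before this offset is known match-free

    def feed(self, data):
        """Append newly received data; return a match object for the
        header terminator, or None if more data is needed."""
        self.buffer += data
        # Only scan from where the previous search stopped.
        m = END_OF_HEADERS.search(self.buffer, self.search_pos)
        if m is not None:
            return m
        # No match yet: remember where to resume.  Back up by
        # MAX_MATCH_LEN - 1 bytes so a terminator that straddles the
        # boundary between two chunks is still detected next time.
        self.search_pos = max(0, len(self.buffer) - (MAX_MATCH_LEN - 1))
        return None
```

Each byte of the headers is now scanned only a bounded number of times, so a peer who drip-feeds tiny fragments costs O(N) total rather than O(N**2).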

Quadratic in the size of the headers only, so you could just cap it -
if the receive buffer is too large, just reject it. Sometimes, the
simplest approach is the best.
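[A sketch of the cap Chris suggests, assuming a naive rescan loop like the one quoted above. `MAX_HEADER_SIZE` and `find_end_of_headers` are illustrative names; 64 KiB is an arbitrary example limit, not a value from any standard.]

```python
import re

END_OF_HEADERS = re.compile(b"\r\n\r\n|\n\n")
MAX_HEADER_SIZE = 64 * 1024  # arbitrary cap; pick whatever your server tolerates

def find_end_of_headers(receive_buffer):
    """Return a match for the header terminator, or None if more data
    is needed.  Raise if the headers exceed the cap, so the quadratic
    rescan can never be driven past a fixed amount of work."""
    m = END_OF_HEADERS.search(receive_buffer)
    if m is not None:
        return m
    if len(receive_buffer) > MAX_HEADER_SIZE:
        raise ValueError("headers too large; rejecting connection")
    return None  # wait for more data, then rescan
```

The rescan is still quadratic, but only up to the cap, so the worst case is a constant amount of work per connection.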

ChrisA

