[Python-ideas] Support parsing stream with `re`
rosuav at gmail.com
Sat Oct 6 18:56:59 EDT 2018
On Sun, Oct 7, 2018 at 9:54 AM Nathaniel Smith <njs at pobox.com> wrote:
> On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico <rosuav at gmail.com> wrote:
> > On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <njs at pobox.com> wrote:
> >> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram at rachum.com> wrote:
> >> > I'd like to use the re module to parse a long text file, 1GB in size. I wish
> >> > that the re module could parse a stream, so I wouldn't have to load the
> >> > whole thing into memory. I'd like to iterate over matches from the stream
> >> > without keeping the old matches and input in RAM.
> >> >
> >> > What do you think?
> >> This has frustrated me too.
> >> The case where I've encountered this is parsing HTTP/1.1. We have data
> >> coming in incrementally over the network, and we want to find the end
> >> of the headers. To do this, we're looking for the first occurrence of
> >> b"\r\n\r\n" OR b"\n\n".
> >> So our requirements are:
> >> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
> >> 2. If there's no match yet, wait for more data to arrive and try again
> >> 3. When more data arrives, start searching again *where the last
> >> search left off*
> >> The last requirement is subtle, but important. The naive approach
> >> would be to rescan your whole receive buffer after each new packet
> >> arrives:
> >> end_of_headers = re.compile(b"\r\n\r\n|\n\n")
> >> while True:
> >> m = end_of_headers.search(receive_buffer)
> >> if m is None:
> >> receive_buffer += await get_more_data_from_network()
> >> # loop around and try again
> >> else:
> >> break
> >> But the code above is quadratic! If the headers are N bytes long, then
> >> on each pass through the loop we perform an O(N) regex search, and we
> >> do O(N) passes through the loop, so the whole thing is O(N**2). That
> >> means your HTTP client-or-server can be trivially DoSed by a peer who
> >> sends their headers broken into lots of small fragments.
> > Quadratic in the size of the headers only, so you could just cap it -
> > if the receive buffer is too large, just reject it. Sometimes, the
> > simplest approach is the best.
> But OTOH, every problem has a solution that's simple, obvious, and wrong :-).
> Of course you set a cap on the header size, to prevent other kinds of
> DoS (e.g. memory exhaustion). But it turns out people stuff a lot of
> data into HTTP headers , so if the cap is large enough to support
> non-malicious usage, then it's also large enough to let people DoS the
> naive O(N**2) algorithm. Production-quality HTTP/1.1 parsers really do
> have to use an O(N) algorithm here.
Fair enough! You could instead use a simple linear parser rather than
searching for end of headers first. That way, all you're looking for
is a \n, no regex needed. Sometimes the VERY simplest solution doesn't
work, but I still do like to keep things as simple as possible :)
More information about the Python-ideas