[Python-ideas] Support parsing stream with `re`

Sat Oct 6 18:54:30 EDT 2018

On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico <rosuav at gmail.com> wrote:
> On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <njs at pobox.com> wrote:
>>
>> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram at rachum.com> wrote:
>> > I'd like to use the re module to parse a long text file, 1GB in size. I wish
>> > that the re module could parse a stream, so I wouldn't have to load the
>> > whole thing into memory. I'd like to iterate over matches from the stream
>> > without keeping the old matches and input in RAM.
>> >
>> > What do you think?
>>
>> This has frustrated me too.
>>
>> The case where I've encountered this is parsing HTTP/1.1. We have data
>> coming in incrementally over the network, and we want to find the end
>> of the headers. To do this, we're looking for the first occurrence of
>> b"\r\n\r\n" OR b"\n\n".
>>
>> So our requirements are:
>>
>> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
>> 2. If there's no match yet, wait for more data to arrive and try again
>> 3. When more data arrives, start searching again *where the last
>> search left off*
>>
>> The last requirement is subtle, but important. The naive approach
>> would be to rescan your whole receive buffer after each new packet
>> arrives:
>>
>> end_of_headers = re.compile(b"\r\n\r\n|\n\n")
>> while True:
>>     m = end_of_headers.search(receive_buffer)
>>     if m is None:
>>         receive_buffer += await get_more_data_from_network()
>>         # loop around and try again
>>     else:
>>         break
>>
>> But the code above is quadratic! If the headers are N bytes long, then
>> on each pass through the loop we perform an O(N) regex search, and we
>> do O(N) passes through the loop, so the whole thing is O(N**2). That
>> means your HTTP client-or-server can be trivially DoSed by a peer who
>> sends their headers broken into lots of small fragments.
>
> Quadratic in the size of the headers only, so you could just cap it -
> if the receive buffer is too large, just reject it. Sometimes, the
> simplest approach is the best.

But OTOH, every problem has a solution that's simple, obvious, and wrong :-).

Of course you set a cap on the header size, to prevent other kinds of
DoS (e.g. memory exhaustion). But it turns out people stuff a lot of
data into HTTP headers [1], so if the cap is large enough to support
non-malicious usage, then it's also large enough to let people DoS the
naive O(N**2) algorithm. Production-quality HTTP/1.1 parsers really do
have to use an O(N) algorithm here.

And similarly, if you're building a generic helper library for people
implementing arbitrary unknown protocols, then you can't assume their
protocols were designed to use small frames only, to avoid hitting
arbitrary limitations in Python's re module.

-n

[1] E.g., 16 KiB total header size is already enough that on my
laptop, the naive O(N**2) algorithm takes ~750 ms CPU time, versus ~16
ms for the O(N) algorithm. HTTP/2 tried hard to simplify their header
encoding scheme by putting limits on header size, but got so much
push-back that they were eventually forced to add special hacks to
allow for arbitrarily large headers – in particular, it turns out that
people use *individual cookies* that are larger than 16 KiB:
https://http2.github.io/faq/#why-the-rules-around-continuation-on-headers-frames

-- 
Nathaniel J. Smith -- https://vorpus.org