[Python-ideas] Support parsing stream with `re`
Nathaniel Smith
njs at pobox.com
Sun Oct 7 21:40:47 EDT 2018
On Sun, Oct 7, 2018 at 5:54 PM, Nathaniel Smith <njs at pobox.com> wrote:
> Are you imagining something roughly like this? (Ignoring chunk
> boundary handling for the moment.)
>
> def find_double_line_end(buf):
>     start = 0
>     while True:
>         next_idx = buf.index(b"\n", start)
>         if (buf[next_idx - 1:next_idx + 1] == b"\n\n"
>                 or buf[next_idx - 3:next_idx] == b"\r\n\r"):
>             return next_idx
>         start = next_idx + 1
>
> That's much more complicated than using re.search, and on some random
> HTTP headers I have lying around it benchmarks ~70% slower too. Which
> makes sense, since we're basically trying to replicate the re engine's
> work by hand in a slower language.
>
> BTW, if we only want to find a fixed string like b"\r\n\r\n", then
> re.search and bytearray.index are almost identical in speed. If you
> have a problem that can be expressed as a regular expression, then
> regular expression engines are actually pretty good at solving those
> :-)
Though... here's something strange.
Here's another way to search for the first appearance of either
\r\n\r\n or \n\n in a bytearray:
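def find_double_line_end_2(buf):
    idx1 = buf.find(b"\r\n\r\n")
    # Only look for b"\n\n" before idx1. If idx1 is -1, search the whole
    # buffer: passing -1 as the end bound would exclude the final byte.
    idx2 = buf.find(b"\n\n", 0, idx1 if idx1 != -1 else len(buf))
    if idx1 == -1:
        return idx2
    elif idx2 == -1:
        return idx1
    else:
        return min(idx1, idx2)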
So this is essentially equivalent to our regex (notice they both pick
out position 505 as the end of the headers):
In [52]: find_double_line_end_2(sample_headers)
Out[52]: 505
In [53]: double_line_end_re = re.compile(b"\r\n\r\n|\n\n")
In [54]: double_line_end_re.search(sample_headers)
Out[54]: <_sre.SRE_Match object; span=(505, 509), match=b'\r\n\r\n'>
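
For anyone who wants to check the equivalence on more than one input, here is a rough sketch. The sample buffers are made up for illustration (they are not the sample_headers used above), and the helper names are hypothetical. It assumes the find-based version should search the whole buffer when b"\r\n\r\n" is absent.

```python
import re

DOUBLE_LINE_END_RE = re.compile(b"\r\n\r\n|\n\n")


def find_with_re(buf):
    """Return the start of the first blank-line separator, or -1."""
    match = DOUBLE_LINE_END_RE.search(buf)
    return match.start() if match else -1


def find_with_two_finds(buf):
    """Same search using two find() calls instead of the regex engine."""
    idx1 = buf.find(b"\r\n\r\n")
    # Assumption: when idx1 is -1 we search the whole buffer rather than
    # passing -1 as the end bound, which would skip the last byte.
    end = idx1 if idx1 != -1 else len(buf)
    idx2 = buf.find(b"\n\n", 0, end)
    if idx1 == -1:
        return idx2
    if idx2 == -1:
        return idx1
    return min(idx1, idx2)


# Hypothetical inputs covering CRLF, bare LF, no terminator, and a
# terminator at the very end of the buffer.
SAMPLES = [
    bytearray(b"Host: example.com\r\nAccept: */*\r\n\r\nbody"),
    bytearray(b"Host: example.com\nAccept: */*\n\nbody"),
    bytearray(b"no terminator here\r\n"),
    bytearray(b"ends with blank line\n\n"),
]

results = [(find_with_re(s), find_with_two_finds(s)) for s in SAMPLES]
for sample, (via_re, via_find) in zip(SAMPLES, results):
    print(bytes(sample[:20]), via_re, via_find)
```

The last sample is the interesting one: a b"\n\n" that sits at the very end of the buffer is only found if the second find() is allowed to look at the final byte.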
But the Python function that calls bytearray.find twice is about 3x
faster than the re module:
In [55]: %timeit find_double_line_end_2(sample_headers)
1.18 µs ± 40 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [56]: %timeit double_line_end_re.search(sample_headers)
3.3 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The regex module is even slower:
In [57]: double_line_end_regex = regex.compile(b"\r\n\r\n|\n\n")
In [58]: %timeit double_line_end_regex.search(sample_headers)
4.95 µs ± 76.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
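
The timings above come from IPython's %timeit. A minimal sketch of a similar comparison using only the standard library follows, in case someone wants to try it on their own data. The sample buffer is a hypothetical stand-in for sample_headers, the helper names are invented here, and the numbers will vary by machine and Python version, so treat this as a way to measure and not as a result.

```python
import re
import timeit

# Hypothetical stand-in for sample_headers: 30 header lines, then a blank line.
sample = bytearray(b"X-Header: value\r\n" * 30 + b"\r\n" + b"body")

pattern = re.compile(b"\r\n\r\n|\n\n")


def two_finds(buf):
    """Find the first b'\\r\\n\\r\\n' or b'\\n\\n' using two find() calls."""
    idx1 = buf.find(b"\r\n\r\n")
    # Assumption: search the whole buffer when b"\r\n\r\n" is absent.
    end = idx1 if idx1 != -1 else len(buf)
    idx2 = buf.find(b"\n\n", 0, end)
    if idx1 == -1:
        return idx2
    if idx2 == -1:
        return idx1
    return min(idx1, idx2)


def bench(func, number=20_000, repeat=3):
    """Return the best observed per-call time in microseconds."""
    runs = timeit.repeat(lambda: func(sample), number=number, repeat=repeat)
    return min(runs) / number * 1e6


timings = {
    "two finds": bench(two_finds),
    "re.search": bench(pattern.search),
}
for name, usec in timings.items():
    print(f"{name}: {usec:.2f} us per call")
```

Note that the lambda wrapper adds the same call overhead to both measurements, so the ratio will look a bit smaller than what %timeit reports for the bare calls.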
-n
--
Nathaniel J. Smith -- https://vorpus.org