[Python-ideas] Support parsing stream with `re`

Steven D'Aprano steve at pearwood.info
Sat Oct 6 21:39:48 EDT 2018

On Sat, Oct 06, 2018 at 02:00:27PM -0700, Nathaniel Smith wrote:

> Fortunately, there's an elegant and natural solution: Just save the
> regex engine's internal state when it hits the end of the string, and
> then when more data arrives, use the saved state to pick up the search
> where we left off. Theoretically, any regex engine *could* support
> this – it's especially obvious for DFA-based matchers, but even
> backtrackers like Python's re could support it, basically by making
> the matching engine a coroutine that can suspend itself when it hits
> the end of the input, then resume it when new input arrives. Like, if
> you asked Knuth for the theoretically optimal design for this parser,
> I'm pretty sure this is what he'd tell you to use, and it's what
> people do when writing high-performance HTTP parsers in C.
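The coroutine design described above can be sketched in a few lines of plain Python. This is a toy, not anything from `re`'s internals: a generator that matches a literal pattern character by character, suspending (via `yield`) whenever the current chunk runs out, so the generator frame itself is the saved engine state:

```python
# Sketch of "suspend the engine at end of input, resume on new data".
# `literal_matcher` is a made-up helper name; it only handles a literal
# pattern, but the suspend/resume mechanics are the point.

def literal_matcher(pattern):
    """Yield None while more input is needed; yield True/False on a result.

    Matches `pattern` as a literal prefix of the stream, suspending
    whenever the current chunk is exhausted.
    """
    buffer = ""
    for expected in pattern:
        while not buffer:
            chunk = yield None      # suspend here until more data arrives
            buffer += chunk
        ch, buffer = buffer[0], buffer[1:]
        if ch != expected:
            yield False             # definite mismatch
            return
    yield True                      # full match, possibly across chunks

m = literal_matcher("abc")
next(m)                  # prime the generator; it suspends awaiting input
print(m.send("a"))       # None: need more data
print(m.send("bc"))      # True: the match completed across two chunks
```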

The message I take from this is:

- regex engines certainly can be written to support streaming data;
- but few of them are;
- and it is exceedingly unlikely that such support could be retrofitted 
  easily (or at all) to Python's existing re module.

Perhaps the solution is a lightweight streaming DFA regex parser?

Does anyone know whether MRAB's regex library supports this?
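For a fixed pattern, such a streaming DFA is easy enough to sketch by hand. The class name and transition table below are mine (a hand-written DFA equivalent to the pattern "ab+c"); the only persistent engine state is a single integer, which survives between chunks exactly as described above:

```python
# Minimal sketch of a streaming DFA matcher; not MRAB's regex API.
# On a mismatch it simply resets to the start state rather than
# restarting the search at every position, which keeps the toy short.

class StreamingDFA:
    """Feed chunks of text; the match state survives between chunks."""

    # transition table: state -> {char: next_state}, hand-built for "ab+c"
    TRANSITIONS = {
        0: {"a": 1},
        1: {"b": 2},
        2: {"b": 2, "c": 3},
    }
    ACCEPT = 3

    def __init__(self):
        # this integer is the whole "saved internal state" of the engine
        self.state = 0

    def feed(self, chunk):
        """Consume one chunk; return True once the pattern has matched."""
        for ch in chunk:
            if self.state == self.ACCEPT:
                return True
            self.state = self.TRANSITIONS[self.state].get(ch, 0)
        return self.state == self.ACCEPT


dfa = StreamingDFA()
print(dfa.feed("ab"))    # False: could still match given more input
print(dfa.feed("bb"))    # False: still inside the b+ loop
print(dfa.feed("c!"))    # True: "abbbc" has now been seen across chunks
```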


> you can't write efficient
> character-by-character algorithms in Python

I'm sure that Python will never be as efficient as C in that regard 
(although PyPy might argue the point), but is there something we can do 
to ameliorate this? If we could make char-by-char processing only 10 
times less efficient than C instead of 100 times (let's say...), perhaps 
that would help Ram (and you?) with your use cases?
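To make the cost concrete, here is a toy comparison (the function names are mine, not from the thread): the pure-Python loop pays interpreter overhead on every character, while the equivalent scan delegated to a str method runs its inner loop in C.

```python
# The same scan written char-by-char in pure Python versus delegated
# to a C-level str method; both give the same answer, but the second
# avoids per-character bytecode dispatch.

def count_newlines_chars(text):
    # one Python-level iteration per character
    n = 0
    for ch in text:
        if ch == "\n":
            n += 1
    return n

def count_newlines_builtin(text):
    # the loop runs in C inside str.count
    return text.count("\n")

data = "line\n" * 1000
assert count_newlines_chars(data) == count_newlines_builtin(data) == 1000
```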

