[Python-ideas] Support parsing stream with `re`

Steven D'Aprano steve at pearwood.info
Sat Oct 6 21:39:48 EDT 2018


On Sat, Oct 06, 2018 at 02:00:27PM -0700, Nathaniel Smith wrote:

> Fortunately, there's an elegant and natural solution: Just save the
> regex engine's internal state when it hits the end of the string, and
> then when more data arrives, use the saved state to pick up the search
> where we left off. Theoretically, any regex engine *could* support
> this – it's especially obvious for DFA-based matchers, but even
> backtrackers like Python's re could support it, basically by making
> the matching engine a coroutine that can suspend itself when it hits
> the end of the input, then resume it when new input arrives. Like, if
> you asked Knuth for the theoretically optimal design for this parser,
> I'm pretty sure this is what he'd tell you to use, and it's what
> people do when writing high-performance HTTP parsers in C.

The message I take from this is:

- regex engines certainly can be written to support streaming data;
- but few of them are;
- and it is exceedingly unlikely that such support could be easily
  (or at all) retrofitted to Python's existing re module.

Perhaps the solution is a lightweight streaming DFA regex parser?
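
To make "streaming" concrete, here is a minimal sketch, assuming a
hand-built DFA for the single fixed pattern b"\r\n\r\n" (the end of an
HTTP header block) rather than a general regex compiler. The class name
StreamingDFA and its feed() method are made up for this illustration and
are not part of re or any existing library. The point is only that the
matcher's entire state is one small integer, so it can stop at the end
of a chunk and resume when the next chunk arrives:

```python
class StreamingDFA:
    """Hypothetical chunk-by-chunk matcher for the fixed pattern b"\\r\\n\\r\\n"."""

    PATTERN = b"\r\n\r\n"

    def __init__(self):
        self.state = 0      # number of pattern bytes matched so far
        self.consumed = 0   # total bytes seen across all chunks

    def feed(self, chunk):
        """Consume one chunk of bytes.

        Return the absolute offset just past the first match found in
        this chunk, or None if more input is needed. Bytes after a
        match in the same chunk are not examined.
        """
        for byte in chunk:
            self.consumed += 1
            if byte == self.PATTERN[self.state]:
                self.state += 1
            elif byte == self.PATTERN[0]:
                # Fallback derived by hand for this pattern only:
                # a stray "\r" restarts the match at state 1.
                self.state = 1
            else:
                self.state = 0
            if self.state == len(self.PATTERN):
                self.state = 0  # ready to search again
                return self.consumed
        return None


matcher = StreamingDFA()
chunks = [b"GET / HTTP/1.1\r\nHost: x\r", b"\n\r", b"\nbody"]
results = [matcher.feed(c) for c in chunks]
```

Here the terminator is split across all three chunks, and only the
third feed() call reports a match. A general streaming regex engine
would need a compiled transition table in place of the hand-written
fallback rule, but the save-and-resume idea is the same.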

Does anyone know whether MRAB's regex library supports this?

https://pypi.org/project/regex/


> you can't write efficient
> character-by-character algorithms in Python

I'm sure that Python will never be as efficient as C in that regard
(although PyPy might argue the point), but is there something we can do
to ameliorate this? If we could make char-by-char processing only 10
times slower than C instead of 100 times (let's say...), perhaps
that would help Ram (and you?) with your use-cases?


-- 
Steve

