On Sat, Oct 06, 2018 at 02:00:27PM -0700, Nathaniel Smith wrote:
> Fortunately, there's an elegant and natural solution: Just save the
> regex engine's internal state when it hits the end of the string, and
> then when more data arrives, use the saved state to pick up the search
> where we left off. Theoretically, any regex engine *could* support
> this – it's especially obvious for DFA-based matchers, but even
> backtrackers like Python's re could support it, basically by making
> the matching engine a coroutine that can suspend itself when it hits
> the end of the input, then resume it when new input arrives. Like, if
> you asked Knuth for the theoretically optimal design for this parser,
> I'm pretty sure this is what he'd tell you to use, and it's what
> people do when writing high-performance HTTP parsers in C.
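(A minimal sketch of the save-state idea described above, with a
hand-built DFA for the toy pattern ab*c standing in for a real regex
compiler -- the table and state numbers here are purely illustrative:)

```python
# A hand-built DFA for the toy pattern "ab*c", anchored at the start of
# the input.  A real engine would compile this table from the regex.
#   state 0: start; state 1: seen "a" plus any "b"s; state 2: accept.
TABLE = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}

class StreamingMatcher:
    """A matcher whose only cross-call state is the current DFA state,
    so the search can be suspended at end-of-chunk and resumed when
    more data arrives."""

    def __init__(self, table, accept):
        self.table = table
        self.accept = accept
        self.state = 0          # saved between feed() calls

    def feed(self, chunk):
        """Consume one chunk.  Return True once an accepting state is
        reached, False if the match failed or just needs more input."""
        for ch in chunk:
            if self.state is None:        # dead state: can never match
                return False
            if self.state in self.accept:
                return True
            self.state = self.table.get((self.state, ch))
        return self.state in self.accept

m = StreamingMatcher(TABLE, accept={2})
print(m.feed("ab"))    # False -- no match yet, but the state is saved
print(m.feed("bbc"))   # True  -- "abbbc" matched across two chunks
```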
The message I take from this is:

- regex engines certainly can be written to support streaming data;
- but few of them are;
- and it is exceedingly unlikely that such support could be retrofitted
  easily (or at all) onto Python's existing re module.

Perhaps the solution is a lightweight streaming DFA regex parser? Does
anyone know whether MRAB's regex library supports this?

https://pypi.org/project/regex/
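For what it's worth, the regex module does appear to expose something
along these lines: its matching functions accept a partial=True flag,
and the returned match object has a .partial attribute that is true
when the input so far is a prefix of a possible match. A small sketch,
assuming the third-party regex package is installed:

```python
import regex  # third-party: pip install regex

# A pattern for a simple decimal number.
pattern = regex.compile(r"\d+\.\d+")

# Complete data: an ordinary, non-partial match.
m = pattern.match("3.14")
print(m.partial)   # False -- a full match

# Truncated data: with partial=True we are told that more input could
# still complete the match, instead of just getting a failure.
m = pattern.match("3.", partial=True)
print(m.partial)   # True -- keep buffering and try again
```

That only tells you to keep buffering and re-scan from the start, so it
is not the save-the-engine-state design sketched above, but it does
cover the "don't declare failure at end of buffer" part of the problem.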
> you can't write efficient character-by-character algorithms in Python
I'm sure that Python will never be as efficient as C in that regard
(although PyPy might argue the point), but is there something we can do
to ameliorate this? If we could make char-by-char processing only 10
times less efficient than C instead of 100 times (say), perhaps that
would help Ram (and you?) with your use-cases?

-- 
Steve
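(One stopgap that works today is to keep the per-character loop out of
Python entirely and lean on C-speed primitives like bytes.find and
slicing over buffered chunks. A minimal sketch -- the helper name is
mine, purely illustrative:)

```python
def iter_records(chunks, delim=b"\n"):
    """Split a stream of byte chunks on a delimiter without touching
    individual characters in Python: bytes.find, concatenation and
    slicing all run at C speed."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        start = 0
        while True:
            i = buf.find(delim, start)
            if i < 0:
                break
            yield buf[start:i]
            start = i + len(delim)
        buf = buf[start:]          # keep only the unfinished tail
    if buf:
        yield buf

# Chunks arriving in arbitrary sizes still yield whole records.
print(list(iter_records([b"foo\nba", b"r\nbaz"])))
# [b'foo', b'bar', b'baz']
```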