"This is a regular expression problem, rather than a Python problem."
Do you have evidence for this assertion, except that other regex implementations have this limitation? Is there a regex specification somewhere that specifies that streams aren't supported? Is there a fundamental reason that streams aren't supported?
"Can the lexing be done on a line-by-line basis?"
For my use case, it unfortunately can't.
On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine email@example.com wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
This is a regular expression problem, rather than a Python problem. A search for regular expression large file brings up some URLs that might help you, starting with
This might also be helpful https://svn.boost.org/trac10/ticket/11776
What will work for your problem depends on the nature of the problem you have. The simplest thing that might work is to iterate over the file line-by-line, and use a regular expression to extract matches from each line.
In other words, something like (not tested)
    import re

    def helper(lines):
        for line in lines:
            # 'pattern' is whatever regular expression you are searching for
            yield from re.finditer(pattern, line)

    lines = open('my-big-file.txt')
    for match in helper(lines):
        # Do your stuff here
        pass
Parsing is not the same as lexing, see https://en.wikipedia.org/wiki/Lexical_analysis
I suggest you use regular expressions ONLY for the lexing phase. If you'd like further help, perhaps first ask yourself this. Can the lexing be done on a line-by-line basis? And if not, why not?
If line-by-line is not possible, then you'll have to modify the helper. At the end of each line, there'll be a residue / remainder, which you'll have to bring into the next line. In other words, the helper will have to record (and change) the state that exists at the end of each line. A bit like the 'carry' that is used when doing long addition.
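A rough sketch of such a carrying helper might look like the following. It is only an illustration under some assumptions, not a general solution: the name iter_matches and the tag-like sample pattern are made up for the example, the pattern must never match the empty string, and any match that ends before the end of the buffer is assumed to be the same match you would get from the whole input (true for the sample pattern, not true for every pattern). It yields matched strings rather than match objects, because positions are relative to the buffer and not to the file.

```python
import io
import re


def iter_matches(lines, pattern):
    """Yield matched strings from an iterable of lines, carrying a residue forward."""
    regex = re.compile(pattern)
    carry = ""  # text seen so far that is not yet part of a confirmed match
    for line in lines:
        buffer = carry + line
        consumed = 0
        for match in regex.finditer(buffer):
            if match.end() == len(buffer):
                # The match touches the end of the buffer, so the next line
                # might extend it. Hold it back in the carry instead.
                break
            yield match.group()
            consumed = match.end()
        # Everything after the last confirmed match is the residue.
        carry = buffer[consumed:]
    # No more input: whatever still matches in the residue is final.
    for match in regex.finditer(carry):
        yield match.group()


# Hypothetical input: "tags" that may span a line break.
sample = io.StringIO("foo <a\nb> bar <c>\nbaz <d")
tags = list(iter_matches(sample, r"<[^<>]*>"))
```

Note that the carry holds everything after the last confirmed match, so memory use is bounded by the longest stretch of input between matches, not by a fixed size. An input with no matches at all would still end up fully buffered.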
I hope this helps.