[Python-ideas] Support parsing stream with `re`
jfine2358 at gmail.com
Sat Oct 6 06:53:15 EDT 2018
> I'd like to use the re module to parse a long text file, 1GB in size. I
> wish that the re module could parse a stream, so I wouldn't have to load
> the whole thing into memory. I'd like to iterate over matches from the
> stream without keeping the old matches and input in RAM.
This is a regular expression problem, rather than a Python problem. A search for
regular expression large file
brings up some URLs that might help you, starting with
This might also be helpful
What will work for your problem depends on the nature of the problem
you have. The simplest thing that might work is to iterate over the file
line-by-line, and use a regular expression to extract matches from
each line.
In other words, something like (not tested)
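    import re

    def helper(lines):
        # `pattern` is your regular expression, defined elsewhere.
        for line in lines:
            yield from re.finditer(pattern, line)

    with open('my-big-file.txt') as lines:
        for match in helper(lines):
            pass  # Do your stuff here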
Parsing is not the same as lexing, see
I suggest you use regular expressions ONLY for the lexing phase. If
you'd like further help, perhaps first ask yourself this. Can the
lexing be done on a line-by-line basis? And if not, why not?
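To make the lexing suggestion concrete, here is a minimal sketch of a line-by-line lexer. The token names (NUMBER, NAME, OP) and the function name `lex` are made up for illustration; your own token set will differ.

```python
import re

# Hypothetical token set, chosen only for illustration.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[-+*/=])"
)


def lex(lines):
    """Yield (kind, text) pairs, lexing one line at a time."""
    for line in lines:
        for match in TOKEN_RE.finditer(line):
            # lastgroup is the name of the alternative that matched.
            yield match.lastgroup, match.group()
```

A separate parsing step would then consume these (kind, text) pairs, so only one line of input is held in memory at a time.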
If line-by-line is not possible, then you'll have to modify the helper.
At the end of each line, there'll be a residue / remainder, which
you'll have to bring into the next line. In other words, the helper
will have to record (and change) the state that exists at the end of
each line. A bit like the 'carry' that is used when doing long addition.
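One possible shape for such a helper is sketched below (lightly checked, and resting on some assumptions). The name `helper_with_carry` and its `pattern` argument are made up for illustration. It assumes that matches are never empty, that a match ending before the end of the buffer would not change if more text were appended, and that the pattern does not depend on text before the match (no anchors or lookbehind). Any match that touches the end of the buffer is held back and retried once the next line arrives.

```python
import re


def helper_with_carry(lines, pattern):
    """Yield matches from an iterable of text pieces, carrying a residue."""
    regex = re.compile(pattern)
    carry = ""  # unconsumed text left over from earlier lines
    for line in lines:
        buffer = carry + line
        consumed = 0  # end of the last match that was safe to emit
        for match in regex.finditer(buffer):
            if match.end() == len(buffer):
                # Touches the end of the buffer, so it might continue
                # into the next line; hold it back.
                break
            yield match
            consumed = match.end()
        # Everything after the last emitted match is the residue.
        carry = buffer[consumed:]
    # End of input: the residue can no longer be extended.
    yield from regex.finditer(carry)
```

Note that match positions are relative to the internal buffer, not to the file. Also, the residue grows while no match is found, so a long stretch of input without matches still has to be held in memory.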
I hope this helps.