[Python-ideas] Support parsing stream with `re`

Jonathan Fine jfine2358 at gmail.com
Sat Oct 6 06:53:15 EDT 2018


Hi Ram

You wrote:

> I'd like to use the re module to parse a long text file, 1GB in size. I
> wish that the re module could parse a stream, so I wouldn't have to load
> the whole thing into memory. I'd like to iterate over matches from the
> stream without keeping the old matches and input in RAM.

This is a regular expression problem, rather than a Python problem. A search for
    regular expression large file
brings up some URLs that might help you, starting with
https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow

This might also be helpful
https://svn.boost.org/trac10/ticket/11776

What will work best depends on the nature of your problem. The
simplest thing that might work is to iterate over the file
line by line, and use a regular expression to extract matches from
each line.

In other words, something like (not tested)

    import re

    pattern = re.compile(r'\w+')  # for example; substitute your own pattern

    def helper(lines):
        # Yield matches one line at a time, so that old matches and
        # old input lines can be garbage collected.
        for line in lines:
            yield from re.finditer(pattern, line)

    with open('my-big-file.txt') as lines:
        for match in helper(lines):
            ...  # Do your stuff here

Parsing is not the same as lexing; see
https://en.wikipedia.org/wiki/Lexical_analysis

I suggest you use regular expressions ONLY for the lexing phase. If
you'd like further help, perhaps first ask yourself this: can the
lexing be done on a line-by-line basis? And if not, why not?
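
If it can, a simple lexer built on re might look something like this
(a minimal sketch; the token names and patterns are made up purely
for illustration):

    import re

    # Hypothetical token specification, purely for illustration.
    token_pattern = re.compile(
        r'(?P<NUMBER>\d+)'
        r'|(?P<NAME>[A-Za-z_]\w*)'
        r'|(?P<OP>[-+*/=])'
    )

    def lex(line):
        # lastgroup names the alternative that matched.
        for match in token_pattern.finditer(line):
            yield match.lastgroup, match.group()

    for kind, text in lex('x = 1 + 2'):
        print(kind, text)  # NAME x, OP =, NUMBER 1, ...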

If line-by-line is not possible, then you'll have to modify the
helper. At the end of each line there'll be a residue / remainder,
which you'll have to carry into the next line. In other words, the
helper will have to record (and update) the state that exists at the
end of each line, a bit like the 'carry' that is used when doing
long addition.
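
Here's a minimal sketch of that idea (again not tested; the details
will depend on your pattern):

    import re

    def helper(pattern, lines):
        # Carry unmatched text (the residue) from the end of each line
        # into the next, so tokens that span a line boundary are found.
        residue = ''
        for line in lines:
            buffer = residue + line
            last_end = 0
            for match in re.finditer(pattern, buffer):
                if match.end() == len(buffer):
                    # This match touches the end of the buffer, so it
                    # might continue into the next line: hold it back.
                    break
                yield match
                last_end = match.end()
            residue = buffer[last_end:]
        # The input is exhausted, so the residue cannot grow further.
        yield from re.finditer(pattern, residue)

Note that the residue keeps everything after the last match yielded,
so it can grow when matches are sparse; a real implementation might
want to cap it.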

I hope this helps.

-- 
Jonathan

