<div dir="ltr">"This is a regular expression problem, rather than a Python problem."<div><br></div><div>Do you have evidence for this assertion, except that other regex implementations have this limitation? Is there a regex specification somewhere that specifies that streams aren't supported? Is there a fundamental reason that streams aren't supported?</div><div><br></div><div><br></div><div>"Can the lexing be done on a line-by-line basis?"<br></div><div><br></div><div>For my use case, it unfortunately can't.</div></div><br><div class="gmail_quote"><div dir="ltr">On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine <<a href="mailto:jfine2358@gmail.com">jfine2358@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Ram<br>

<br>

You wrote:<br>

<br>

> I'd like to use the re module to parse a long text file, 1GB in size. I<br>

> wish that the re module could parse a stream, so I wouldn't have to load<br>

> the whole thing into memory. I'd like to iterate over matches from the<br>

> stream without keeping the old matches and input in RAM.<br>

<br>

This is a regular expression problem, rather than a Python problem. A search for<br>

    regular expression large file<br>

brings up some URLs that might help you, starting with<br>

<a href="https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow" rel="noreferrer" target="_blank">https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow</a><br>

<br>

This might also be helpful<br>

<a href="https://svn.boost.org/trac10/ticket/11776" rel="noreferrer" target="_blank">https://svn.boost.org/trac10/ticket/11776</a><br>

<br>

What will work for your problem depends on the nature of the problem<br>

you have. The simplest thing that might work is to iterate over the file<br>

line-by-line, and use a regular expression to extract matches from<br>

each line.<br>

<br>

In other words, something like (not tested)<br>

<br>

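    import re<br>

<br>

    def helper(lines):<br>

        # pattern is your regular expression; matches never cross a line here<br>

        for line in lines:<br>

            yield from re.finditer(pattern, line)<br>

<br>

    with open('my-big-file.txt') as lines:<br>

        for match in helper(lines):<br>

            pass  # Do your stuff here<br>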

<br>

Parsing is not the same as lexing, see<br>

<a href="https://en.wikipedia.org/wiki/Lexical_analysis" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Lexical_analysis</a><br>

<br>

I suggest you use regular expressions ONLY for the lexing phase. If<br>

you'd like further help, perhaps first ask yourself this. Can the<br>

lexing be done on a line-by-line basis? And if not, why not?<br>

<br>

If line-by-line is not possible, then you'll have to modify the helper.<br>

At the end of each line, there'll be a residue / remainder, which<br>

you'll have to bring into the next line. In other words, the helper<br>

will have to record (and change) the state that exists at the end of<br>

each line. A bit like the 'carry' that is used when doing long<br>

addition.<br>
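
<br>

A minimal sketch of such a carry (not tested against real data, and only one<br>
of several ways to do it). The name iter_matches and the max_len parameter are<br>
hypothetical. It assumes that no match is longer than max_len characters, that<br>
the pattern never matches the empty string, and that it uses no anchors,<br>
lookahead or lookbehind. Under those assumptions, a match that starts at least<br>
max_len characters before the end of the buffered text cannot change when more<br>
input arrives, so it is safe to yield; everything after it is carried forward.<br>

```python
import re


def iter_matches(chunks, pattern, max_len):
    """Yield matched strings from an iterable of text chunks (e.g. lines).

    Assumptions (hypothetical, see the text above): no match is longer than
    max_len characters, the pattern never matches the empty string, and it
    uses no anchors or lookaround.
    """
    regex = re.compile(pattern)
    carry = ""  # unprocessed text carried over from earlier chunks
    for chunk in chunks:
        buf = carry + chunk
        # A match starting at or after cutoff might still grow or change
        # once more input arrives, so it has to wait for the next chunk.
        cutoff = len(buf) - max_len
        pos = 0  # end of the last match that was yielded from buf
        for m in regex.finditer(buf):
            if m.start() >= cutoff:
                break
            yield m.group()
            pos = m.end()
        # Keep only the part of the buffer that is not yet settled.
        carry = buf[max(pos, cutoff, 0):]
    # End of input: whatever is left in the carry is now final.
    for m in regex.finditer(carry):
        yield m.group()
```

With this, something like iter_matches(open('my-big-file.txt'), pattern, max_len=1000)<br>
keeps at most one chunk plus about max_len characters of carry in memory. It yields strings<br>
rather than match objects, because positions inside the buffer mean little to the caller.<br>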

<br>

I hope this helps.<br>

<br>

-- <br>

Jonathan<br>

<br>

</blockquote></div>