Hi, I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM. What do you think? Thanks, Ram.
06.10.18 10:22, Ram Rachum wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
How would you match 'a.*b' without loading the whole thing into memory?
It'll load as much as it needs to in order to match or rule out a match on a pattern. If you'd try to match `a.*b` it'll load the whole thing. The use cases that are relevant to a stream wouldn't have these kinds of problems. On Sat, Oct 6, 2018 at 11:22 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
06.10.18 10:22, Ram Rachum wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
How would you match 'a.*b' without loading the whole thing into memory?
Hi Ram,

You wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
This is a regular expression problem, rather than a Python problem. A search for "regular expression large file" brings up some URLs that might help you, starting with https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very...

This might also be helpful: https://svn.boost.org/trac10/ticket/11776

What will work for your problem depends on the nature of the problem you have. The simplest thing that might work is to iterate over the file line-by-line, and use a regular expression to extract matches from each line. In other words, something like (not tested):

    def helper(lines):
        for line in lines:
            yield from re.finditer(pattern, line)

    lines = open('my-big-file.txt')
    for match in helper(lines):
        ...  # Do your stuff here

Parsing is not the same as lexing; see https://en.wikipedia.org/wiki/Lexical_analysis

I suggest you use regular expressions ONLY for the lexing phase. If you'd like further help, perhaps first ask yourself this: can the lexing be done on a line-by-line basis? And if not, why not?

If line-by-line is not possible, then you'll have to modify the helper. At the end of each line, there'll be a residue / remainder, which you'll have to bring into the next line. In other words, the helper will have to record (and change) the state that exists at the end of each line. A bit like the 'carry' that is used when doing long addition.

I hope this helps.

-- Jonathan
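For concreteness, here is a rough, untested sketch of that 'carry' idea applied to arbitrary chunks rather than lines. The function name, the chunk size, and the assumption that no single match is longer than max_token_len bytes (and cannot grow once it has that much data after it) are illustrative choices, not anything the re module promises:

    import re

    def iter_stream_matches(pattern, stream, chunk_size=64 * 1024, max_token_len=1024):
        # `stream` is assumed to be a binary file-like object, e.g. open(path, 'rb').
        # Match spans are offsets into the rolling buffer, not into the file.
        regex = re.compile(pattern)
        buf = b""
        eof = False
        while not eof:
            chunk = stream.read(chunk_size)
            if chunk:
                buf += chunk
            else:
                eof = True
            # Until EOF, only report matches safely clear of the buffer's end.
            horizon = len(buf) if eof else len(buf) - max_token_len
            pos = 0
            deferred_start = None
            while pos <= horizon:
                m = regex.search(buf, pos)
                if m is None:
                    break
                if m.end() > horizon:
                    deferred_start = m.start()  # might still grow; retry next round
                    break
                yield m
                pos = m.end()
                if m.end() == m.start():
                    pos += 1  # avoid looping forever on zero-width matches
            if eof:
                break
            # The 'carry': keep any deferred match, plus the tail in which a
            # future match could still start, and drop everything before it.
            if deferred_start is not None:
                keep_from = deferred_start
            else:
                keep_from = max(0, len(buf) - max_token_len + 1)
            buf = buf[keep_from:]

The helper stays line-agnostic, so a pattern that spans line boundaries is fine as long as it fits within max_token_len bytes.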
"This is a regular expression problem, rather than a Python problem." Do you have evidence for this assertion, except that other regex implementations have this limitation? Is there a regex specification somewhere that specifies that streams aren't supported? Is there a fundamental reason that streams aren't supported? "Can the lexing be done on a line-by-line basis?" For my use case, it unfortunately can't. On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine <jfine2358@gmail.com> wrote:
Hi Ram
You wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
This is a regular expression problem, rather than a Python problem. A search for regular expression large file brings up some URLs that might help you, starting with
https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very...
This might also be helpful https://svn.boost.org/trac10/ticket/11776
What will work for your problem depends on the nature of the problem you have. The simplest thing that might work is to iterate over the file line-by-line, and use a regular expression to extract matches from each line.
In other words, something like (not tested)
    def helper(lines):
        for line in lines:
            yield from re.finditer(pattern, line)

    lines = open('my-big-file.txt')
    for match in helper(lines):
        ...  # Do your stuff here
Parsing is not the same as lexing, see https://en.wikipedia.org/wiki/Lexical_analysis
I suggest you use regular expressions ONLY for the lexing phase. If you'd like further help, perhaps first ask yourself this. Can the lexing be done on a line-by-line basis? And if not, why not?
If line-by-line is not possible, then you'll have to modify the helper. At the end of each line, there'll be a residue / remainder, which you'll have to bring into the next line. In other words, the helper will have to record (and change) the state that exists at the end of each line. A bit like the 'carry' that is used when doing long addition.
I hope this helps.
-- Jonathan
I wrote:
This is a regular expression problem, rather than a Python problem.
Ram wrote:
Do you have evidence for this assertion, except that other regex implementations have this limitation?
Yes.

1. I've already supplied: https://svn.boost.org/trac10/ticket/11776
2. https://en.wikipedia.org/wiki/Regular_expression does not contain the word 'stream'.

Ram wrote:
For my use case, it unfortunately can't.
For that case, Ram, I've already given my best advice. If you get a better idea, please do follow it. And share it with us, so we can all learn. -- Jonathan
On 10/6/18 7:25 AM, Ram Rachum wrote:
"This is a regular expression problem, rather than a Python problem."
Do you have evidence for this assertion, except that other regex implementations have this limitation? Is there a regex specification somewhere that specifies that streams aren't supported? Is there a fundamental reason that streams aren't supported?
"Can the lexing be done on a line-by-line basis?"
For my use case, it unfortunately can't.
You mentioned earlier that your use case doesn't have to worry about the "a.*b" problem. Can you tell us more about your scenario? How would the stream know it had read enough to match or not match? Perhaps that same logic can be used to feed the data in chunks? --Ned.
On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine <jfine2358@gmail.com <mailto:jfine2358@gmail.com>> wrote:
Hi Ram
You wrote:
> I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
This is a regular expression problem, rather than a Python problem. A search for regular expression large file brings up some URLs that might help you, starting with https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very...
This might also be helpful https://svn.boost.org/trac10/ticket/11776
What will work for your problem depends on the nature of the problem you have. The simplest thing that might work is to iterate over the file line-by-line, and use a regular expression to extract matches from each line.
In other words, something like (not tested)
    def helper(lines):
        for line in lines:
            yield from re.finditer(pattern, line)

    lines = open('my-big-file.txt')
    for match in helper(lines):
        ...  # Do your stuff here
Parsing is not the same as lexing, see https://en.wikipedia.org/wiki/Lexical_analysis
I suggest you use regular expressions ONLY for the lexing phase. If you'd like further help, perhaps first ask yourself this. Can the lexing be done on a line-by-line basis? And if not, why not?
If line-by-line is not possible, then you'll have to modify the helper. At the end of each line, there'll be a residue / remainder, which you'll have to bring into the next line. In other words, the helper will have to record (and change) the state that exists at the end of each line. A bit like the 'carry' that is used when doing long addition.
I hope this helps.
-- Jonathan
Hi Ned! I'm happy to see you here.

I'm doing multi-color 3d-printing. The slicing software generates a GCode file, which is a text file of instructions for the printer, each command meaning something like "move the head to coordinates x,y,z while extruding plastic at a rate of w" and lots of other administrative commands. (Turn the print fan on, heat the extruder to a certain temperature, calibrate the printer, etc.)

Here's an example of a simple GCode from a print I did last week: https://www.dropbox.com/s/kzmm6v8ilcn0aik/JPL%20Go%20hook.gcode?dl=0

It's 1.8MB in size. They could get to 1GB for complex prints.

Multi-color prints mean that at some points in the print, usually at a layer change, I'm changing the color. This means I need to insert an M600 command, which tells the printer to pause the print, move the head around, and give me a prompt to change the filament before continuing printing.

I'm sick of injecting the M600 manually after every print. I've been doing that for the last year. I'm working on a script so I could say "Insert an M600 command after layers 5, 88 and 234, and also before process Foo." The slicer inserts comments saying "; layer 234" or "; process Foo".

I want to identify the entire layer as one match. That's because I want to find the layer and possibly the process at the start, I want to find the last retraction command, the first extrusion command in the new layer, etc. So it's a regex that spans potentially thousands of lines. Then I'll know just where to put my M600 and how much retraction to do afterwards.

Thanks, Ram.

On Sat, Oct 6, 2018 at 6:58 PM Ned Batchelder <ned@nedbatchelder.com> wrote:
On 10/6/18 7:25 AM, Ram Rachum wrote:
"This is a regular expression problem, rather than a Python problem."
Do you have evidence for this assertion, except that other regex implementations have this limitation? Is there a regex specification somewhere that specifies that streams aren't supported? Is there a fundamental reason that streams aren't supported?
"Can the lexing be done on a line-by-line basis?"
For my use case, it unfortunately can't.
You mentioned earlier that your use case doesn't have to worry about the "a.*b" problem. Can you tell us more about your scenario? How would the stream know it had read enough to match or not match? Perhaps that same logic can be used to feed the data in chunks?
--Ned.
On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine <jfine2358@gmail.com> wrote:
Hi Ram
You wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
This is a regular expression problem, rather than a Python problem. A search for regular expression large file brings up some URLs that might help you, starting with
https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very...
This might also be helpful https://svn.boost.org/trac10/ticket/11776
What will work for your problem depends on the nature of the problem you have. The simplest thing that might work is to iterate over the file line-by-line, and use a regular expression to extract matches from each line.
In other words, something like (not tested)
    def helper(lines):
        for line in lines:
            yield from re.finditer(pattern, line)

    lines = open('my-big-file.txt')
    for match in helper(lines):
        ...  # Do your stuff here
Parsing is not the same as lexing, see https://en.wikipedia.org/wiki/Lexical_analysis
I suggest you use regular expressions ONLY for the lexing phase. If you'd like further help, perhaps first ask yourself this. Can the lexing be done on a line-by-line basis? And if not, why not?
If line-by-line is not possible, then you'll have to modify the helper. At the end of each line, there'll be a residue / remainder, which you'll have to bring into the next line. In other words, the helper will have to record (and change) the state that exists at the end of each line. A bit like the 'carry' that is used when doing long addition.
I hope this helps.
-- Jonathan
On 07Oct2018 07:30, Ram Rachum <ram@rachum.com> wrote:
I'm doing multi-color 3d-printing. The slicing software generates a GCode file, which is a text file of instructions for the printer, each command meaning something like "move the head to coordinates x,y,z while extruding plastic at a rate of w" and lots of other administrative commands. (Turn the print fan on, heat the extruder to a certain temperature, calibrate the printer, etc.)
Here's an example of a simple GCode from a print I did last week: https://www.dropbox.com/s/kzmm6v8ilcn0aik/JPL%20Go%20hook.gcode?dl=0
It's 1.8MB in size. They could get to 1GB for complex prints.
Multi-color prints means that at some points in the print, usually in a layer change, I'm changing the color. This means I need to insert an M600 command, which tells the printer to pause the print, move the head around, and give me a prompt to change the filament before continuing printing.
I'm sick of injecting the M600 manually after every print. I've been doing that for the last year. I'm working on a script so I could say "Insert an M600 command after layers 5, 88 and 234, and also before process Foo."
The slicer inserts comments saying "; layer 234" Or "; process Foo". I want to identify the entire layer as one match. That's because I want to find the layer and possibly process at the start, I want to find the last retraction command, the first extrusion command in the new layer, etc. So it's a regex that spans potentially thousands of lines.
Then I'll know just where to put my M600 and how much retraction to do afterwards.
Aha. Yeah, don't use a regexp for "the whole layer". I've fetched your file, and it is one instruction or comment per line. This is _easy_ to parse. Consider this totally untested sketch:

    layer_re = re.compile('^; layer (\d+), Z = (.*)')
    with open("JPL.gcode") as gcode:
        current_layer = None
        for lineno, line in enumerate(gcode, 1):
            m = layer_re.match(line)
            if m:
                # new layer
                new_layer = int(m.group(1))
                new_z = float(m.group(2))
                if current_layer is not None:
                    # process the saved previous layer
                    ..........
                current_layer = new_layer
                accrued_layer = []
            if current_layer is not None:
                # we're saving lines for later work
                accrued_layer.append(line)
                continue
            # otherwise, copy the line straight out
            sys.stdout.write(line)

The idea is that you scan the data on a per-line basis, adjusting some state variables as you see important lines. If you're "saving" a chunk of lines such as the instructions in a layer (in the above code: "current_layer is not None") you can stuff just those lines into a list for use when complete.

On changes of state you deal with what you may have saved, etc.

But just looking at your examples, you may not need to save anything; just insert or append lines during the copy. Example:

    with open("JPL.gcode") as gcode:
        for lineno, line in enumerate(gcode, 1):
            # pre line actions
            if line.startswith('; process '):
                print("M600 instruction...")
            # copy out the line
            sys.stdout.write(line)
            # post line actions
            if ...

So you don't need to apply a regexp to a huge chunk of file. Process the file on an instruction basis and insert/append your extra instructions as you see the boundaries of the code you're after.

A minor note. This incantation:

    for lineno, line in enumerate(gcode, 1):

is to make it easy to print error messages which recite the file line number, to aid debugging. If you don't need that you'd just run with:

    for line in gcode:

Cheers, Cameron Simpson <cs@cskk.id.au>
Hi Cameron,

Thanks for putting in the time to study my problem and sketch a solution. Unfortunately, it's not helpful. I was developing a solution similar to yours before I came to the conclusion that a multiline regex would be more elegant.

I find this algorithm to be quite complicated. It's basically a poor man's regex engine. I'm more likely to use a shim to make the re package work on streams (like regexy, or reading chunks until I get a match) than to use an algorithm like that.

Thanks, Ram.

On Sun, Oct 7, 2018 at 9:58 AM Cameron Simpson <cs@cskk.id.au> wrote:
On 07Oct2018 07:30, Ram Rachum <ram@rachum.com> wrote:

I'm doing multi-color 3d-printing. The slicing software generates a GCode file, which is a text file of instructions for the printer, each command meaning something like "move the head to coordinates x,y,z while extruding plastic at a rate of w" and lots of other administrative commands. (Turn the print fan on, heat the extruder to a certain temperature, calibrate the printer, etc.)
Here's an example of a simple GCode from a print I did last week: https://www.dropbox.com/s/kzmm6v8ilcn0aik/JPL%20Go%20hook.gcode?dl=0
It's 1.8MB in size. They could get to 1GB for complex prints.
Multi-color prints means that at some points in the print, usually in a layer change, I'm changing the color. This means I need to insert an M600 command, which tells the printer to pause the print, move the head around, and give me a prompt to change the filament before continuing printing.
I'm sick of injecting the M600 manually after every print. I've been doing that for the last year. I'm working on a script so I could say "Insert an M600 command after layers 5, 88 and 234, and also before process Foo."
The slicer inserts comments saying "; layer 234" Or "; process Foo". I want to identify the entire layer as one match. That's because I want to find the layer and possibly process at the start, I want to find the last retraction command, the first extrusion command in the new layer, etc. So it's a regex that spans potentially thousands of lines.
Then I'll know just where to put my M600 and how much retraction to do afterwards.
Aha.
Yeah, don't use a regexp for "the whole layer". I've fetched your file, and it is one instruction or comment per line. This is _easy_ to parse. Consider this totally untested sketch:
    layer_re = re.compile('^; layer (\d+), Z = (.*)')
    with open("JPL.gcode") as gcode:
        current_layer = None
        for lineno, line in enumerate(gcode, 1):
            m = layer_re.match(line)
            if m:
                # new layer
                new_layer = int(m.group(1))
                new_z = float(m.group(2))
                if current_layer is not None:
                    # process the saved previous layer
                    ..........
                current_layer = new_layer
                accrued_layer = []
            if current_layer is not None:
                # we're saving lines for later work
                accrued_layer.append(line)
                continue
            # otherwise, copy the line straight out
            sys.stdout.write(line)
The idea is that you scan the data on a per-line basis, adjusting some state variables as you see important lines. If you're "saving" a chunk of lines such as the instructions in a layer (in the above code: "current_layer is not None") you can stuff just those lines into a list for use when complete.
On changes of state you deal with what you may have saved, etc.
But just looking at your examples, you may not need to save anything; just insert or append lines during the copy. Example:
    with open("JPL.gcode") as gcode:
        for lineno, line in enumerate(gcode, 1):
            # pre line actions
            if line.startswith('; process '):
                print("M600 instruction...")
            # copy out the line
            sys.stdout.write(line)
            # post line actions
            if ...
So you don't need to apply a regexp to a huge chunk of file. Process the file on an instruction basis and insert/append your extra instructions as you see the boundaries of the code you're after.
A minor note. This incantation:
for lineno, line in enumerate(gcode, 1):
is to make it easy to print error message which recite the file line number to aid debugging. If you don't need that you'd just run with:
for line in gcode:
Cheers, Cameron Simpson <cs@cskk.id.au>
On 18-10-07 15.11, Ram Rachum wrote:
Unfortunately, it's not helpful. I was developing a solution similar to yours before I came to the conclusion that a multiline regex would be more elegant.
How about memory mapping your 1GB file? bytes patterns work on memoryviews. regards, Anders
I tested it now and indeed bytes patterns work on memoryview objects. But how do I use this to scan for patterns through a stream without loading it to memory? On Sun, Oct 7, 2018 at 4:24 PM <2015@jmunch.dk> wrote:
On 18-10-07 15.11, Ram Rachum wrote:
Unfortunately, it's not helpful. I was developing a solution similar to yours before I came to the conclusion that a multiline regex would be more elegant.
How about memory mapping your 1GB file?
bytes patterns work on memoryviews.
regards, Anders
On 18-10-07 16.15, Ram Rachum wrote:
I tested it now and indeed bytes patterns work on memoryview objects. But how do I use this to scan for patterns through a stream without loading it to memory?
An mmap object is one of the things you can make a memoryview of, although looking again, it seems you don't even need to, you can just re.search the mmap object directly. re.search'ing the mmap object means the operating system takes care of the streaming for you, reading in parts of the file only as necessary. regards, Anders
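For concreteness, a minimal, untested sketch of what that looks like; the file name and the layer-comment pattern are placeholders based on the GCode example earlier in the thread:

    import mmap
    import re

    layer_comment = re.compile(rb'^; layer \d+', re.MULTILINE)

    with open('huge.gcode', 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # re searches the mapped bytes directly; the OS pages parts of the
            # file in and out as needed, so it never all has to sit in RAM.
            for match in layer_comment.finditer(mapped):
                print(match.start(), match.group())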
Anders wrote
An mmap object is one of the things you can make a memoryview of, although looking again, it seems you don't even need to, you can just re.search the mmap object directly.
re.search'ing the mmap object means the operating system takes care of the streaming for you, reading in parts of the file only as necessary.
Provided mmap releases memory when possible, this is a good solution, and probably in many cases better than the one I suggested. And closer to what the original poster asked for. So I've learnt something. Thank you all. -- Jonathan
Jonathan Fine wrote:
Provided mmap releases memory when possible,
It will. The virtual memory system will read pages from the file into RAM when needed, and re-use those RAM pages for other purposes when needed. It should be pretty much the most efficient solution possible. -- Greg
That's incredibly interesting. I've never used mmap before. However, there's a problem. I did a few experiments with mmap now, this is the latest:

    path = pathlib.Path(r'P:\huge_file')

    with path.open('r') as file:
        mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
        for match in re.finditer(b'.', mmap):
            pass

The file is 338GB in size, and it seems that Python is trying to load it into memory. The process is now taking 4GB RAM and it's growing. I saw the same behavior when searching for a non-existing match.

Should I open a Python bug for this?

On Sun, Oct 7, 2018 at 7:49 PM <2015@jmunch.dk> wrote:
On 18-10-07 16.15, Ram Rachum wrote:
I tested it now and indeed bytes patterns work on memoryview objects. But how do I use this to scan for patterns through a stream without loading it to memory?
An mmap object is one of the things you can make a memoryview of, although looking again, it seems you don't even need to, you can just re.search the mmap object directly.
re.search'ing the mmap object means the operating system takes care of the streaming for you, reading in parts of the file only as necessary.
regards, Anders
On 08Oct2018 10:56, Ram Rachum <ram@rachum.com> wrote:
That's incredibly interesting. I've never used mmap before. However, there's a problem. I did a few experiments with mmap now, this is the latest:
path = pathlib.Path(r'P:\huge_file')
with path.open('r') as file: mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
for match in re.finditer(b'.', mmap): pass
The file is 338GB in size, and it seems that Python is trying to load it into memory. The process is now taking 4GB RAM and it's growing. I saw the same behavior when searching for a non-existing match.
Should I open a Python bug for this?
Probably not. First figure out what is going on. BTW, how much RAM have you got?

As you access the mapped file the OS will try to keep it in memory in case you need that again. In the absence of competition, most stuff will get paged out to accommodate it. That's normal. All the data are "clean" (unmodified) so the OS can simply release the older pages instantly if something else needs the RAM.

However, another possibility is that the regexp is consuming lots of memory. The regexp seems simple enough (b'.'), so I doubt it is leaking memory like mad; I'm guessing you're just seeing the OS page in as much of the file as it can.

Also, does the loop iterate? i.e. does it find multiple matches as the memory gets consumed, or is the first iteration blocking and consuming gobs of memory before the first match comes back? A print() call will tell you that.

Cheers, Cameron Simpson <cs@cskk.id.au>
I'm not an expert on memory. I used Process Explorer to look at the Process. The Working Set of the current run is 11GB. The Private Bytes is 708MB. Actually, see all the info here: https://www.dropbox.com/s/tzoud028pzdkfi7/screenshot_TURING_2018-10-08_13335... I've got 16GB of RAM on this computer, and Process Explorer says it's almost full, just ~150MB left. This is physical memory. To your question: The loop does iterate, i.e. finding multiple matches. On Mon, Oct 8, 2018 at 1:20 PM Cameron Simpson <cs@cskk.id.au> wrote:
On 08Oct2018 10:56, Ram Rachum <ram@rachum.com> wrote:
That's incredibly interesting. I've never used mmap before. However, there's a problem. I did a few experiments with mmap now, this is the latest:
path = pathlib.Path(r'P:\huge_file')
with path.open('r') as file: mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
for match in re.finditer(b'.', mmap): pass
The file is 338GB in size, and it seems that Python is trying to load it into memory. The process is now taking 4GB RAM and it's growing. I saw the same behavior when searching for a non-existing match.
Should I open a Python bug for this?
Probably not. First figure out what is going on. BTW, how much RAM have you got?
As you access the mapped file the OS will try to keep it in memory in case you need that again. In the absence of competition, most stuff will get paged out to accommodate it. That's normal. All the data are "clean" (unmodified) so the OS can simply release the older pages instantly if something else needs the RAM.
However, another possibility is that the regexp is consuming lots of memory.
The regexp seems simple enough (b'.'), so I doubt it is leaking memory like mad; I'm guessing you're just seeing the OS page in as much of the file as it can.
Also, does the loop iterate? i.e. does it find multiple matches as the memory gets consumed, or is the first iteration blocking and consuming gobs of memory before the first match comes back? A print() call will tell you that.
Cheers, Cameron Simpson <cs@cskk.id.au>
On 08Oct2018 13:36, Ram Rachum <ram@rachum.com> wrote:
I'm not an expert on memory. I used Process Explorer to look at the Process. The Working Set of the current run is 11GB. The Private Bytes is 708MB. Actually, see all the info here: https://www.dropbox.com/s/tzoud028pzdkfi7/screenshot_TURING_2018-10-08_13335...
And the process' virtual size is about 353GB, which matches having your file mmaped (its contents are now part of your process virtual memory space).
I've got 16GB of RAM on this computer, and Process Explorer says it's almost full, just ~150MB left. This is physical memory.
I'd say this is expected behaviour. As you access the memory it is paged into physical memory, and since it may be wanted again (the OS can't tell) it isn't paged out until that becomes necessary to make room for other virtual pages. I suspect (but would need to test to find out) that sequentially reading the file instead of memory mapping it might not be so aggressive because your process would be reusing that same small pool of memory to hold data as you scan the file. Cheers, Cameron Simpson <cs@cskk.id.au>
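For comparison, an untested sketch of that sequential, chunked scan, reusing the b'.' pattern and the path from the experiment above; the 1 MiB chunk size is an arbitrary choice:

    import re

    pattern = re.compile(b'.')

    with open(r'P:\huge_file', 'rb') as f:
        while True:
            chunk = f.read(1 << 20)  # 1 MiB at a time; the buffer is reused each pass
            if not chunk:
                break
            for match in pattern.finditer(chunk):
                pass  # do something with each match
            # NB: a match that straddled a chunk boundary would be missed; that
            # cannot happen with b'.', but it can with longer patterns.

The process then keeps reusing roughly one chunk's worth of memory, as described above.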
On Mon, Oct 8, 2018 at 12:20 PM Cameron Simpson <cs@cskk.id.au> wrote:
On 08Oct2018 10:56, Ram Rachum <ram@rachum.com> wrote:
That's incredibly interesting. I've never used mmap before. However, there's a problem. I did a few experiments with mmap now, this is the latest:
path = pathlib.Path(r'P:\huge_file')
with path.open('r') as file: mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
for match in re.finditer(b'.', mmap): pass
The file is 338GB in size, and it seems that Python is trying to load it into memory. The process is now taking 4GB RAM and it's growing. I saw the same behavior when searching for a non-existing match.
Should I open a Python bug for this?
Probably not. First figure out what is going on. BTW, how much RAM have you got?
As you access the mapped file the OS will try to keep it in memory in case you need that again. In the absence of competition, most stuff will get paged out to accommodate it. That's normal. All the data are "clean" (unmodified) so the OS can simply release the older pages instantly if something else needs the RAM.
However, another possibility is that the regexp is consuming lots of memory.
The regexp seems simple enough (b'.'), so I doubt it is leaking memory like mad; I'm guessing you're just seeing the OS page in as much of the file as it can.
Yup. Windows will aggressively fill up your RAM in cases like this because after all why not? There's no use to having memory just sitting around unused. For read-only, non-anonymous mappings it's not much problem for the OS to drop pages that haven't been recently accessed and use them for something else. So I wouldn't be too worried about the process chewing up RAM. I feel like this is veering more into python-list territory for further discussion though.
" Windows will aggressively fill up your RAM in cases like this because after all why not? There's no use to having memory just sitting around unused." Two questions: 1. Is the "why not" sarcastic, as in you're agreeing it's a waste? 2. Will this be different on Linux? Which command do I run on Linux to verify that the process isn't taking too much RAM? Thanks, Ram. On Mon, Oct 8, 2018 at 3:02 PM Erik Bray <erik.m.bray@gmail.com> wrote:
On Mon, Oct 8, 2018 at 12:20 PM Cameron Simpson <cs@cskk.id.au> wrote:
On 08Oct2018 10:56, Ram Rachum <ram@rachum.com> wrote:
That's incredibly interesting. I've never used mmap before. However, there's a problem. I did a few experiments with mmap now, this is the latest:
path = pathlib.Path(r'P:\huge_file')
with path.open('r') as file: mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
for match in re.finditer(b'.', mmap): pass
The file is 338GB in size, and it seems that Python is trying to load it into memory. The process is now taking 4GB RAM and it's growing. I saw the same behavior when searching for a non-existing match.
Should I open a Python bug for this?
Probably not. First figure out what is going on. BTW, how much RAM have you got?
As you access the mapped file the OS will try to keep it in memory in case you need that again. In the absence of competition, most stuff will get paged out to accommodate it. That's normal. All the data are "clean" (unmodified) so the OS can simply release the older pages instantly if something else needs the RAM.

However, another possibility is that the regexp is consuming lots of memory.

The regexp seems simple enough (b'.'), so I doubt it is leaking memory like mad; I'm guessing you're just seeing the OS page in as much of the file as it can.
Yup. Windows will aggressively fill up your RAM in cases like this because after all why not? There's no use to having memory just sitting around unused. For read-only, non-anonymous mappings it's not much problem for the OS to drop pages that haven't been recently accessed and use them for something else. So I wouldn't be too worried about the process chewing up RAM.
I feel like this is veering more into python-list territory for further discussion though.
On 10/8/18 8:11 AM, Ram Rachum wrote:
" Windows will aggressively fill up your RAM in cases like this because after all why not? There's no use to having memory just sitting around unused."
Two questions:
1. Is the "why not" sarcastic, as in you're agreeing it's a waste? 2. Will this be different on Linux? Which command do I run on Linux to verify that the process isn't taking too much RAM?
Thanks, Ram.

I would say the 'why not' isn't being sarcastic but pragmatic. (And I would expect Linux to work similarly.) After all, if you have a system with X amount of memory, and total memory demand for the other processes is 10% of X, what is the issue with letting one process use 80% of X with memory usage that is easy to clear out if something else wants it? A read-only page that is already backed on the disk is trivial to make available for another usage.
Memory just sitting idle is the real waste. -- Richard Damon
Thanks for your help everybody! I'm very happy to have learned about mmap. On Mon, Oct 8, 2018 at 3:27 PM Richard Damon <Richard@damon-family.org> wrote:
On 10/8/18 8:11 AM, Ram Rachum wrote:
" Windows will aggressively fill up your RAM in cases like this because after all why not? There's no use to having memory just sitting around unused."
Two questions:
1. Is the "why not" sarcastic, as in you're agreeing it's a waste? 2. Will this be different on Linux? Which command do I run on Linux to verify that the process isn't taking too much RAM?
Thanks, Ram.

I would say the 'why not' isn't being sarcastic but pragmatic. (And I would expect Linux to work similarly.) After all, if you have a system with X amount of memory, and total memory demand for the other processes is 10% of X, what is the issue with letting one process use 80% of X with memory usage that is easy to clear out if something else wants it? A read-only page that is already backed on the disk is trivial to make available for another usage.
Memory just sitting idle is the real waste.
-- Richard Damon
However, another possibility is that the regexp is consuming lots of memory.
The regexp seems simple enough (b'.'), so I doubt it is leaking memory like mad; I'm guessing you're just seeing the OS page in as much of the file as it can.
Yup. Windows will aggressively fill up your RAM in cases like this because after all why not? There's no use to having memory just sitting around unused. For read-only, non-anonymous mappings it's not much problem for the OS to drop pages that haven't been recently accessed and use them for something else. So I wouldn't be too worried about the process chewing up RAM.
I feel like this is veering more into python-list territory for further discussion though.
Last time I worked on Windows, which admittedly was a long time ago, the file cache was not attributed to a process, so this doesn't seem to be relevant to this situation.

/ Anders
On Mon, Oct 8, 2018 at 11:15 PM Anders Hovmöller <boxed@killingar.net> wrote:
However, another possibility is that the regexp is consuming lots of memory.
The regexp seems simple enough (b'.'), so I doubt it is leaking memory like mad; I'm guessing you're just seeing the OS page in as much of the file as it can.
Yup. Windows will aggressively fill up your RAM in cases like this because after all why not? There's no use to having memory just sitting around unused. For read-only, non-anonymous mappings it's not much problem for the OS to drop pages that haven't been recently accessed and use them for something else. So I wouldn't be too worried about the process chewing up RAM.
I feel like this is veering more into python-list territory for further discussion though.
Last time I worked on Windows, which admittedly was a long time ago, the file cache was not attributed to a process, so this doesn't seem to be relevant to this situation.
Depends whether it's a file cache or a memory-mapped file, though. On Linux, if I open a file, read it, then close it, I'm not using that file any more, but it might remain in cache (which will mean that re-reading it will be fast, regardless of whether that's from the same or a different process). That usage shows up as either "buffers" or "cache", and doesn't belong to any process. In contrast, a mmap'd file is memory that you do indeed own. If the system runs short of physical memory, it can simply discard those pages (rather than saving them to the swap file), but they're still owned by one specific process, and should count in that process's virtual memory. (That's based on my knowledge of Linux today and OS/2 back in the 90s. It may or may not be accurate to Windows, but I suspect it won't be very far wrong.) ChrisA
Chris Angelico wrote:
In contrast, a mmap'd file is memory that you do indeed own.
Although it's not really accurate to say that it's owned by a particular process. If two processes mmap the same file, the physical memory pages holding it appear in the address spaces of both processes. -- Greg
On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Chris Angelico wrote:
In contrast, a mmap'd file is memory that you do indeed own.
Although it's not really accurate to say that it's owned by a particular process. If two processes mmap the same file, the physical memory pages holding it appear in the address spaces of both processes.
Yeah, which is a constant problem when you ask "why am I out of memory". Same thing happens if you have any other (non-mmap'd) pages and then you fork, or if two processes are using the same executable (though I think that's actually mmap again), etc, etc, etc. Tell me, which process is responsible for libc being in memory? Other than, like, all of them? ChrisA
Chris Angelico writes:
On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Chris Angelico wrote:
In contrast, a mmap'd file is memory that you do indeed own.
Although it's not really accurate to say that it's owned by a particular process. If two processes mmap the same file, the physical memory pages holding it appear in the address spaces of both processes.
Subject to COW, I presume. Probably in units smaller than the whole file (per page?)
Tell me, which process is responsible for libc being in memory? Other than, like, all of them?
Yes. Why would you want a different answer?
On Wed, Oct 10, 2018 at 2:42 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes:
On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Chris Angelico wrote:
In contrast, a mmap'd file is memory that you do indeed own.
Although it's not really accurate to say that it's owned by a particular process. If two processes mmap the same file, the physical memory pages holding it appear in the address spaces of both processes.
Subject to COW, I presume. Probably in units smaller than the whole file (per page?)
Both processes are using the virtual memory. Either or both could be using physical memory. Assuming they haven't written to the pages (which is the case with executables - the system mmaps the binary into your memory space as read-only), and assuming that those pages are backed by physical memory, which process is using that memory?
Tell me, which process is responsible for libc being in memory? Other than, like, all of them?
Yes. Why would you want a different answer?
Because that would mean that I have way more *physical* memory in use than I actually have chips on the motherboard for. Yes, virtual memory can be over-allocated, but physical memory can't. How do you measure where your physical memory is being used, if adding up the usage of all processes exceeds the available resources? It's like trying to figure out which countries the people of this planet live in, and saying that there are X billion in China, Y million in Australia, etc, etc, etc, and then discovering that you're counting people multiple times if they ever move... which means that there are ten or twenty billion people on the planet. ChrisA
Chris Angelico writes:
Both processes are using the virtual memory. Either or both could be using physical memory. Assuming they haven't written to the pages (which is the case with executables - the system mmaps the binary into your memory space as read-only), and assuming that those pages are backed by physical memory, which process is using that memory?
One doesn't know. Clever side-channel attacks aside, I don't care, and I don't see how it could matter.
Tell me, which process is responsible for libc being in memory? Other than, like, all of them?
Yes. Why would you want a different answer?
Because that would mean that I have way more *physical* memory in use than I actually have chips on the motherboard for.
No, that's like saying that because you have multiple links to a file on disk you're using more physical disk than you have.
Yes, virtual memory can be over-allocated, but physical memory can't. How do you measure where your physical memory is being used, if adding up the usage of all processes exceeds the available resources?
lsof can do it, I guess. The kernel knows which pages of which processes' virtual memory are backed by physical memory. But just as running over the array of inodes tells you how much physical disk is allocated, without looking at directories to find out what it's allocated to, the kernel probably has a bitmap or similar so it knows which pages of physical memory are in use. All problems are easy once you have enough indirection. :^)
On Wed, Oct 10, 2018 at 5:09 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes:
Both processes are using the virtual memory. Either or both could be using physical memory. Assuming they haven't written to the pages (which is the case with executables - the system mmaps the binary into your memory space as read-only), and assuming that those pages are backed by physical memory, which process is using that memory?
One doesn't know. Clever side-channel attacks aside, I don't care, and I don't see how it could matter.
It matters a lot when you're trying to figure out what your system is doing.
Tell me, which process is responsible for libc being in memory? Other than, like, all of them?
Yes. Why would you want a different answer?
Because that would mean that I have way more *physical* memory in use than I actually have chips on the motherboard for.
No, that's like saying that because you have multiple links to a file on disk you're using more physical disk than you have.
Actually, that's exactly the same problem, with exactly the same consequences. How do you figure out why your disk is full? How do you enforce disk quotas? How can you get any sort of reasonable metrics about anything when the sum of everything vastly exceeds the actual usage? ChrisA
Chris Angelico writes:
On Wed, Oct 10, 2018 at 5:09 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes:
Both processes are using the virtual memory. Either or both could be using physical memory. Assuming they haven't written to the pages (which is the case with executables - the system mmaps the binary into your memory space as read-only), and assuming that those pages are backed by physical memory, which process is using that memory?
One doesn't know. Clever side-channel attacks aside, I don't care, and I don't see how it could matter.
It matters a lot when you're trying to figure out what your system is doing.
Sure, but knowing how your system works is far more important. Eg, create a 1TB file on a POSIX system, delete it while a process still has it opened, and it doesn't matter how you process the output of du or ls, you still have 1TB of used file space not accounted for. The same applies to swapfiles. But "df" knows and will tell you. In fact, "ps" will tell you how much shared memory a process is using. I just don't see a problem here, on the "I'm not getting the data I need" side. You do have access to the data you need.
Tell me, which process is responsible for libc being in memory? Other than, like, all of them?
Yes. Why would you want a different answer?
Because that would mean that I have way more *physical* memory in use than I actually have chips on the motherboard for.
No, that's like saying that because you have multiple links to a file on disk you're using more physical disk than you have.
Actually, that's exactly the same problem, with exactly the same consequences. How do you figure out why your disk is full? How do you enforce disk quotas? How can you get any sort of reasonable metrics about anything when the sum of everything vastly exceeds the actual usage?
You add up the right things, of course, and avoid paradoxes.

The disk quota enforcement problem is indeed hard. This sounds to me like a special problem studied in cost accounting, a problem which was solved (a computation that satisfies certain axioms was shown to exist and be unique) in a sense by Aumann and Shapley in the 1970s. The A-S prices have been used by telephone carriers to allocate costs of fixed assets with capacity constraints to individual calls, though I don't know if the method is still in use. I'm not sure if the disk quota problem fits the A-S theorem (which imposes certain monotonicity conditions), but the general paradigm does.

However, the quota problem (and in general, the problem of allocation of overhead) is "hard" even if you have complete information about the system, because it's a values problem: what events are bad? what events are worse? what events are unacceptable (result in bankruptcy and abandonment of the system)?

Getting very complete, accurate information about the physical consequences of individual events in the system (linking to a file on disk, allocating a large quantity of virtual memory) is not difficult, in the sense that you throw money and engineers at it, and you get "df". Getting very complete, accurate information about the values you're trying to satisfy is possible only for an omniscient god, even if, as in business, they can be measured in currency units.

Steve
On 10Oct2018 00:42, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes:
On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Chris Angelico wrote:
In contrast, a mmap'd file is memory that you do indeed own.
Although it's not really accurate to say that it's owned by a particular process. If two processes mmap the same file, the physical memory pages holding it appear in the address spaces of both processes.
Subject to COW, I presume. Probably in units smaller than the whole file (per page?)
Yes, pages (whatever unit the memory paging service works in). But not COW. Changes to the memory affect the file contents and are visible to the other process. It is mapped, not pseudocopied.

Cheers, Cameron Simpson <cs@cskk.id.au>
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram@rachum.com> wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.

The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".

So our requirements are:

1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last search left off*

The last requirement is subtle, but important. The naive approach would be to rescan your whole receive buffer after each new packet arrives:

    end_of_headers = re.compile(b"\r\n\r\n|\n\n")
    while True:
        m = end_of_headers.search(receive_buffer)
        if m is None:
            receive_buffer += await get_more_data_from_network()
            # loop around and try again
        else:
            break

But the code above is quadratic! If the headers are N bytes long, then on each pass through the loop we perform an O(N) regex search, and we do O(N) passes through the loop, so the whole thing is O(N**2). That means your HTTP client-or-server can be trivially DoSed by a peer who sends their headers broken into lots of small fragments.

Fortunately, there's an elegant and natural solution: Just save the regex engine's internal state when it hits the end of the string, and then when more data arrives, use the saved state to pick up the search where we left off. Theoretically, any regex engine *could* support this – it's especially obvious for DFA-based matchers, but even backtrackers like Python's re could support it, basically by making the matching engine a coroutine that can suspend itself when it hits the end of the input, then resume it when new input arrives. Like, if you asked Knuth for the theoretically optimal design for this parser, I'm pretty sure this is what he'd tell you to use, and it's what people do when writing high-performance HTTP parsers in C.

But unfortunately, in reality, re *doesn't* support this kind of pause/resume functionality, and you can't write efficient character-by-character algorithms in Python, so you have to use really awkward hacks instead. For the HTTP header case, the best I've been able to come up with is to manually analyze the regex to figure out the maximum size string it could match (in this case, 4 bytes), and then write a loop that tracks how long the string was before the last time we appended new data, and on each iteration searches the substring receive_buffer[old_length - 4 + 1:]. This is super finicky, and especially annoying if you want to offer this as a generic API for using regexes to deconstruct network streams. (There are a lot of Python network libraries that have accidentally-quadratic parsers in them.)

In practice I suspect retrofitting this functionality into 're' would be a lot of work. But it's definitely frustrating that we have 90% of the machinery we'd need to do things the natural/efficient way, but then are thwarted by this arbitrary API limitation.

-n

-- Nathaniel J. Smith -- https://vorpus.org
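For concreteness, an untested sketch of that workaround; the wrapper function and the network-read coroutine passed into it are placeholders, and 4 is the length of the longest string the pattern above can match:

    import re

    END_OF_HEADERS = re.compile(b"\r\n\r\n|\n\n")
    MAX_MATCH_LEN = 4  # longest possible match of the pattern above

    async def find_end_of_headers(get_more_data_from_network):
        receive_buffer = bytearray()
        searched_up_to = 0
        while True:
            # Resume just behind where the previous search stopped; the
            # MAX_MATCH_LEN - 1 bytes of overlap catch a delimiter that was
            # split across two packets.
            start = max(0, searched_up_to - (MAX_MATCH_LEN - 1))
            match = END_OF_HEADERS.search(receive_buffer, start)
            if match is not None:
                return match.end()  # offset where the headers end
            searched_up_to = len(receive_buffer)
            receive_buffer += await get_more_data_from_network()

Each pass only rescans the bytes that arrived since the previous pass (plus three bytes of overlap), so the total work is linear in the header size rather than quadratic.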
On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <njs@pobox.com> wrote:
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram@rachum.com> wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".
So our requirements are:
1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last search left off*
The last requirement is subtle, but important. The naive approach would be to rescan your whole receive buffer after each new packet arrives:
    end_of_headers = re.compile(b"\r\n\r\n|\n\n")
    while True:
        m = end_of_headers.search(receive_buffer)
        if m is None:
            receive_buffer += await get_more_data_from_network()
            # loop around and try again
        else:
            break
But the code above is quadratic! If the headers are N bytes long, then on each pass through the loop we perform an O(N) regex search, and we do O(N) passes through the loop, so the whole thing is O(N**2). That means your HTTP client-or-server can be trivially DoSed by a peer who sends their headers broken into lots of small fragments.
Quadratic in the size of the headers only, so you could just cap it - if the receive buffer is too large, just reject it. Sometimes, the simplest approach is the best. ChrisA
On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <njs@pobox.com> wrote:
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram@rachum.com> wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".
So our requirements are:
1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last search left off*
The last requirement is subtle, but important. The naive approach would be to rescan your whole receive buffer after each new packet arrives:
```python
end_of_headers = re.compile(b"\r\n\r\n|\n\n")
while True:
    m = end_of_headers.search(receive_buffer)
    if m is None:
        receive_buffer += await get_more_data_from_network()
        # loop around and try again
    else:
        break
```
But the code above is quadratic! If the headers are N bytes long, then on each pass through the loop we perform an O(N) regex search, and we do O(N) passes through the loop, so the whole thing is O(N**2). That means your HTTP client-or-server can be trivially DoSed by a peer who sends their headers broken into lots of small fragments.
Quadratic in the size of the headers only, so you could just cap it - if the receive buffer is too large, just reject it. Sometimes, the simplest approach is the best.
But OTOH, every problem has a solution that's simple, obvious, and wrong :-).

Of course you set a cap on the header size, to prevent other kinds of DoS (e.g. memory exhaustion). But it turns out people stuff a lot of data into HTTP headers [1], so if the cap is large enough to support non-malicious usage, then it's also large enough to let people DoS the naive O(N**2) algorithm. Production-quality HTTP/1.1 parsers really do have to use an O(N) algorithm here.

And similarly, if you're building a generic helper library for people implementing arbitrary unknown protocols, then you can't assume their protocols were designed to use small frames only, to avoid hitting arbitrary limitations in Python's re module.

-n

[1] E.g., 16 KiB total header size is already enough that on my laptop, the naive O(N**2) algorithm takes ~750 ms CPU time, versus ~16 ms for the O(N) algorithm. HTTP/2 tried hard to simplify their header encoding scheme by putting limits on header size, but got so much push-back that they were eventually forced to add special hacks to allow for arbitrarily large headers – in particular, it turns out that people use *individual cookies* that are larger than 16 KiB: https://http2.github.io/faq/#why-the-rules-around-continuation-on-headers-fr...

-- Nathaniel J. Smith -- https://vorpus.org
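(To make the footnote concrete, here is a rough benchmark sketch comparing the naive full rescan with the "only rescan the tail" variant on a synthetic header block delivered in tiny fragments. The header contents, fragment size, and output format are invented for illustration; absolute numbers will differ from the ones quoted above, but the naive version's cost grows roughly quadratically with header size while the other stays linear.)

```python
import re
import time

end_of_headers = re.compile(b"\r\n\r\n|\n\n")

def fragments(total_size, fragment_size=4):
    # Synthetic header block delivered in tiny pieces, terminated by a blank line.
    line = b"x-filler: " + b"a" * 52 + b"\r\n"          # 64 bytes per header line
    headers = line * (total_size // len(line)) + b"\r\n"
    for i in range(0, len(headers), fragment_size):
        yield headers[i:i + fragment_size]

def naive(chunks):
    buf = bytearray()
    for chunk in chunks:
        buf += chunk
        m = end_of_headers.search(buf)                  # rescans everything each time
        if m:
            return m.end()

def linear(chunks):
    buf, searched = bytearray(), 0
    for chunk in chunks:
        buf += chunk
        m = end_of_headers.search(buf, max(0, searched - 3))  # only the new tail
        if m:
            return m.end()
        searched = len(buf)

for fn in (naive, linear):
    t0 = time.perf_counter()
    fn(fragments(16 * 1024))
    print(fn.__name__, round(time.perf_counter() - t0, 4), "s")
```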
![](https://secure.gravatar.com/avatar/d67ab5d94c2fed8ab6b727b62dc1b213.jpg?s=120&d=mm&r=g)
On Sun, Oct 7, 2018 at 9:54 AM Nathaniel Smith <njs@pobox.com> wrote:
On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <njs@pobox.com> wrote:
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram@rachum.com> wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".
So our requirements are:
1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last search left off*
The last requirement is subtle, but important. The naive approach would be to rescan your whole receive buffer after each new packet arrives:
```python
end_of_headers = re.compile(b"\r\n\r\n|\n\n")
while True:
    m = end_of_headers.search(receive_buffer)
    if m is None:
        receive_buffer += await get_more_data_from_network()
        # loop around and try again
    else:
        break
```
But the code above is quadratic! If the headers are N bytes long, then on each pass through the loop we perform an O(N) regex search, and we do O(N) passes through the loop, so the whole thing is O(N**2). That means your HTTP client-or-server can be trivially DoSed by a peer who sends their headers broken into lots of small fragments.
Quadratic in the size of the headers only, so you could just cap it - if the receive buffer is too large, just reject it. Sometimes, the simplest approach is the best.
But OTOH, every problem has a solution that's simple, obvious, and wrong :-).
Of course you set a cap on the header size, to prevent other kinds of DoS (e.g. memory exhaustion). But it turns out people stuff a lot of data into HTTP headers [1], so if the cap is large enough to support non-malicious usage, then it's also large enough to let people DoS the naive O(N**2) algorithm. Production-quality HTTP/1.1 parsers really do have to use an O(N) algorithm here.
Fair enough! You could instead use a simple linear parser rather than searching for end of headers first. That way, all you're looking for is a \n, no regex needed. Sometimes the VERY simplest solution doesn't work, but I still do like to keep things as simple as possible :) ChrisA
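(A rough sketch of the linear, line-at-a-time approach Chris describes. The function name and buffering strategy are invented for illustration; each byte is examined once, so the whole thing is O(N) with no regex involved.)

```python
def feed(buf, chunk, header_lines):
    """Consume one network chunk. Return True once the blank line ending the
    headers has been seen. `buf` is a bytearray holding the unparsed tail."""
    buf += chunk
    while True:
        nl = buf.find(b"\n")
        if nl == -1:
            return False                 # no complete line yet; wait for more data
        line = bytes(buf[:nl + 1])
        del buf[:nl + 1]                 # only the unfinished tail is ever kept
        if line in (b"\r\n", b"\n"):
            return True                  # blank line: end of headers
        header_lines.append(line)

# Example use with two artificial chunks:
buf, header_lines = bytearray(), []
for chunk in (b"Host: example\r\nAccept: *", b"/*\r\n\r\nbody..."):
    if feed(buf, chunk, header_lines):
        break
# header_lines == [b'Host: example\r\n', b'Accept: */*\r\n']
```

Parsing the individual header lines can then happen separately, after the framing problem is solved.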
![](https://secure.gravatar.com/avatar/5615a372d9866f203a22b2c437527bbb.jpg?s=120&d=mm&r=g)
On Sat, Oct 06, 2018 at 02:00:27PM -0700, Nathaniel Smith wrote:
Fortunately, there's an elegant and natural solution: Just save the regex engine's internal state when it hits the end of the string, and then when more data arrives, use the saved state to pick up the search where we left off. Theoretically, any regex engine *could* support this – it's especially obvious for DFA-based matchers, but even backtrackers like Python's re could support it, basically by making the matching engine a coroutine that can suspend itself when it hits the end of the input, then resume it when new input arrives. Like, if you asked Knuth for the theoretically optimal design for this parser, I'm pretty sure this is what he'd tell you to use, and it's what people do when writing high-performance HTTP parsers in C.
The message I take from this is:

- regex engines certainly can be written to support streaming data;
- but few of them are;
- and it is exceedingly unlikely that such support can be retrofitted, easily or at all, onto Python's existing re module.

Perhaps the solution is a lightweight streaming DFA regex parser (a toy sketch of such a matcher follows below)? Does anyone know whether MRAB's regex library supports this? https://pypi.org/project/regex/
you can't write efficient character-by-character algorithms in Python
I'm sure that Python will never be as efficient as C in that regard (although PyPy might argue the point) but is there something we can do to ameliorate this? If we could make char-by-char processing only 10 times less efficient than C instead of 100 times (let's say...) perhaps that would help Ram (and you?) with your use-cases? -- Steve
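(To illustrate the streaming idea – the "save the matcher's state at the end of input" trick from Nathaniel's message, and the kind of thing a streaming DFA engine would generate automatically – here is a toy, hand-written state machine for just this one pattern, b"\r\n\r\n|\n\n". The class name and API are invented for the example; whether the third-party regex library can do resumable matching is left as the open question raised above.)

```python
class DoubleLineEndMatcher:
    """Toy streaming matcher for b"\r\n\r\n|\n\n" that survives chunk boundaries.

    The state records the longest suffix of the bytes seen so far that is
    still a prefix of one of the two terminators, so a search can be
    suspended at the end of a chunk and resumed on the next one.
    """
    EMPTY, R, N, RN, RNR = range(5)   # "", "\r", "\n", "\r\n", "\r\n\r"

    def __init__(self):
        self.state = self.EMPTY

    def feed(self, chunk):
        """Scan one chunk; return the offset just past a match, or None."""
        state = self.state
        for i, byte in enumerate(chunk):
            if byte == 0x0D:          # "\r"
                state = self.RNR if state == self.RN else self.R
            elif byte == 0x0A:        # "\n"
                if state in (self.N, self.RN, self.RNR):
                    self.state = self.EMPTY
                    return i + 1      # just completed "\n\n" or "\r\n\r\n"
                state = self.RN if state == self.R else self.N
            else:
                state = self.EMPTY
        self.state = state
        return None

m = DoubleLineEndMatcher()
assert m.feed(b"Host: example\r\n\r") is None   # suspended mid-terminator
assert m.feed(b"\n more data") == 1             # resumed; match ends at offset 1
```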
![](https://secure.gravatar.com/avatar/d24c45635a5171615a7cdb936f36daad.jpg?s=120&d=mm&r=g)
On Sun, Oct 7, 2018 at 4:40 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'm sure that Python will never be as efficient as C in that regard (although PyPy might argue the point) but is there something we can do to ameliorate this? If we could make char-by-char processing only 10 times less efficient than C instead of 100 times (let's say...) perhaps that would help Ram (and you?) with your use-cases?
Does that mean I'll have to write that character-by-character algorithm? I could already do that now I guess, the speed doesn't matter for my use case, but I'm trying to avoid writing an algorithm.
![](https://secure.gravatar.com/avatar/400c33ea93f52b90b51859901fd88a92.jpg?s=120&d=mm&r=g)
On 07Oct2018 07:32, Ram Rachum <ram@rachum.com> wrote:
On Sun, Oct 7, 2018 at 4:40 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'm sure that Python will never be as efficient as C in that regard (although PyPy might argue the point) but is there something we can do to ameliorate this? If we could make char-by-char processing only 10 times less efficient than C instead of 100 times (let's say...) perhaps that would help Ram (and you?) with your use-cases?
Does that mean I'll have to write that character-by-character algorithm? I could already do that now I guess, the speed doesn't matter for my use case, but I'm trying to avoid writing an algorithm.
No. You can write a line by line loop. See my other post. Cheers, Cameron Simpson <cs@cskk.id.au>
![](https://secure.gravatar.com/avatar/97c543aca1ac7bbcfb5279d0300c8330.jpg?s=120&d=mm&r=g)
On Sat, Oct 6, 2018, 18:40 Steven D'Aprano <steve@pearwood.info> wrote:
The message I take from this is:
- regex engines certainly can be written to support streaming data; - but few of them are; - and it is exceedingly unlikely to be able to easily (or at all) retro-fit that support to Python's existing re module.
I don't know enough about the re module internals to make an informed guess about the difficulty. On a quick glance, it does seem to store most intermediate match state in explicit structs rather than on the call stack... -n
![](https://secure.gravatar.com/avatar/d6b9415353e04ffa6de5a8f3aaea0553.jpg?s=120&d=mm&r=g)
On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram@rachum.com> wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".
So our requirements are:
1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
I believe that re is both overkill and slow for this particular problem. For O(n), search forward for \n with str.index('\n') (or .find). [I assume that this searches forward faster than `for i, c in enumerate(s): if c == '\n': break`, and leave you to test this.] If not found, continue with the next chunk of data. If found, look back for \r to determine whether to look forward for \n or \r\n, *whenever there is enough data to do so*.
2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last search left off*
s.index has an optional start parameter. And keep chunks in a list until you have a match and can join all at once. -- Terry Jan Reedy
![](https://secure.gravatar.com/avatar/97c543aca1ac7bbcfb5279d0300c8330.jpg?s=120&d=mm&r=g)
On Sun, Oct 7, 2018 at 5:09 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <ram@rachum.com> wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I wish that the re module could parse a stream, so I wouldn't have to load the whole thing into memory. I'd like to iterate over matches from the stream without keeping the old matches and input in RAM.
What do you think?
This has frustrated me too.
The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".
So our requirements are:
1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
I believe that re is both overkill and slow for this particular problem. For O(n), search forward for \n with str.index('\n') (or .find) [I assume that this searches forward faster than for i, c in enumerate(s): if c == '\n': break and leave you to test this.]
If not found, continue with next chunk of data. If found, look back for \r to determine whether to look forward for \n or \r\n *whenever there is enough data to do so*.
Are you imagining something roughly like this? (Ignoring chunk boundary handling for the moment.)

```python
def find_double_line_end(buf):
    start = 0
    while True:
        next_idx = buf.index(b"\n", start)
        if (buf[next_idx - 1:next_idx + 1] == b"\n\n"
                or buf[next_idx - 3:next_idx] == b"\r\n\r"):
            return next_idx
        start = next_idx + 1
```

That's much more complicated than using re.search, and on some random HTTP headers I have lying around it benchmarks ~70% slower too. Which makes sense, since we're basically trying to replicate the re engine's work by hand in a slower language.

BTW, if we only want to find a fixed string like b"\r\n\r\n", then re.search and bytearray.index are almost identical in speed. If you have a problem that can be expressed as a regular expression, then regular expression engines are actually pretty good at solving those :-)

-n

-- Nathaniel J. Smith -- https://vorpus.org
![](https://secure.gravatar.com/avatar/97c543aca1ac7bbcfb5279d0300c8330.jpg?s=120&d=mm&r=g)
On Sun, Oct 7, 2018 at 5:54 PM, Nathaniel Smith <njs@pobox.com> wrote:
Are you imagining something roughly like this? (Ignoring chunk boundary handling for the moment.)
```python
def find_double_line_end(buf):
    start = 0
    while True:
        next_idx = buf.index(b"\n", start)
        if (buf[next_idx - 1:next_idx + 1] == b"\n\n"
                or buf[next_idx - 3:next_idx] == b"\r\n\r"):
            return next_idx
        start = next_idx + 1
```
That's much more complicated than using re.search, and on some random HTTP headers I have lying around it benchmarks ~70% slower too. Which makes sense, since we're basically trying to replicate re engine's work by hand in a slower language.
BTW, if we only want to find a fixed string like b"\r\n\r\n", then re.search and bytearray.index are almost identical in speed. If you have a problem that can be expressed as a regular expression, then regular expression engines are actually pretty good at solving those :-)
Though... here's something strange. Here's another way to search for the first appearance of either \r\n\r\n or \n\n in a bytearray:

```python
def find_double_line_end_2(buf):
    idx1 = buf.find(b"\r\n\r\n")
    idx2 = buf.find(b"\n\n", 0, idx1)
    if idx1 == -1:
        return idx2
    elif idx2 == -1:
        return idx1
    else:
        return min(idx1, idx2)
```

So this is essentially equivalent to our regex (notice they both pick out position 505 as the end of the headers):

```
In [52]: find_double_line_end_2(sample_headers)
Out[52]: 505

In [53]: double_line_end_re = re.compile(b"\r\n\r\n|\n\n")

In [54]: double_line_end_re.search(sample_headers)
Out[54]: <_sre.SRE_Match object; span=(505, 509), match=b'\r\n\r\n'>
```

But, the Python function that calls bytearray.find twice is about ~3x faster than the re module:

```
In [55]: %timeit find_double_line_end_2(sample_headers)
1.18 µs ± 40 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [56]: %timeit double_line_end_re.search(sample_headers)
3.3 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

The regex module is even slower:

```
In [57]: double_line_end_regex = regex.compile(b"\r\n\r\n|\n\n")

In [58]: %timeit double_line_end_regex.search(sample_headers)
4.95 µs ± 76.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

-n

-- Nathaniel J. Smith -- https://vorpus.org
participants (17)
- 2015@jmunch.dk
- Anders Hovmöller
- Cameron Simpson
- Chris Angelico
- cs@cskk.id.au
- Erik Bray
- Greg Ewing
- James Lu
- Jonathan Fine
- Nathaniel Smith
- Ned Batchelder
- Ram Rachum
- Richard Damon
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy