Re: [Python-ideas] Support parsing stream with `re`

8 Oct 2018

I'm not an expert on memory. I used Process Explorer to look at the
process. The Working Set of the current run is 11GB. The Private Bytes is
708MB. Actually, see all the info here:
https://www.dropbox.com/s/tzoud028pzdkfi7/screenshot_TURING_2018-10-08_13335...

I've got 16GB of RAM on this computer, and Process Explorer says it's
almost full, just ~150MB left. This is physical memory.

To your question: The loop does iterate, i.e. it finds multiple matches.

On Mon, Oct 8, 2018 at 1:20 PM Cameron Simpson wrote:
...
On 08Oct2018 10:56, Ram Rachum wrote:
...
That's incredibly interesting. I've never used mmap before.
However, there's a problem.
I did a few experiments with mmap now, this is the latest:
path = pathlib.Path(r'P:\huge_file')
with path.open('r') as file:
   mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
...
   for match in re.finditer(b'.', mmap):
       pass
The file is 338GB in size, and it seems that Python is trying to load it
into memory. The process is now taking 4GB RAM and it's growing. I saw the
same behavior when searching for a non-existing match.
Should I open a Python bug for this?
Probably not. First figure out what is going on. BTW, how much RAM have you
got?
As you access the mapped file the OS will try to keep it in memory in case you
need that again. In the absence of competition, most stuff will get paged out
to accommodate it. That's normal. All the data are "clean" (unmodified) so the
OS can simply release the older pages instantly if something else needs the
RAM.
However, another possibility is that the regexp is consuming lots of memory.
The regexp seems simple enough (b'.'), so I doubt it is leaking memory like
mad; I'm guessing you're just seeing the OS page in as much of the file as it
can.
Also, does the loop iterate? i.e. does it find multiple matches as the memory
gets consumed, or is the first iteration blocking and consuming gobs of memory
before the first match comes back? A print() call will tell you that.
Cheers,
Cameron Simpson
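
The print() check suggested above could look roughly like the sketch below.
It is an illustration, not code from the thread: the function name
count_matches and the report_every parameter are hypothetical. It opens the
file in binary mode and binds the mapping to "mapped" so that the mmap module
name is not overwritten.

```python
import mmap
import re


def count_matches(path, pattern=b'.', report_every=1_000_000):
    """Scan a memory-mapped file with re.finditer, printing progress.

    count_matches and report_every are illustrative names (assumptions),
    not anything defined in the thread.
    """
    count = 0
    # Binary mode, since the file is searched with a bytes pattern.
    with open(path, 'rb') as file:
        # "mapped" rather than "mmap", so the module name stays usable.
        # Note: mapping an empty file with length 0 raises ValueError.
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for match in re.finditer(pattern, mapped):
                count += 1
                # Periodic output shows whether the loop is iterating
                # while memory use grows.
                if count % report_every == 0:
                    print(f'{count} matches so far, offset {match.start()}')
    return count
```

Called as, say, count_matches(r'P:\huge_file'), progress lines that keep
appearing while memory use grows would suggest the loop is iterating rather
than blocking on the first match.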