[Python-ideas] Support parsing stream with `re`

Ram Rachum ram at rachum.com
Mon Oct 8 06:36:04 EDT 2018


I'm not an expert on memory. I used Process Explorer to look at the
process. The Working Set of the current run is 11GB. The Private Bytes is
708MB. Actually, see all the info here:
https://www.dropbox.com/s/tzoud028pzdkfi7/screenshot_TURING_2018-10-08_133355.jpg?dl=0

I've got 16GB of RAM on this computer, and Process Explorer says it's
almost full, just ~150MB left. This is physical memory.

To your question: The loop does iterate, i.e. it finds multiple matches.

On Mon, Oct 8, 2018 at 1:20 PM Cameron Simpson <cs at cskk.id.au> wrote:

> On 08Oct2018 10:56, Ram Rachum <ram at rachum.com> wrote:
> >That's incredibly interesting. I've never used mmap before.
> >However, there's a problem.
> >I did a few experiments with mmap now, this is the latest:
> >
> >path = pathlib.Path(r'P:\huge_file')
> >
> >with path.open('r') as file:
> >    mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
>
> Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
>
> >    for match in re.finditer(b'.', mmap):
> >        pass
> >
> >The file is 338GB in size, and it seems that Python is trying to load it
> >into memory. The process is now taking 4GB RAM and it's growing. I saw the
> >same behavior when searching for a non-existing match.
> >
> >Should I open a Python bug for this?
>
> Probably not. First figure out what is going on. BTW, how much RAM have you
> got?
>
> As you access the mapped file the OS will try to keep it in memory in case you
> need that again. In the absence of competition, most stuff will get paged out
> to accommodate it. That's normal. All the data are "clean" (unmodified) so the
> OS can simply release the older pages instantly if something else needs the
> RAM.
>
> However, another possibility is that the regexp is consuming lots of memory.
>
> The regexp seems simple enough (b'.'), so I doubt it is leaking memory like
> mad; I'm guessing you're just seeing the OS page in as much of the file as it
> can.
>
> Also, does the loop iterate? i.e. does it find multiple matches as the memory
> gets consumed, or is the first iteration blocking and consuming gobs of memory
> before the first match comes back? A print() call will tell you that.
>
> Cheers,
> Cameron Simpson <cs at cskk.id.au>
>
>
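
For reference, the experiment discussed above could be sketched roughly as
follows. This is only an illustration, not the exact code that was run: the
names count_matches and report_every are hypothetical, the file is opened in
binary mode, and the mapping is called "mapped" so that the mmap module name
is not shadowed. The periodic print() is the check suggested in the quoted
message for whether the loop yields matches while memory use grows.

```python
import mmap
import re


def count_matches(path, pattern=b'.', report_every=1_000_000):
    """Scan a memory-mapped file with re.finditer and report progress.

    Hypothetical helper for illustration; returns the number of matches.
    """
    count = 0
    with open(path, 'rb') as file:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for match in re.finditer(pattern, mapped):
                count += 1
                # Periodic output shows that matches arrive incrementally
                # rather than only after the whole file has been read.
                if count % report_every == 0:
                    print(f'{count} matches so far, at offset {match.start()}')
    return count
```

Calling count_matches(r'P:\huge_file') while watching Working Set and Private
Bytes would show whether the growth tracks the pages being touched or
something private to the process. Note that mapping an empty file raises
ValueError.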