[Python-ideas] Support parsing stream with `re`

Mon Oct 8 03:56:15 EDT 2018

That's incredibly interesting. I've never used mmap before.

However, there's a problem.

I've now run a few experiments with mmap; this is the latest:

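import mmap
import pathlib
import re

path = pathlib.Path(r'P:\huge_file')

# Open in binary mode; call the mapping `mm` so it doesn't shadow the mmap module.
with path.open('rb') as file:
    mm = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(b'.', mm):
        pass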

The file is 338 GB, and it looks as though Python is trying to load it
into memory: the process is already using 4 GB of RAM and still growing.
I saw the same behavior when searching for a pattern that never matches.

Should I open a Python bug for this?
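
For anyone who wants to try this at a small scale, here is a minimal
sketch, assuming a throwaway temporary file rather than my huge one. The
names count_matches and sample.bin are made up for illustration. It only
shows that a bytes pattern can be run over an mmap object; it says
nothing about memory use on a very large file.

```python
import mmap
import re
import tempfile
from pathlib import Path


def count_matches(path, pattern):
    """Count non-overlapping matches of a bytes pattern in a memory-mapped file."""
    # Binary mode is required: bytes patterns search the raw file contents.
    with open(path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Consume matches one at a time instead of collecting them in a list.
        return sum(1 for _ in re.finditer(pattern, mm))


# Hypothetical sample data: 1000 short lines in a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.bin'
    sample.write_bytes(b'abc\n' * 1000)
    n = count_matches(sample, rb'abc')
```

The helper counts matches instead of returning them, so no match objects
outlive the mapping when it is closed.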

On Sun, Oct 7, 2018 at 7:49 PM <2015 at jmunch.dk> wrote:

> On 18-10-07 16.15, Ram Rachum wrote:
>  > I tested it now and indeed bytes patterns work on memoryview objects.
>  > But how do I use this to scan for patterns through a stream without
>  > loading it to memory?
>
> An mmap object is one of the things you can make a memoryview of.
> Looking again, though, it seems you don't even need to: you can
> just re.search the mmap object directly.
>
> re.search'ing the mmap object means the operating system takes care of
> the streaming for you, reading in parts of the file only as necessary.
>
> regards, Anders