Writing a known-size 1D ndarray serially as it's calced
I want to calc multiple ndarrays at once and lack the memory, so I want to write in chunks (here sized to GPU batch capacity). It seems there should be an interface to write the header, then write a number of elements cyclically, then add any closing footer and close the file.

Is it as simple as lib.format.write_array_header_2_0(fp, d), then writing multiple shape (N,) arrays of float via fp.write(item.tobytes())?
On Tue, Aug 23, 2022 at 8:47 PM <bross_phobrain@sonic.net> wrote:
I want to calc multiple ndarrays at once and lack memory, so want to write in chunks (here sized to GPU batch capacity). It seems there should be an interface to write the header, then write a number of elements cyclically, then add any closing rubric and close the file.
Is it as simple as lib.format.write_array_header_2_0(fp, d) then writing multiple shape(N,) arrays of float by fp.write(item.tobytes())?
`item.tofile(fp)` is more efficient, but yes, that's the basic scheme. There is no footer after the data.

The alternative is to use `np.lib.format.open_memmap(filename, mode='w+', dtype=dtype, shape=shape)`, then assign slices sequentially to the returned memory-mapped array. A memory-mapped array is usually going to be friendlier to whatever memory limits you are running into than a nominally "in-memory" array.

-- Robert Kern
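For concreteness, a minimal sketch of that scheme (the total length, chunk size, and filename below are placeholders; each chunk stands in for one GPU batch):

import numpy as np

N = 1_000_000              # total length, known up front (placeholder)
chunk = 10_000             # sized to GPU batch capacity (placeholder)
dtype = np.dtype(np.float32)

with open("out.npy", "wb") as fp:
    header = {"descr": np.lib.format.dtype_to_descr(dtype),
              "fortran_order": False,
              "shape": (N,)}
    np.lib.format.write_array_header_2_0(fp, header)
    for start in range(0, N, chunk):
        item = np.zeros(min(chunk, N - start), dtype=dtype)  # stand-in for one computed batch
        item.tofile(fp)    # raw data straight after the header; no footer needed

arr = np.load("out.npy")   # the result reads back like any other .npy file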
Hi all,

I've made the Pip/Conda module npy-append-array for exactly this purpose, see https://github.com/xor2k/npy-append-array. It works with one-dimensional arrays, too, of course. The key challenge is to properly initialize and update the header as the array grows, which my module takes care of.

I'd like to integrate this functionality directly into NumPy, see PR https://github.com/numpy/numpy/pull/20321/, but I have been busy and have not received any feedback recently. A more direct integration into NumPy would allow skipping or simplifying the header update, e.g. by introducing a new file format version. This could turn .npy into a sort of binary CSV equivalent, where the size of the array is determined by the file size.

Best, Michael
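If I remember the npy-append-array README correctly, usage looks roughly like this (an untested sketch; the filename and chunk size are placeholders):

import numpy as np
from npy_append_array import NpyAppendArray

filename = "out.npy"   # placeholder

with NpyAppendArray(filename) as npaa:
    for _ in range(10):
        batch = np.random.rand(10_000).astype(np.float32)  # one chunk at a time
        npaa.append(batch)   # the module rewrites the header as the file grows

data = np.load(filename, mmap_mode="r")   # reads back as a normal .npy file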
On 24. Aug 2022, at 03:04, Robert Kern <robert.kern@gmail.com> wrote:

On Tue, Aug 23, 2022 at 8:47 PM <bross_phobrain@sonic.net> wrote:
I want to calc multiple ndarrays at once and lack memory, so want to write in chunks (here sized to GPU batch capacity). It seems there should be an interface to write the header, then write a number of elements cyclically, then add any closing rubric and close the file.
Is it as simple as lib.format.write_array_header_2_0(fp, d) then writing multiple shape(N,) arrays of float by fp.write(item.tobytes())?
`item.tofile(fp)` is more efficient, but yes, that's the basic scheme. There is no footer after the data.
The alternative is to use `np.lib.format.open_memmap(filename, mode='w+', dtype=dtype, shape=shape)`, then assign slices sequentially to the returned memory-mapped array. A memory-mapped array is usually going to be friendlier to whatever memory limits you are running into than a nominally "in-memory" array.
-- Robert Kern
Thanks, np.lib.format.open_memmap() works great! With prediction procs using minimal sys memory, I can get twice as many on GPU, with fewer optimization warnings.

Why even have the number of records in the header? Shouldn't record size plus system-reported/growable file size be enough?

I'd love to have a shared-mem analog for smaller-scale data; now I load data and fork to emulate that effect.

My file sizes will exceed memory, so I'm hoping to get the most out of memmap. Will this in-loop assignment to predsum work to avoid loading all to memory?

predsum = np.lib.format.open_memmap(outfile, mode='w+', shape=(ids_sq,), dtype=np.float32)
for i in range(len(IN_FILES)):
    pred = numpy.lib.format.open_memmap(IN_FILES[i])
    predsum = np.add(predsum, pred) ################# <-
    del pred
del predsum

-- Phobrain.com

On 2022-08-23 18:02, Robert Kern wrote:
On Tue, Aug 23, 2022 at 8:47 PM <bross_phobrain@sonic.net> wrote:
I want to calc multiple ndarrays at once and lack memory, so want to write in chunks (here sized to GPU batch capacity). It seems there should be an interface to write the header, then write a number of elements cyclically, then add any closing rubric and close the file.
Is it as simple as lib.format.write_array_header_2_0(fp, d) then writing multiple shape(N,) arrays of float by fp.write(item.tobytes())?
`item.tofile(fp)` is more efficient, but yes, that's the basic scheme. There is no footer after the data.
The alternative is to use `np.lib.format.open_memmap(filename, mode='w+', dtype=dtype, shape=shape)`, then assign slices sequentially to the returned memory-mapped array. A memory-mapped array is usually going to be friendlier to whatever memory limits you are running into than a nominally "in-memory" array.

-- Robert Kern
On Thu, Aug 25, 2022 at 4:27 AM Bill Ross <bross_phobrain@sonic.net> wrote:
Thanks, np.lib.format.open_memmap() works great! With prediction procs using minimal sys memory, I can get twice as many on GPU, with fewer optimization warnings.
Why even have the number of records in the header? Shouldn't record size plus system-reported/growable file size be enough?
Only in the happy case where there is no corruption. Implicitness is not a virtue in the use cases that the format was designed for. There is an additional use case where the length is unknown a priori where implicitness would help, but the format was not designed for that case (and I'm not sure I want to add that use case).
I'd love to have a shared-mem analog for smaller-scale data; now I load data and fork to emulate that effect.
There are a number of ways to do that, including using memmap on files on a memory-backed filesystem like /dev/shm/ on Linux. See this article for several more options: https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processe...
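For example, a minimal sketch of the /dev/shm route, assuming Linux with a tmpfs mounted there (the path and shape are placeholders):

import numpy as np

# Writer process: create the array on the memory-backed filesystem.
shared = np.lib.format.open_memmap("/dev/shm/preds.npy", mode="w+",
                                   dtype=np.float32, shape=(1_000_000,))
shared[:] = 0.0
shared.flush()

# Reader processes: map the same file read-only.
# The OS shares the pages between processes rather than copying them.
view = np.lib.format.open_memmap("/dev/shm/preds.npy", mode="r")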
My file sizes will exceed memory, so I'm hoping to get the most out of memmap. Will this in-loop assignment to predsum work to avoid loading all to memory?
predsum = np.lib.format.open_memmap(outfile, mode='w+', shape=(ids_sq,), dtype=np.float32)
for i in range(len(IN_FILES)):
    pred = numpy.lib.format.open_memmap(IN_FILES[i])
    predsum = np.add(predsum, pred) ################# <-
This will replace the `predsum` array with a new in-memory array the first time through this loop. Use `out=predsum` to make sure that the output goes into the memory-mapped array:

np.add(predsum, pred, out=predsum)

Or the usual augmented assignment:

predsum += pred
    del pred
del predsum
The precise memory behavior will depend on your OS's virtual memory configuration. But in general, `np.add()` will go through the arrays in order, causing the virtual memory system to page in memory pages as they are accessed for reading or writing, and page out the old ones to make room for the new pages. Linux, in my experience, isn't always the best at managing that backlog of old pages, especially if you have multiple processes doing similar kinds of things (in the past, I have seen *each* of those processes trying to use *all* of the main memory for their backlog of old pages), but there are configuration tweaks that you can make.

-- Robert Kern
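Putting that correction together with the earlier loop, a sketch of the fixed version (outfile, IN_FILES, and ids_sq are the names from the original post; the values assigned to them here are placeholders):

import numpy as np

outfile = "predsum.npy"                   # placeholder
IN_FILES = ["pred_0.npy", "pred_1.npy"]   # placeholder list of input .npy files
ids_sq = 1_000_000                        # placeholder length

predsum = np.lib.format.open_memmap(outfile, mode='w+',
                                    shape=(ids_sq,), dtype=np.float32)
for in_file in IN_FILES:
    pred = np.lib.format.open_memmap(in_file, mode='r')  # read-only mapping
    predsum += pred    # in place; equivalent to np.add(predsum, pred, out=predsum)
    del pred           # drop the mapping so its pages can be reclaimed
predsum.flush()        # write everything back to outfile
del predsum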
Participants (4):

- Bill Ross
- bross_phobrain@sonic.net
- Michael Siebert
- Robert Kern