
Hello everyone,

I've modified loadtxt to make it (potentially) more memory efficient. The idea is that if a user passes a seekable file, (s)he can also pass the 'seekable=True' kwarg. Then, loadtxt will count the number of lines (containing data) and allocate an array of exactly the right size to hold the loaded data. The downside is that the line counting more than doubles the runtime, as it loops over the file twice, and there's a sort-of unnecessary np.array function call in the loop. The branch is called faster-loadtxt, which is silly due to the runtime doubling, but I'm hoping that the false advertising is acceptable :) (I naively expected a speedup by removing some needless list manipulation.)

I'm pretty sure that the function can be micro-optimized quite a bit here and there; in particular, the main for loop is somewhat duplicated right now. However, I got the impression that someone was working on a More Advanced (TM) C-based file reader, which will replace loadtxt; this patch is intended as a useful thing to have while we're waiting for that to appear.

The patch passes all tests in the test suite, and documentation for the kwarg has been added. I've modified all tests to include the seekable kwarg, but that was mostly to check that all tests still pass with this kwarg. I guess it's a bit too late for 1.7.0 though?

Should I make a pull request? I'm happy to take any and all suggestions before I do.

Cheers
Paul
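The two-pass scheme described above can be sketched roughly like this (a simplified illustration, not the actual patch -- it assumes whitespace-delimited numeric data and a seekable file; the function names are made up for this sketch):

```python
import numpy as np

def count_data_lines(fh, comments="#"):
    """Count lines that contain data (non-blank after stripping comments)."""
    n = 0
    for line in fh:
        if line.split(comments)[0].strip():
            n += 1
    return n

def loadtxt_two_pass(fh, dtype=float, comments="#"):
    """First pass counts rows; second pass fills a preallocated array."""
    nrows = count_data_lines(fh, comments)
    fh.seek(0)                      # rewind -- requires a seekable file
    out = None
    i = 0
    for line in fh:
        stripped = line.split(comments)[0].strip()
        if not stripped:
            continue
        values = [dtype(v) for v in stripped.split()]
        if out is None:             # allocate once the row width is known
            out = np.empty((nrows, len(values)), dtype=dtype)
        out[i] = values
        i += 1
    return out
```

The point is that the result array is allocated exactly once, at its final size, at the cost of reading the file twice.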

For convenience and clarity, this is the diff in question: https://github.com/Dynetrekk/numpy-1/commit/5bde67531a2005ef80a2690a75c65ceb...

And this is my numpy fork: https://github.com/Dynetrekk/numpy-1/

Paul

Paul,

Nice to see someone working on these issues, but: I'm not sure what problem you are trying to solve -- accumulating in a list is pretty efficient anyway -- not a whole lot of overhead.

But if you do want to improve that, it may be better to change the accumulating method, rather than doing the double-read thing. I've written, and posted here, code that provides an Accumulator that uses numpy internally, so not much memory overhead. In the end, it's not any faster than accumulating in a list and then converting to an array, but it does use less memory.

I also have a Cython version that is not quite done (darn regular job getting in the way) that is both faster and more memory efficient.

Also, frankly, just writing the array pre-allocation and re-sizing code into loadtxt would not be a whole lot of code either, and would be both fast and memory efficient.

Let me know if you want any of my code to play with.
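The kind of numpy-backed accumulator mentioned here might look roughly like this (a minimal sketch of the idea, not the actual code posted to the list; the class name and doubling strategy are illustrative):

```python
import numpy as np

class Accumulator:
    """Growable 1D buffer backed by a numpy array.

    Appends are amortized O(1): the backing store doubles when full,
    so memory overhead stays bounded, unlike a Python list of objects.
    """
    def __init__(self, dtype=float, capacity=128):
        self._data = np.empty(capacity, dtype=dtype)
        self._n = 0

    def append(self, value):
        if self._n == len(self._data):
            # double the backing store when full
            self._data = np.resize(self._data, 2 * len(self._data))
        self._data[self._n] = value
        self._n += 1

    def __array__(self, dtype=None, copy=None):
        # return a copy, so a later resize of the internal buffer
        # cannot corrupt the array handed back to the caller
        out = self._data[:self._n].copy()
        return out if dtype is None else out.astype(dtype)
```

With this, `np.asarray(acc)` at the end yields an ordinary array; the one extra copy in `__array__` is the price for keeping the accumulator safe to reuse.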
However, I got the impression that someone was working on a More Advanced (TM) C-based file reader, which will replace loadtxt; this patch is intended as a useful thing to have while we're waiting for that to appear.

yes -- I wonder what happened with that? Anyone?

-CHB
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R
(206) 526-6959 voice
(206) 526-6329 fax
(206) 526-6317 main reception
7600 Sand Point Way NE
Seattle, WA 98115
Chris.Barker@noaa.gov

On Monday, October 1, 2012, Chris Barker wrote:

However, I got the impression that someone was working on a More Advanced (TM) C-based file reader, which will replace loadtxt;

yes -- I wonder what happened with that? Anyone?

-CHB
I've finally built a new, very fast C-based tokenizer/parser with type inference, NA-handling, etc. for pandas, sporadically over the last month -- it's almost ready to ship. It's roughly an order of magnitude faster than loadtxt and uses very little temporary space. Should be easy to push upstream into NumPy to replace the innards of np.loadtxt if I can get a bit of help with the plumbing (it already yields structured arrays in addition to pandas DataFrames, so there isn't a great deal that needs doing).

Blog post with CPU and memory benchmarks to follow -- will post a link here.

- Wes

On 3. okt. 2012, at 17:48, Wes McKinney wrote:
I've finally built a new, very fast C-based tokenizer/parser with type inference, NA-handling, etc. for pandas sporadically over the last month-- it's almost ready to ship. It's roughly an order of magnitude faster than loadtxt and uses very little temporary space. Should be easy to push upstream into NumPy to replace the innards of np.loadtxt if I can get a bit of help with the plumbing (it already yields structured arrays in addition to pandas DataFrames so there isn't a great deal that needs doing).
Blog post with CPU and memory benchmarks to follow-- will post a link here.
- Wes
So Chris, looks like Wes has us beaten in every conceivable way. Hey, that's a good thing :)

I suppose the thing to do now is to make sure Wes' function tackles the loadtxt test suite?

Paul

On 1. okt. 2012, at 21:07, Chris Barker wrote:
I'm not sure the problem you are trying to solve -- accumulating in a list is pretty efficient anyway -- not a whole lot overhead.
Oh, there's significant overhead, since we're not talking of a list - we're talking of a list-of-lists. My guesstimate from my hacking session (off the top of my head - I don't have my benchmarks in working memory :) is around 3-5 times more memory with the list-of-lists approach for a single column / 1D array, which presumably is the worst case (a length 1 list for each line of input). Hence, if you want to load a 2 GB file into RAM on a machine with 4 GB of the stuff, you're screwed with the old approach and a happy camper with mine.
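The per-row overhead being estimated here can be eyeballed with `sys.getsizeof` (the exact numbers vary by Python version and build; this is purely illustrative, for the worst case of a single float column):

```python
import sys
import numpy as np

# Rough per-row cost of the list-of-lists accumulator for one
# float column: a one-element inner list, its float object, and
# the pointer slot in the outer list (64-bit build assumed).
row = [1.0]
inner = sys.getsizeof(row) + sys.getsizeof(row[0])
pointer = 8
print("bytes per row, list of lists:", inner + pointer)
print("bytes per row, float64 array:", np.dtype(np.float64).itemsize)
```

Even allowing for per-build variation, the list-of-lists path costs several times the 8 bytes a float64 occupies in a preallocated array, which is what bites when the file is a sizable fraction of RAM.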
But if you do want to improve that, it may be better to change the accumulating method, rather than doing the double-read thing. I've written, and posted here, code that provides an Accumulator that uses numpy internally, so not much memory overhead. In the end, it's not any faster than accumulating in a list and then converting to an array, but it does use less memory.
I see your point - but if you're to return a single array, and the file is close to the total system memory, you've still got a factor of 2 issue when shuffling the binary data from the accumulator into the result array. That is, unless I'm missing something here?

Cheers
Paul

On Wed, Oct 3, 2012 at 9:05 AM, Paul Anton Letnes <paul.anton.letnes@gmail.com> wrote:
I'm not sure the problem you are trying to solve -- accumulating in a list is pretty efficient anyway -- not a whole lot overhead.
Oh, there's significant overhead, since we're not talking of a list - we're talking of a list-of-lists.
hmm, a list of numpy scalars (custom dtype) would be a better option, though maybe not all that much better -- still an extra pointer and Python object for each row.
I see your point - but if you're to return a single array, and the file is close to the total system memory, you've still got a factor of 2 issue when shuffling the binary data from the accumulator into the result array. That is, unless I'm missing something here?
Indeed, I think that's how my current accumulator works -- the __array__() method returns a copy of the data buffer, so that you won't accidentally re-allocate it under the hood later and screw up the returned version.

But it is indeed accumulating in a numpy array, so it should be pretty possible, maybe even easy, to turn it into a regular array without a data copy. You'd just have to destroy the original somehow (or mark it as never-resize) so you wouldn't have the clash. Messing with the OWNDATA flag might take care of that.

But it seems Wes has a better solution.

One other note, though -- if you have arrays that are that close to max system memory, you are very likely to have other trouble anyway -- numpy does make a lot of copies!

-Chris
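The copy-versus-view trade-off described here is visible directly in numpy's array flags (a small illustration of the general mechanism, not the accumulator code itself):

```python
import numpy as np

buf = np.empty(100)        # stand-in for an accumulator's backing buffer
filled = buf[:10]          # a slice is a view into that buffer

print(filled.flags.owndata)   # False: resizing buf could invalidate it
safe = filled.copy()          # what a cautious __array__() would return
print(safe.flags.owndata)     # True: independent of the accumulator
```

Returning the view avoids the extra copy but ties the result's lifetime to the buffer; returning the copy is safe but briefly doubles the memory, which is exactly the factor-of-2 concern raised above.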

On 3. okt. 2012, at 18:22, Chris Barker wrote:
But it seems Wes has a better solution.
Indeed.
One other note, though -- if you have arrays that are that close to max system memory, you are very likely to have other trouble anyway -- numpy does make a lot of copies!
That's true. Now, I'm not worried about this myself, but several people have complained about this on the mailing list, and it seemed like an easy fix. Oh well, it's too late for it now, anyway.

Paul
participants (3):
- Chris Barker
- Paul Anton Letnes
- Wes McKinney