Hello,

I use numpy.histogramdd to compute three-dimensional histograms with a total number of bins on the order of 1e7. It is clear to me that such a histogram will take a lot of memory. For dtype=N.float64, it will take roughly 80 megabytes. However, I have the feeling that much more memory is needed during the histogram calculation itself. For example, when I have data.shape = (8e6, 3) and do a numpy.histogramdd(d, 280), I expect a histogram size of (280**3)*8 bytes = 176 megabytes, but during the calculation the memory use of pythonw.exe in the Windows Task Manager increases by up to 687 megabytes over the level before the calculation. When the calculation is done, the memory usage drops back to the expected value. I assume this is due to the way numpy.histogramdd works internally. However, when I need to calculate even bigger histograms, I cannot do it this way. So I have the following questions:

1) How can I tell histogramdd to use a dtype other than float64? My bins will be only sparsely populated, so an int16 should be sufficient. Without normalization, an integer dtype makes more sense to me.

2) Is there a way to use another algorithm (at the cost of performance) that uses less memory during calculation, so that I can generate bigger histograms?

My numpy version is '1.0.4.dev3937'

Thanks,
Lars

--
Dipl.-Ing. Lars Friedrich
Photonic Measurement Technology
Department of Microsystems Engineering -- IMTEK
University of Freiburg
Georges-Köhler-Allee 102
D-79110 Freiburg
Germany

phone: +49-761-203-7531
fax:   +49-761-203-7537
room:  01 088
email: lars.friedrich@imtek.de
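[Editor's note: the size arithmetic in the question checks out; a quick way to reproduce it:

import numpy as np

nbins = 280 ** 3                                    # 21,952,000 bins
print(nbins * np.dtype(np.float64).itemsize / 1e6)  # ~175.6 MB, as stated
print(nbins * np.dtype(np.int16).itemsize / 1e6)    # ~43.9 MB with int16
]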
Hi Lars,
[...]
2008/2/1, Lars Friedrich:
1) How can I tell histogramdd to use a dtype other than float64? My bins will be only sparsely populated, so an int16 should be sufficient. Without normalization, an integer dtype makes more sense to me.
There is no way to do that without tweaking the histogramdd function yourself. The relevant bit of code is the instantiation of hist:

hist = zeros(nbin.prod(), float)
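[Editor's note: for illustration, here is a minimal, self-contained sketch of the kind of change David describes, written as a standalone function rather than by editing histogramdd itself. It handles uniform bins only and lets the caller pick an integer accumulator dtype; the function name and signature are hypothetical, not numpy API:

import numpy as np

def histdd_int(data, nbins, lo, hi, dtype=np.int16):
    """D-dimensional histogram on uniform bins with an integer
    accumulator. Sketch only; assumes all samples lie in
    [lo, hi) in every dimension."""
    data = np.asarray(data, dtype=np.float64)
    ndim = data.shape[1]
    # Map each coordinate to a bin index in [0, nbins).
    idx = ((data - lo) * (nbins / (hi - lo))).astype(np.intp)
    np.clip(idx, 0, nbins - 1, out=idx)
    # Collapse the D per-axis indices into one linear index.
    flat = idx[:, 0].copy()
    for d in range(1, ndim):
        flat *= nbins
        flat += idx[:, d]
    # bincount does the counting; pad to the full bin count.
    counts = np.bincount(flat)
    hist = np.zeros(nbins ** ndim, dtype=dtype)
    hist[:len(counts)] = counts
    return hist.reshape((nbins,) * ndim)

Note that np.bincount still allocates a temporary platform-integer array, so to really cap peak memory you would combine this with the block-by-block accumulation suggested below.]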
2) Is there a way to use another algorithm (at the cost of performance) that uses less memory during calculation, so that I can generate bigger histograms?
You could work through your array block by block: simply fix the range, generate a histogram for each slice of 100k data points, and sum them up at the end. The current histogram and histogramdd implementations have the advantage of being general, that is, you can work with uniform or non-uniform bins, but they are not particularly efficient, at least for large numbers of bins (>30).

Cheers,
David
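[Editor's note: a minimal sketch of that block-by-block approach; the helper name and the 100k block size are placeholders, and bins and range are passed through to numpy.histogramdd unchanged:

import numpy as np

def blockwise_histogramdd(data, bins, ranges, block=100000):
    """Accumulate numpy.histogramdd block by block. Hypothetical
    helper, not numpy API; `ranges` must be fixed up front so that
    every block is binned on identical edges."""
    hist = None
    edges = None
    for start in range(0, len(data), block):
        h, e = np.histogramdd(data[start:start + block],
                              bins=bins, range=ranges)
        if hist is None:
            hist, edges = h, e
        else:
            hist += h  # edges are identical for every block
    return hist, edges

# Example: the 8e6 x 3 case from the question, binned on [0, 1)^3.
data = np.random.rand(8000000, 3)
hist, edges = blockwise_histogramdd(data, 280, [(0.0, 1.0)] * 3)

This keeps the per-call temporaries proportional to the block size instead of the full data set; the final 280**3 accumulator itself is of course still needed.]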