Hi, I have been beating up the matlab io rather severely in order to implement some cleanups, fixes, and add new options. I would very much appreciate it if people could pick up the current SVN and let me know whether they have any problems. Thanks a lot, Matthew
On Thu, Feb 19, 2009 at 10:42 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
I have been beating up the matlab io rather severely in order to implement some cleanups, fixes, and add new options.
I would very much appreciate it if people could pick up the current SVN and let me know whether they have any problems.
r5579 works fine on my system (Ubuntu 8.04 64-bit Python 2.5). -- Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
Hi Matthew - it seems to work on my computer (Mac OS 10.5.6), and quite fast at that (though I haven't measured precisely). However, it isn't quite backwards compatible with code written with a previous version of mio. If I am getting things right, the changes are such that, in order to get the same result as I got with the previous version, these lines of code:

mat_file = sio.loadmat('file_name.mat')
variable_values = mat_file['field_name'].variable

now have to be written:

mat_file = sio.loadmat('file_name.mat')
field_values = mat_file['field_name'][0][0].variable[0][0]

Cheers -- Ariel

On Thu, Feb 19, 2009 at 8:57 PM, Nathan Bell <wnbell@gmail.com> wrote:
On Thu, Feb 19, 2009 at 10:42 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
I have been beating up the matlab io rather severely in order to implement some cleanups, fixes, and add new options.
I would very much appreciate it if people could pick up the current SVN and let me know whether they have any problems.
r5579 works fine on my system (Ubuntu 8.04 64-bit Python 2.5).
-- Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
_______________________________________________
Scipy-dev mailing list
Scipy-dev@scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-dev
Hi Ariel, Here's a wave up the hill.
mat_file = sio.loadmat('file_name.mat')
variable_values = mat_file['field_name'].variable
Has to now be written:
mat_file = sio.loadmat('file_name.mat')
field_values = mat_file['field_name'][0][0].variable[0][0]
That's surprising. For a long time now, the reader has always returned at least 2D arrays from matlab, so the latter is what I was expecting. Are you sure this is a difference between 0.7 and current SVN? Can you check and then send me an example mat file with different behavior for the two versions? See you, Matthew
Hi Matthew, no - I have been comparing 0.6.0 and r5579, so everything I am saying henceforth may turn out to be irrelevant. At any rate, I attach a .mat file - for this file:

In [24]: sp.__version__
Out[24]: '0.6.0'

In [25]: mat_file = sio.loadmat('RMT110408.mat')

In [26]: mat_file
Out[26]:
{'ROI': <scipy.io.mio5.mat_struct object at 0x1b2ef110>,
 '__globals__': [],
 '__header__': 'MATLAB 5.0 MAT-file, Platform: MAC, Created on: Wed Dec 3 18:45:42 2008',
 '__version__': '1.0'}

In [31]: sp.__version__
Out[31]: '0.8.0.dev5579'

In [32]: mat_file = sio.loadmat('RMT110408.mat')

In [33]: mat_file
Out[33]:
{'ROI': array([[<scipy.io.matlab.mio5.mat_struct object at 0x1d277810>]], dtype=object),
 '__globals__': [],
 '__header__': 'MATLAB 5.0 MAT-file, Platform: MAC, Created on: Wed Dec 3 18:45:42 2008',
 '__version__': '1.0'}
From all that you have said, this is probably no surprise to you.
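The wrapping Ariel describes can be seen with a small round trip; this is just a sketch (the names 'field_name' and 'variable' are made up for the demo, and it assumes a scipy where loadmat/savemat accept file-like objects):

```python
import io

import numpy as np
from scipy.io import savemat, loadmat

# Round-trip a struct to show the wrapping: loadmat returns MATLAB
# structs as 1x1 arrays, so fields need an extra [0, 0] to unwrap.
buf = io.BytesIO()
savemat(buf, {"field_name": {"variable": np.arange(3)}})
buf.seek(0)

m = loadmat(buf)
struct = m["field_name"]      # a 1x1 struct array, not a bare struct
inner = struct[0, 0]          # unwrap the singleton dimensions
values = inner["variable"]    # the stored field, itself at least 2-D
print(values.shape)           # (1, 3): plain vectors come back 2-D too
```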
Cheers, Ariel

On Fri, Feb 20, 2009 at 9:05 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi Ariel,
Here's a wave up the hill.
mat_file = sio.loadmat('file_name.mat')
variable_values = mat_file['field_name'].variable
Has to now be written:
mat_file = sio.loadmat('file_name.mat')
field_values = mat_file['field_name'][0][0].variable[0][0]
That's surprising. For a long time now, the reader has always returned at least 2D arrays from matlab, so the latter is what I was expecting.
Are you sure this is a difference between 0.7 and current SVN? Can you check and then send me an example mat file with different behavior for the two versions?
See you,
Matthew
On Thu, Feb 19, 2009 at 7:42 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
I have been beating up the matlab io rather severely in order to implement some cleanups, fixes, and add new options.
I would very much appreciate it if people could pick up the current SVN and let me know whether they have any problems.
I finally got a chance to test with my nasty file, and with r5561, it now takes ~32 minutes of cpu time to load (as compared to ~5 minutes for 0.7.0, and 3 seconds for 0.6.0). All the time is in zlibstreams.py:read.

I talked to the guy whose data it is now, though, and he okayed my distributing an example:
http://roberts.vorpus.org/~njs/tmp/test.mat
http://roberts.vorpus.org/~njs/tmp/test-mat.txt
http://roberts.vorpus.org/~njs/tmp/test-mat.profile
(Sorry the file is so large; all my attempts to minimize it somehow also fixed whatever is making it so pathological.)

Does that help track things down? (This is also a good example file for why struct_as_record=True can be Very Very Useless, and if you combine struct_as_record=True with squeeze_me=True, the file ends up as gibberish -- a big tuple of anonymous variables, not so useful...)

I'm also wondering, though, if (as you mentioned downthread somewhere) the matlab IO code ends up doing a single short read and then reads the whole actual matrix data in one fell swoop, then what benefit does this streaming code give us? I thought that the point was that one could read small chunks and avoid taking the memory for a large temporary buffer, but if that's not happening, then it seems like a very slow and fragile chunk of code for no benefit.

-- Nathaniel
Hi,
I finally got a chance to test with my nasty file, and with r5561, it now takes ~32 minutes of cpu time to load (as compared to ~5 minutes for 0.7.0, and 3 seconds for 0.6.0). All the time is in zlibstreams.py:read.
I talked to the guy whose data it is now, though, and he okayed my distributing an example:
http://roberts.vorpus.org/~njs/tmp/test.mat
http://roberts.vorpus.org/~njs/tmp/test-mat.txt
http://roberts.vorpus.org/~njs/tmp/test-mat.profile
(Sorry the file is so large; all my attempts to minimize it somehow also fixed whatever is making it so pathological.)
Thanks - that's very useful.
Does that help track things down? (This is also a good example file for why struct_as_record=True can be Very Very Useless, and if you combine struct_as_record=True with squeeze_me=True, the file ends up as gibberish -- a big tuple of anonymous variables, not so useful...)
Also useful - thank you.
I'm also wondering, though, if (as you mentioned downthread somewhere) the matlab IO code ends up doing a single short read and then reads the whole actual matrix data in one fell swoop, then what benefit does this streaming code give us? I thought that the point was that one could read small chunks and avoid taking the memory for a large temporary buffer, but if that's not happening, then it seems like a very slow and fragile chunk of code for no benefit.
It may be that we'll have to pull it. The purpose of the two stage read - and the original purpose of the code - was to allow someone who is trying to read a particular variable to read enough of the zlib stream to get the name, in order to be able to skip it if the name is not the one they are looking for. Otherwise, they would have to read the whole stream - that might be very large - just to get the name. Thanks again, Matthew
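The two-stage idea can be illustrated with stdlib zlib alone; this is only a sketch (the name-then-body layout here is invented for the demo, not the real .mat format):

```python
import zlib

# Decompress just enough of a zlib stream to read a variable's name,
# so a non-matching variable can be skipped without inflating its body.
NAME_LEN = 8
payload = b"varname\x00" + b"x" * 1_000_000   # name header + big body
stream = zlib.compress(payload)

d = zlib.decompressobj()
# First stage: feed a small slice of compressed input, with output
# capped at NAME_LEN decompressed bytes -- enough to recover the name.
name = d.decompress(stream[:256], NAME_LEN)
print(name)                                   # b'varname\x00'
# If the name doesn't match, we never pay for the ~1 MB remainder;
# if it does, keep feeding d the rest of the stream.
```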
Hi, On Fri, Feb 20, 2009 at 11:58 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
I finally got a chance to test with my nasty file, and with r5561, it now takes ~32 minutes of cpu time to load (as compared to ~5 minutes for 0.7.0, and 3 seconds for 0.6.0). All the time is in zlibstreams.py:read.
Actually, thinking about it, I wonder if it's the string slicing in getting the data out of zlibstream that is taking the time. I suppose that might happen if you have lots of tiny matrices in there. Could you try:

import scipy.io.matlab as matlab
matlab.bench()

What kind of numbers do you get?

Best,

Matthew
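Matthew's hypothesis is easy to demonstrate in isolation; a minimal sketch (made-up helper names, timings machine-dependent) of why popping data off the front of a buffer by slicing hurts with many tiny reads:

```python
import io
import timeit

def reads_by_slicing(buf, chunk=16):
    pieces = []
    while buf:
        pieces.append(buf[:chunk])
        buf = buf[chunk:]           # copies everything left: O(n**2) total
    return pieces

def reads_by_stream(buf, chunk=16):
    pieces = []
    stream = io.BytesIO(buf)
    while True:
        piece = stream.read(chunk)  # just advances a cursor: O(n) total
        if not piece:
            return pieces
        pieces.append(piece)

data = b"x" * 100_000
assert reads_by_slicing(data) == reads_by_stream(data)
t_slice = timeit.timeit(lambda: reads_by_slicing(data), number=5)
t_stream = timeit.timeit(lambda: reads_by_stream(data), number=5)
print(t_slice, t_stream)            # slicing is typically far slower
```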
On Sat, 21 Feb 2009 00:25:04 -0800 Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
On Fri, Feb 20, 2009 at 11:58 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
I finally got a chance to test with my nasty file, and with r5561, it now takes ~32 minutes of cpu time to load (as compared to ~5 minutes for 0.7.0, and 3 seconds for 0.6.0). All the time is in zlibstreams.py:read.
Actually, thinking about it, I wonder if it's the string slicing in getting the data out of zlibstream that is taking the time. I suppose that might happen if you have lots of tiny matrices in there. Could you try:
import scipy.io.matlab as matlab
matlab.bench()
What kind of numbers do you get?
Best,
Matthew
Hi Matthew, I just ran the benchmark. Here are the results:
matlab.bench()
Running benchmarks for scipy.io.matlab
NumPy version 1.3.0.dev6436
NumPy is installed in /home/nwagner/local/lib64/python2.6/site-packages/numpy
SciPy version 0.8.0.dev5581
SciPy is installed in /home/nwagner/local/lib64/python2.6/site-packages/scipy
Python version 2.6 (r26:66714, Feb 3 2009, 20:49:49) [GCC 4.3.2 [gcc-4_3-branch revision 141291]]
nose version 0.10.4
<class 'scipy.io.matlab.zlibstreams.ZlibInputStream'>
reading gzip streams
========================================
        time(s)        | nbytes
----------------------------------------
   0.060 |    1.500    | 4000000
   0.240 |    1.200    | 20000000

<class 'scipy.io.matlab.zlibstreams.TwoShotZlibInputStream'>
reading gzip streams
========================================
        time(s)        | nbytes
----------------------------------------
   0.060 |    1.500    | 4000000
   0.240 |    1.333    | 20000000
.
----------------------------------------------------------------------
Ran 1 test in 10.152s

OK
True

Nils
Hi, On Fri, Feb 20, 2009 at 10:28 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, Feb 19, 2009 at 7:42 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
I have been beating up the matlab io rather severely in order to implement some cleanups, fixes, and add new options.
I would very much appreciate it if people could pick up the current SVN and let me know whether they have any problems.
I finally got a chance to test with my nasty file, and with r5561, it now takes ~32 minutes of cpu time to load (as compared to ~5 minutes for 0.7.0, and 3 seconds for 0.6.0). All the time is in zlibstreams.py:read.
Could you check current SVN again and see how it works? I've sped up zlibstreams and it's now saving memory on the read, at about a 12% drop in speed, which I now think is due to the overhead of a single extra function call on each of many small reads.

I'm unsure whether I want to leave zlibstreams in. It has the advantage of making skipping variables much faster and more memory efficient, and maybe some increase in memory efficiency as the variable is read, but still, the small performance penalty is annoying.

Best,

Matthew
On Sun, Feb 22, 2009 at 1:01 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
On Fri, Feb 20, 2009 at 10:28 PM, Nathaniel Smith <njs@pobox.com> wrote:
I finally got a chance to test with my nasty file, and with r5561, it now takes ~32 minutes of cpu time to load (as compared to ~5 minutes for 0.7.0, and 3 seconds for 0.6.0). All the time is in zlibstreams.py:read.
Could you check current SVN again and see how it works?
It's down to 4 seconds. Yay.
I've sped up zlibstreams and it's now saving memory on the read, at about a 12% drop in speed, which I now think is due to the overhead of a single extra function call on each of many small reads.
I'm unsure whether I want to leave zlibstreams in. It has the advantage of making skipping variables much faster and more memory efficient, and maybe some increase in memory efficiency as the variable is read, but still, the small performance penalty is annoying.
IMHO, if it lets one load gigabyte-matrices without allocating gigabyte temp variables, then that's a qualitative difference that's worth a small slowdown. If not, then neither the memory savings nor the slowdown are large enough for me to care much. (I don't tend to save/load matlab files in my inner loops, personally.)

The thing that does make me nervous is this code's fragility (as has been demonstrated repeatedly now). It's really non-obvious how small changes will affect its performance characteristics. Having read your changes, it isn't at all obvious to me why it's faster now. And e.g. I had to read StringIO.py to understand why you were recreating the StringIO object on every __fill. Just looking at zlibstreams.py, it appears wasteful and should be removed, but now I think that doing so could make it super-slow again.

Basically, I just don't want to have to come back at every release and complain about my weird files again...

-- Nathaniel
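The memory argument here can be sketched with stdlib zlib (a toy stand-in for the mio streaming code, not scipy's actual implementation): feeding a decompressobj in chunks keeps the working set to roughly one chunk, instead of materialising the whole decompressed payload in one temporary.

```python
import os
import zlib

payload = os.urandom(10_000_000)        # ~10 MB, incompressible
stream = zlib.compress(payload)

d = zlib.decompressobj()
CHUNK = 64 * 1024
total = 0
largest_piece = 0
for i in range(0, len(stream), CHUNK):
    piece = d.decompress(stream[i:i + CHUNK])
    total += len(piece)                 # process and drop each piece
    largest_piece = max(largest_piece, len(piece))
total += len(d.flush())
assert total == len(payload)            # every byte was seen...
print(largest_piece)                    # ...but only ~CHUNK at a time
```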
On Sun, Feb 22, 2009 at 02:05:41AM -0800, Nathaniel Smith wrote:
Basically, I just don't want to have to come back at every release and complain about my weird files again...
Contribute tests? If possible this seems the best way to ensure consistency. Gaël
On Sun, Feb 22, 2009 at 3:29 AM, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
On Sun, Feb 22, 2009 at 02:05:41AM -0800, Nathaniel Smith wrote:
Basically, I just don't want to have to come back at every release and complain about my weird files again...
Contribute tests? If possible this seems the best way to ensure consistency.
I would -- and I posted a link to the test file I'm using upthread -- but it's 300 megabytes and I don't know how to produce a smaller one. (The obvious tricks don't seem to work.) You're certainly welcome to include it if you *want*, but... -- Nathaniel
On Thu, Feb 19, 2009 at 10:42 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
I have been beating up the matlab io rather severely in order to implement some cleanups, fixes, and add new options.
I would very much appreciate it if people could pick up the current SVN and let me know whether they have any problems.
Adding Antonino, who made some comments in a related SciPy-User thread: http://article.gmane.org/gmane.comp.python.scientific.user/19614 -- Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
Matthew Brett <matthew.brett <at> gmail.com> writes:
Hi,
I have been beating up the matlab io rather severely in order to implement some cleanups, fixes, and add new options.
I would very much appreciate it if people could pick up the current SVN and let me know whether they have any problems.
I tried the SVN version and found it very fast. Even putting 1M as blocksize in scipy 0.7.0 the new version is a lot faster. Here are benchmarks loading a 50MB matlab file:

*SCIPY 0.7.0 modified with blocksize=1M*

4771 function calls (4768 primitive calls) in 1.318 CPU seconds

Ordered by: internal time
List reduced from 49 to 3 due to restriction <3>

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   400    1.011    0.003    1.011    0.003  {built-in method decompress}
    10    0.086    0.009    0.158    0.016  /usr/lib/python2.5/StringIO.py:95(seek)
     5    0.072    0.014    0.072    0.014  {method 'join' of 'str' objects}

*SCIPY '0.8.0.dev5592'*

582 function calls (579 primitive calls) in 2.957 CPU seconds

Ordered by: internal time
List reduced from 40 to 3 due to restriction <3>

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    27    1.823    0.068    2.846    0.105  gzipstreams.py:77(__fill)
    52    0.963    0.019    0.963    0.019  {built-in method decompress}
     9    0.065    0.007    0.065    0.007  {method 'copy' of 'numpy.ndarray' objects}
Thanks a lot,
Thanks for your work :)
Matthew
~ Antonio PS: put me in CC since I'm not a SciPy subscriber
Antonio <tritemio <at> gmail.com> writes:
*SCIPY 0.7.0 modified with blocksize=1M*
I swapped the headers when cutting and pasting the benchmarks, sorry. The conclusions do not change. This one is the new version, SCIPY '0.8.0.dev5592' (not 0.7.0):
4771 function calls (4768 primitive calls) in 1.318 CPU seconds
Ordered by: internal time
List reduced from 49 to 3 due to restriction <3>

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   400    1.011    0.003    1.011    0.003  {built-in method decompress}
    10    0.086    0.009    0.158    0.016  /usr/lib/python2.5/StringIO.py:95(seek)
     5    0.072    0.014    0.072    0.014  {method 'join' of 'str' objects}
while
*SCIPY '0.8.0.dev5592'*
the following refers to the old SCIPY 0.7.0
582 function calls (579 primitive calls) in 2.957 CPU seconds
Ordered by: internal time
List reduced from 40 to 3 due to restriction <3>

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    27    1.823    0.068    2.846    0.105  gzipstreams.py:77(__fill)
    52    0.963    0.019    0.963    0.019  {built-in method decompress}
     9    0.065    0.007    0.065    0.007  {method 'copy' of 'numpy.ndarray' objects}
As previously mentioned, the new version is faster. ~ Antonio
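Listings like the ones above can be reproduced with the stdlib profilers; a hedged sketch (the workload is an arbitrary stand-in, not Antonio's 50MB file):

```python
import cProfile
import io
import pstats
import zlib

# Profile a small decompression workload, then print the top three
# entries sorted by internal time ("List reduced ... restriction <3>").
def workload():
    blob = zlib.compress(b"x" * 1_000_000)
    for _ in range(50):
        zlib.decompress(blob)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("tottime").print_stats(3)
print(report.getvalue())
```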
participants (7)
- Antonio
- Ariel Rokem
- Gael Varoquaux
- Matthew Brett
- Nathan Bell
- Nathaniel Smith
- Nils Wagner