[scikit-image] Image analysis pipeline improvement suggestions
Nathan Faggian
nathan.faggian at gmail.com
Wed Dec 28 17:37:54 EST 2016
Hi Simone,
I have had a little experience with HDF5 and am interested to see where you
go with this. I wonder if you could use "feather":
https://github.com/wesm/feather
There was a recent post from Wes McKinney about feather, which sparked my
interest:
http://wesmckinney.com/blog/high-perf-arrow-to-pandas/
Do you use HDF5 to store intermediates? If so, I would try writing the
intermediates to a file format like Feather and then reducing them to a
single HDF5 file at the end. The reduction should be I/O bound rather than
RAM bound, so it would suit your cluster.
If you need to read a large array, then I think HDF5 supports that (single
writer, multiple readers) without the need for MPI - so this could map well
to a tool like distributed:
http://distributed.readthedocs.io/en/latest/
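A minimal sketch of that read pattern (file name and dataset key are made
up; assumes h5py and dask are installed - with distributed workers you would
typically reopen the file inside each task rather than share a handle):

```python
# One process writes the big array once; readers then wrap the HDF5
# dataset in a dask array and compute on it chunk by chunk, so the whole
# array is never loaded into RAM at once.
import numpy as np
import h5py
import dask.array as da

# The single writer.
with h5py.File("stack.h5", "w") as f:
    f.create_dataset("stack", data=np.arange(24).reshape(4, 6),
                     chunks=(2, 3))

# A reader: dask maps the on-disk chunks lazily.
f = h5py.File("stack.h5", "r")
stack = da.from_array(f["stack"], chunks=(2, 3))
total = stack.sum().compute()  # chunks are read on demand
f.close()
```

`da.from_array` over an h5py dataset is the standard idiom here, and it
needs no parallel-HDF5 build at all.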
Not sure whether this helps - I am assuming that your intermediate
calculations are not terabytes in size.
Good luck!
Nathan
On 29 December 2016 at 05:07, simone codeluppi <simone.codeluppi at gmail.com>
wrote:
> Hi all!
>
> I would like to pick your brains for some suggestions on how to modify my
> image analysis pipeline.
>
> I am analyzing terabytes of image stacks generated using a microscope. The
> current code relies heavily on scikit-image, numpy and scipy. In order to
> speed up the analysis, the code runs on an HPC cluster (
> https://www.nsc.liu.se/systems/triolith/) with MPI (mpi4py) for
> parallelization and HDF5 (h5py) for file storage. The development cycle of
> the code has been pretty painful, mainly due to my unfamiliarity with MPI
> and problems compiling parallel HDF5 (with many open/closing bugs).
> However, the big drawback is that each core has only 2 GB of RAM (no shared
> RAM across nodes), and in order to run some of the processing steps I ended
> up reserving one node (16 cores) but running only 3 cores in order to have
> enough RAM (image chunking won’t work in this case). As you can imagine,
> this is extremely inefficient, and I end up getting low priority in the
> queue system.
>
>
> Our lab recently bought a new 4-node server with shared RAM running
> Hadoop. My goal is to move the parallelization of the processing to dask. I
> tested it before on another system and it works great. The drawback is
> that, if I understood correctly, parallel HDF5 works only with MPI
> (driver=’mpio’). HDF5 gave me quite a bit of headache but works well for
> keeping the data well structured, and I can save everything as numpy
> arrays... very handy.
>
>
> If I move to hadoop/dask, what do you think would be a good solution
> for data storage? Do you have any additional suggestions that could
> improve the layout of the pipeline? Any help will be greatly appreciated.
>
>
> Simone
> --
> *Bad as he is, the Devil may be abus'd,*
> *Be falsy charg'd, and causelesly accus'd,*
> *When men, unwilling to be blam'd alone,*
> *Shift off these Crimes on Him which are their*
> *Own*
>
> *Daniel Defoe*
>
> simone.codeluppi at gmail.com
>
> simone at codeluppi.org
>
>
> _______________________________________________
> scikit-image mailing list
> scikit-image at python.org
> https://mail.python.org/mailman/listinfo/scikit-image
>
>