[scikit-image] Image analysis pipeline improvement suggestions
Nathan Faggian
nathan.faggian at gmail.com
Wed Dec 28 17:37:54 EST 2016
Hi Simone,
I have had a little experience with HDF5 and am interested to see where you
go with this. I wonder if you could use "feather":
https://github.com/wesm/feather
There was a recent post from Wes McKinney about feather, which sparked my
interest:
http://wesmckinney.com/blog/high-perf-arrow-to-pandas/
Do you use HDF5 to store intermediates? If so, I would try writing the
intermediates to a file format like Feather and then reducing them to a
single HDF5 file at the end. The reduction should be I/O bound rather than
RAM bound, so it would suit your cluster.
If you need to read a large array, then I think HDF5 supports that (single
writer, multiple readers) without the need for MPI - so this could map well
to a tool like distributed:
http://distributed.readthedocs.io/en/latest/
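A minimal sketch of that read pattern (file name and dataset key are made
up; assumes h5py and dask are installed - with distributed workers you would
typically reopen the file inside each task rather than share a handle):

```python
# One process writes the big array once; readers then wrap the HDF5
# dataset in a dask array and compute on it chunk by chunk, so the whole
# array is never loaded into RAM at once.
import numpy as np
import h5py
import dask.array as da

# The single writer.
with h5py.File("stack.h5", "w") as f:
    f.create_dataset("stack", data=np.arange(24).reshape(4, 6),
                     chunks=(2, 3))

# A reader: dask maps the on-disk chunks lazily.
f = h5py.File("stack.h5", "r")
stack = da.from_array(f["stack"], chunks=(2, 3))
total = stack.sum().compute()  # chunks are read on demand
f.close()
```

`da.from_array` over an h5py dataset is the standard idiom here, and it
needs no parallel-HDF5 build at all.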
Not sure whether this helps - I am assuming that your intermediate
calculations are not terabytes in size.
Good luck!
Nathan
On 29 December 2016 at 05:07, simone codeluppi <simone.codeluppi at gmail.com>
wrote:
> Hi all!
>
> I would like to pick your brains for some suggestions on how to modify my
> image analysis pipeline.
>
> I am analyzing terabytes of image stacks generated using a microscope. The
> current code relies heavily on scikit-image, numpy and scipy. In order to
> speed up the analysis, the code runs on an HPC cluster (
> https://www.nsc.liu.se/systems/triolith/) with MPI (mpi4py) for
> parallelization and HDF5 (h5py) for file storage. The development cycle of
> the code has been pretty painful, mainly due to my unfamiliarity with MPI
> and problems compiling parallel HDF5 (with many open/closing bugs).
> However, the big drawback is that each core has only 2 GB of RAM (no shared
> RAM across nodes), and in order to run some of the processing steps I ended
> up reserving one node (16 cores) but running only 3 cores in order to have
> enough RAM (image chunking won’t work in this case). As you can imagine,
> this is extremely inefficient, and I end up getting low priority in the
> queue system.
>
>
> Our lab recently bought a new 4-node server with shared RAM running
> Hadoop. My goal is to move the parallelization of the processing to dask. I
> tested it before on another system and it works great. The drawback is
> that, if I understood correctly, parallel HDF5 works only with MPI
> (driver=’mpio’). HDF5 gave me quite a bit of headache but works well for
> keeping the data well structured, and I can save everything as numpy
> arrays... very handy.
>
>
> If I move to hadoop/dask, what do you think would be a good solution
> for data storage? Do you have any additional suggestions that could
> improve the layout of the pipeline? Any help will be greatly appreciated.
>
>
> Simone
> --
> *Bad as he is, the Devil may be abus'd,*
> *Be falsy charg'd, and causelesly accus'd,*
> *When men, unwilling to be blam'd alone,*
> *Shift off these Crimes on Him which are their*
> *Own*
>
> *Daniel Defoe*
>
> simone.codeluppi at gmail.com
>
> simone at codeluppi.org
>
>
> _______________________________________________
> scikit-image mailing list
> scikit-image at python.org
> https://mail.python.org/mailman/listinfo/scikit-image
>
>