Mailman 3 Re: [scikit-image] Image analysis pipeline improvement suggestions - scikit-image

6 Jan 2017

      Hi Nathan
thanks a lot for the infos!
I looked into it and it seems promising, I just need to understand if it is
possible to save numpy data into feather.

In the pipeline I developed, in the majority of the intermediate steps the
generated numpy arrays are saved as single file (as .lzma using
joblib.dump) that are then written on a single hdf5 file at the end of the
processing step. However, as it is now one of the steps require parallel
writing to the file. After your suggestion and some discussion on stack
overflow I think that the best solution will be to modify the code and skip
the parallel writing.

The main issue with the RAM however is related to some of the large image
stacks I need to filter (using scipy already made filters). Unfortunately
the only way out will be to chunk the images, process them on different
cores and then re-join them if the full image can fit in RAM.

Thanks a lot for the help!

Simone

On Wed, Dec 28, 2016 at 2:37 PM Nathan Faggian <nathan.faggian@gmail.com>
wrote:
...
Hi Simone,
I have had a little experience with HDF5 and am interested to see where
you go with this. I wonder if you could use "feather":
    https://github.com/wesm/feather
There was a recent post from Wes McKinney about feather, which sparked my
interest:
   http://wesmckinney.com/blog/high-perf-arrow-to-pandas/
Do you use HDF5 to store intermediates? if so, I would try storing
intermediates to a file format like feather and then reducing to a HDF5
file at the end. The reduction should be IO bound and not dependent on RAM
so would suit your cluster.
If you need to read a large array then I think HDF5 supports that (for
single write but multiple reads) without the need for MPI - so this could
map well to a tool like distributed:
    http://distributed.readthedocs.io/en/latest/
Not sure this helps, there is an assumption (on my part) that your
intermediate calculations are not terabytes in size.
Good luck!
Nathan
On 29 December 2016 at 05:07, simone codeluppi <simone.codeluppi@gmail.com
...
wrote:
Hi all!
I would like to pick your brain for some suggestion on how to modify my
image analysis pipeline.
I am analyzing terabytes of image stacks generated using a microscope. The
current code I generated rely heavily on scikit-image, numpy and scipy. In
order to speed up the analysis the code runs on a HPC computer (
https://www.nsc.liu.se/systems/triolith/) with MPI (mpi4py) for
parallelization and hdf5 (h5py) for file storage. The development cycle of
the code has been pretty painful mainly due to my non familiarity with mpi
and problems in compiling parallel hdf5 (with many open/closing bugs).
However, the big drawback is that each core has only 2Gb of RAM (no shared
ram across nodes) and in order to run some of the processing steps i ended
up reserving one node (16 cores) but running only 3 cores in order to have
enough ram (image chunking won’t work in this case). As you can imagine
this is extremely inefficient and i end up getting low priority in the
queue system.
Our lab currently bought a new 4 nodes server with shared RAM running
hadoop. My goal is to move the parallelization of the processing to dask. I
tested it before in another system and works great. The drawback is that,
if I understood correctly, parallel hdf5 works only with MPI
(driver=’mpio’). Hdf5 gave me quite a bit of headache but works well in
keeping a good structure of the data and i can save everything as numpy
arrays….very handy.
If I will move to hadoop/dask what do you think will be a good solution
for data storage? Do you have any additional suggestion that can improve
the layout of the pipeline? Any help will be greatly appreciated.
Simone
--
*Bad as he is, the Devil may be abus'd,*
*Be falsy charg'd, and causelesly accus'd,*
*When men, unwilling to be blam'd alone,*
*Shift off these Crimes on Him which are their*
*Own*
*Daniel Defoe*
simone.codeluppi@gmail.com
simone@codeluppi.org
_______________________________________________
scikit-image mailing list
scikit-image@python.org
https://mail.python.org/mailman/listinfo/scikit-image
--
*Bad as he is, the Devil may be abus'd,*
*Be falsy charg'd, and causelesly accus'd,*
*When men, unwilling to be blam'd alone,*
*Shift off these Crimes on Him which are their*
*Own*
*Daniel Defoe*

simone.codeluppi@gmail.com

simone@codeluppi.org

Re: [scikit-image] Image analysis pipeline improvement suggestions

simone codeluppi

tags

participants (1)