<div dir="ltr">Hi Simone,<div><br></div><div>I have had a little experience with HDF5 and am interested to see where you go with this. I wonder if you could use "feather":</div><div>    <a href="https://github.com/wesm/feather">https://github.com/wesm/feather</a><br></div><div><br></div><div>There was a recent post from Wes McKinney about feather, which sparked my interest: </div><div>   <a href="http://wesmckinney.com/blog/high-perf-arrow-to-pandas/">http://wesmckinney.com/blog/high-perf-arrow-to-pandas/</a><br></div><div><br></div><div>Do you use HDF5 to store intermediates? If so, I would try storing the intermediates in a file format like feather and then reducing them to a single HDF5 file at the end. The reduction should be IO bound and not dependent on RAM, so it would suit your cluster. </div><div><br></div><div>If you need to read a large array, then I think HDF5 supports that (a single writer with multiple readers) without the need for MPI, so this could map well to a tool like distributed: </div><div>    <a href="http://distributed.readthedocs.io/en/latest/">http://distributed.readthedocs.io/en/latest/</a></div><div><br></div><div>Not sure this helps; I am assuming that your intermediate results are not terabytes in size. </div><div><br></div><div>Good luck!</div><div><br></div><div>Nathan </div><div><br></div><div class="gmail_extra">
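<div>P.S. Here is a rough sketch of the "write intermediates, reduce at the end" idea, in case it helps. To keep it free of extra dependencies I have used plain .npy files for the intermediates (standing in for feather) and a memory-mapped .npy file for the final output (standing in for HDF5); with h5py the reduction loop would write into a dataset instead. The file names and the toy process_plane function are made up for illustration, so treat this as a sketch under those assumptions rather than a tested pipeline.</div>

```python
import tempfile
from pathlib import Path

import numpy as np


def process_plane(plane):
    """Hypothetical per-image step; stands in for the real scikit-image work."""
    return plane.astype(np.float32) * 2.0


def write_intermediate(out_dir, index, plane):
    """Each worker writes its own file, so no parallel-HDF5 driver is needed."""
    path = Path(out_dir) / f"plane_{index:05d}.npy"
    np.save(path, process_plane(plane))
    return path


def reduce_to_single_file(paths, out_path):
    """Single-process reduction: copy one plane at a time into one output array.

    Only one plane is held in RAM at any moment, so this step is IO bound.
    """
    first = np.load(paths[0])
    out = np.lib.format.open_memmap(
        str(out_path),
        mode="w+",
        dtype=first.dtype,
        shape=(len(paths),) + first.shape,
    )
    for i, path in enumerate(paths):
        out[i] = np.load(path)
    out.flush()
    del out  # release the memory map before the file is reopened
    return out_path


def run_demo(n_planes=4, shape=(8, 8)):
    """Build a tiny fake stack, write intermediates, reduce, and reload."""
    rng = np.random.default_rng(0)
    stack = rng.integers(0, 255, size=(n_planes,) + shape, dtype=np.uint8)
    with tempfile.TemporaryDirectory() as tmp:
        paths = [write_intermediate(tmp, i, stack[i]) for i in range(n_planes)]
        out_path = reduce_to_single_file(paths, Path(tmp) / "stack.npy")
        result = np.load(out_path)
    return stack, result
```

<div>In a real run, write_intermediate would be the function handed to the dask/distributed workers, and reduce_to_single_file would run once at the end in a single process, which is why no MPI-enabled HDF5 build should be needed for that step.</div>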

<br><div class="gmail_quote">On 29 December 2016 at 05:07, simone codeluppi <span dir="ltr"><<a href="mailto:simone.codeluppi@gmail.com" target="_blank">simone.codeluppi@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><p dir="ltr" style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap">Hi all!</span></p><p dir="ltr" style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap">I would like to pick your brain for some suggestion on how to modify my image analysis pipeline.</span></p><p dir="ltr" style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap">I am analyzing terabytes of image stacks generated using a microscope. The current code I generated rely heavily on scikit-image, numpy and scipy. 
To speed up the analysis, the code runs on an HPC cluster (</span><a href="https://www.nsc.liu.se/systems/triolith/" rel="nofollow" style="margin:0px;padding:0px;border:0px;text-decoration:none;color:rgb(102,17,204)" target="_blank"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;background-color:transparent;text-decoration:underline;vertical-align:baseline;white-space:pre-wrap">https://www.nsc.liu.se/<wbr>systems/triolith/</span></a><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap">) with MPI (mpi4py) for parallelization and HDF5 (h5py) for file storage. The development cycle has been pretty painful, mainly due to my unfamiliarity with MPI and problems compiling parallel HDF5 (with many open/close bugs). However, the big drawback is that each core has only 2 GB of RAM (no shared RAM across nodes), and to run some of the processing steps I ended up reserving one node (16 cores) but running on only 3 cores in order to have enough RAM (image chunking won't work in this case). 
As you can imagine, this is extremely inefficient, and I end up getting low priority in the queue system.</span></p><p dir="ltr" style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap"><br></span></p><p dir="ltr" style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap">Our lab recently bought a new 4-node server with shared RAM running Hadoop. My goal is to move the parallelization of the processing to dask. I tested it before on another system and it works great. The drawback is that, if I understood correctly, parallel HDF5 works only with MPI (driver='mpio'). HDF5 gave me quite a few headaches, but it keeps the data well structured and I can save everything as numpy arrays, which is very handy. </span></p><p dir="ltr" style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap"><br></span></p><p dir="ltr" style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap">If I move to Hadoop/dask, what do you think would be a good solution for data storage? 
Do you have any additional suggestion that can improve the layout of the pipeline? Any help will be greatly appreciated.</span></p><span class="HOEnZb"><font color="#888888"><p dir="ltr" style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap"><br></span></p><p style="margin:0pt 0px;padding:0px;border:0px;line-height:1.38;color:rgb(34,34,34);font-family:arial,helvetica,sans-serif;font-size:13px"><span style="margin:0px;padding:0px;border:0px;font-size:14.6667px;font-family:arial;color:rgb(0,0,0);background-color:transparent;vertical-align:baseline;white-space:pre-wrap">Simone</span></p></font></span></div><span class="HOEnZb"><font color="#888888"><div dir="ltr">-- <br></div><div data-smartmail="gmail_signature"><div dir="ltr"><i><font color="#000000">Bad as he is, the Devil may be abus'd,</font></i><div><i><font color="#000000">Be falsy charg'd, and causelesly accus'd,</font></i></div><div><i><font color="#000000">When men, unwilling to be blam'd alone,</font></i></div><div><i><font color="#000000">Shift off these Crimes on Him which are their</font></i></div><div><i><font color="#000000">Own</font></i></div><div><br></div><div><pre><font size="2" face="arial, helvetica, sans-serif">                                                      <i>Daniel Defoe</i></font></pre></div><div><a>simone.codeluppi@gmail.com</a></div><div><br></div><div><a>simone@codeluppi.org</a></div><div><br></div></div></div>

</font></span><br>______________________________<wbr>_________________<br>

scikit-image mailing list<br>

<a href="mailto:scikit-image@python.org">scikit-image@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-image" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/scikit-image</a><br>

<br></blockquote></div><br></div></div>