Hey all,

Thanks for the nice discussion. As far as the nD discussion goes, it seems:

1. The inclusion of new algorithms should be weighed in terms of their availability in other popular image processing libraries. Can scikit-image provide any enhancement?
2. If so, a 2D-only algorithm should not necessarily block the contribution.
3. Cleanups and enhancements should proceed while motivating the need for an API that is compatible with nD extensions.
4. Rethink the nD-ification of the algorithm when most of the other issues have been addressed.

There seems to be another discussion brewing about how to experiment with different memory models and coding styles that might be (in)compatible with array types such as dask and cupy. Somebody should probably open a separate thread on why we should invest in numba.

Mark

On Thu, Aug 8, 2019 at 1:51 AM Juan Nunez-Iglesias <jni@fastmail.com> wrote:
I have actual work to do but Josh activated this instinct <https://xkcd.com/386/> within me. =P I also wouldn't be much of a Guardian of the Dimensions if I didn't respond to a full frontal assault on said dimensions. =P
I agree 100% that most data is either 2D or 3D. *Sometimes* we have 3D movies (3D + t), and if the time resolution is fine enough, then a 4D algorithm is advantageous. But that's rare. However, such 4D data is often kept in a 4D array (and sometimes even 5D, where the 5th dimension is samples), and it's easier to write
ndimage.gaussian_filter(arr5d, sigma=(0, 0, 1, 5, 5))
than to do two nested for-loops. This is doubly true now that we have dask arrays.
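For illustration, here's a rough sketch of the loops that one-liner replaces (the array shape and sigma values are made up, but the pattern is the point):

    import numpy as np
    from scipy import ndimage

    # Illustrative 5D array: (samples, time, z, rows, cols).
    arr5d = np.random.rand(2, 10, 30, 64, 64)

    # The one-liner above, versus the loops it replaces. The results match
    # because sigma=0 along the first two axes means no smoothing there.
    smoothed = np.empty_like(arr5d)
    for s in range(arr5d.shape[0]):
        for t in range(arr5d.shape[1]):
            smoothed[s, t] = ndimage.gaussian_filter(arr5d[s, t], sigma=(1, 5, 5))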
But anyway, I accept that 2D+3D is good enough, in terms of utility.
What I absolutely don't accept is that either maintainability or performance needs to suffer to go nD.
My talk at SciPy about getting maximum performance out of Numba was all based on an nD implementation. All the same issues arise in 2D/3D and nD. The main issue with Numba or Cython is dealing with complex data structures such as heaps, queues, priority queues, etc. Those issues are identical for 2D and nD code.
The padding of a volume to avoid neighbour issues was done for convenience. It can and should be removed, and checking for out-of-bounds access is a mod (%) calculation away. Indeed, when it comes to performance, and optimising loop order for cache-efficient memory access, you can *only* do it correctly if you abstract away the axes. If you write for x in range(im.shape[0]): for y in range(im.shape[1]): ..., and now you want to adapt your code to a Fortran-ordered array, it's very hard. If your logic is instead something like for ax in range(im.ndim): for x in range(im.shape[ax]): ..., you can replace the range(ndim) with sorted(range(ndim), key=lambda i: im.strides[i]).
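To make that concrete, here's a minimal sketch (the function name is illustrative, not from skimage) of picking a cache-friendly loop order from the strides:

    import numpy as np

    # Sort axes by stride, largest first, so the innermost loop walks the
    # most contiguous axis, regardless of C or Fortran memory order.
    def axes_by_stride(image):
        return sorted(range(image.ndim), key=lambda i: -image.strides[i])

    c_order = np.zeros((4, 5, 6))
    f_order = np.asfortranarray(c_order)
    print(axes_by_stride(c_order))  # [0, 1, 2] -> innermost loop over axis 2
    print(axes_by_stride(f_order))  # [2, 1, 0] -> innermost loop over axis 0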
In short, nD opens up *more* avenues for optimisation, not fewer.
When it comes to maintainability:
1. Branching code paths (2D/3D) are not more maintainable, they are less maintainable. If something needs to be changed in the future (say, adding tolerance to the neighbour check, or adding an arbitrary predicate for adding a neighbour), it now needs to be changed in several places.
2. 2D code, and especially 3D code like the one you linked, is locally simple but globally complex. It's hard to know if you've missed a direction. It's hard to adapt to different neighbourhoods.
3. It results in many more lines of code (you would have 26 if statements for full connectivity in 3D; see the sketch just below), among which errors can hide.
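For concreteness, a minimal sketch (not from any existing PR) of the nD alternative to those 26 if statements: one iteration over offset combinations covers any dimensionality.

    import itertools
    import numpy as np

    # All full-connectivity neighbour offsets around a pixel in ndim
    # dimensions: 8 in 2D, 26 in 3D, 3**ndim - 1 in general.
    def neighbour_offsets(ndim):
        offsets = [d for d in itertools.product((-1, 0, 1), repeat=ndim)
                   if any(d)]  # drop the all-zeros centre offset
        return np.array(offsets)

    print(len(neighbour_offsets(2)))  # 8
    print(len(neighbour_offsets(3)))  # 26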
In many cases, changing the code to nD makes it *more* readable and maintainable, because magic values get replaced by a principled iteration over the possible axes/axis combinations. See for example my conversion of Hessian filter to nD:
https://github.com/scikit-image/scikit-image/pull/2194/files#diff-70ef54b6b6...
(which resulted in a nice speedup btw!)
And who can forget Vighnesh's nD RAG code? Compare the 2D/3D approach to the nD one:
https://github.com/elegant-scipy/elegant-scipy/blob/master/markdown/ch3.mark...
I find it quite sobering to think that this code might never have existed if I hadn't prodded Vighnesh to go for nD.
I argue that nD code is a bit like 0-indexing vs 1-indexing: at first 0-indexing is weirder, less intuitive, and you make more off-by-one errors because you're not used to thinking about it. But after just a bit of practice, it becomes intuitive, you make *fewer* errors because you are not adding and subtracting 1 all over the place, and the code itself is more readable, for the same reason.
I'll summarise my current position as follows:
- Yes, it can be a barrier to getting PRs merged, and we should think of ways to lower that barrier.
- Mark's suggestion to focus on the API is excellent, and in many cases that might mollify me. =)
- Josh's suggestion that 3D/2D is good enough for almost all uses is correct. My order of preference for code is: nD + anisotropy > nD > 3D (and 2D with a flat, length-1 axis) >> 3D/2D (branching code paths) >> 2D.
- Although many PRs got stopped in their tracks by an nD requirement, many were improved dramatically by it, for example `register_translation`. We should exercise judgement when it comes to deciding just how much of a burden it is to convert code to nD, and we should always at least ask ourselves the question. I need to learn to adjust that estimate for the contributor rather than for myself. ;) But,
- We can also think of lowering the burden for the contributor. That is, I absolutely need to write a guide for working with nD code, based on all the 2D -> nD conversions we've had in the past.
- In many cases, I do think it's preferable to not have an algorithm in skimage than to have it be 2D-only. (I think this of SEEDS, but not of flood_fill.)
Juan.
PS: Could someone review and hopefully approve https://github.com/scikit-image/scikit-image/pull/2043? =P
On Wed, 7 Aug 2019, at 8:01 PM, Josh Warner wrote:
Often with these sorts of discussions I end up taking the opposite side, regardless of my actual position, to make sure we look at all of the important aspects critically. Historically I've been very positive on 3D+ support, even before Juan joined the party, but I'm going to examine this from a pragmatic angle.
As someone who works in medical imaging, I am generally very keen on having 3D implementations (and, in a perfect world, these would support anisotropic spacing along at least one dimension, because slice thicknesses are very often not the same as the in-plane spacing, and reslicing volumes to isotropic spacing is expensive and tricky in terms of the interpolation used). Time isn't a true 4th dimension, and the vast majority of data above 3D is multispectral, rather than volumetric in higher dimensions. I think for the majority of users, 3D anisotropic implementations would be sufficient; how many people come to us with actual 4D+ data?

When we're talking N-D from an algorithmic standpoint, the data needs to have N *spatial* dimensions for this abstraction to make real sense. I think this is actually quite rare above N=3, though I could be wrong (I'm interested in any examples you all are aware of which use 4D+ spatial data). The reason I bring this up is because if we hypothetically set the bar at some point, say 3D, we have a lot more options than for N-D in terms of clean, JIT- and maintainer-friendly code.
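As a sketch of what anisotropic support buys you (the spacing and sigma values here are made up for illustration):

    import numpy as np
    from scipy import ndimage

    # Hypothetical CT-like volume: slice thickness much larger than the
    # in-plane spacing.
    spacing_mm = np.array([5.0, 0.7, 0.7])
    volume = np.random.rand(40, 256, 256)

    # Convert a physical smoothing scale into per-axis sigmas in voxel
    # units, so the filter is isotropic in millimetres without having to
    # reslice the volume first.
    sigma_mm = 2.0
    smoothed = ndimage.gaussian_filter(volume, sigma=sigma_mm / spacing_mm)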
We have more options because there is actually a *massive* difference between an algorithm written with nested loops in 2D and 3D separately, versus a truly N-D version of the same algorithm. This is not a quick fix; it's usually an entirely different approach. The massive difference dramatically narrows the pool of potential contributors who could bring such an algorithm to us, and even asking the question will generally result in abandoned PRs and - like it or not - the perception that if a potential contributor can't (or doesn't think they can) manage things in N-D, they might as well not try. Do we find this acceptable?
The performance and future support implications are also nontrivial. The theoretical nested loop code I mention above is very easy to understand, for reviewers and future maintainers, and is already in an ideal form for Numba to JIT. As Numba continues to improve, the nested loops with branching paths for 2D and 3D might well beat a complex N-D implementation written in Cython (which may have no real pathway to JITting). At SciPy I heard Pandas is considering Numba as a hard dependency. We should probably be paying close attention to that.
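As a sketch of what I mean by "ideal form for Numba" (the function below is illustrative, not skimage code): plain nested loops with no fancy data structures compile to fast machine code with a one-line decorator.

    import numpy as np
    from numba import njit

    @njit
    def local_maxima_2d(image):
        # Mark interior pixels that are strictly greater than all 8 neighbours.
        out = np.zeros(image.shape, dtype=np.bool_)
        for r in range(1, image.shape[0] - 1):
            for c in range(1, image.shape[1] - 1):
                is_max = True
                for dr in range(-1, 2):
                    for dc in range(-1, 2):
                        if (dr != 0 or dc != 0) and image[r + dr, c + dc] >= image[r, c]:
                            is_max = False
                out[r, c] = is_max
        return out

    peaks = local_maxima_2d(np.random.rand(100, 100))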
An example from direct experience: I had flood fill written in Cython for 2D over 3 years ago (see https://gist.github.com/JDWarner/83808ce4c9374a242905). The 3D case is trivial to add as a separate path, but we wanted this to be N-D. I didn't see an ideal approach to do so, thus it went on the back burner until I saw Lars' work on #3022 <https://github.com/scikit-image/scikit-image/pull/3022>. Borrowing and building on that framework, years later we got N-D flood fill in #3245 <https://github.com/scikit-image/scikit-image/pull/3245>. Along the way there were unintuitive pitfalls in the "ravel approach", as we're calling it in #4088 <https://github.com/scikit-image/scikit-image/issues/4088>, which were difficult for me to find, and likely a big challenge for reviewers to catch. This approach also requires the image be padded by 1, then cropped back down, as a way to know where the borders are, since we're working on the raveled array. Copying the input and then making it even larger is bad practice for large images and may cause memory issues (nested loop implementations, in contrast, require neither copies nor padding). So yes, flood fill is now N-D and there is one code path, but it took over 2 additional years... and would probably be more performant in 2D/3D as nested loops, more maintainable, as well as better future-proofed (Numba). As nice as it is, I do not think this algorithm as written could be extracted from Cython and handed to Numba. Would it have been better to have robust 2D and 3D flood fills years earlier?
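For readers following along, here's a rough sketch of the core idea behind the ravel approach (a simplification, not the actual #3245 code):

    import numpy as np

    # On a raveled (flattened) array, the face-connected neighbours of
    # pixel i sit at fixed integer offsets, whatever the dimensionality.
    # The 1-pixel pad mentioned above is what keeps i + offset from
    # wrapping across image borders.
    def raveled_neighbour_offsets(shape):
        strides = np.cumprod((1,) + tuple(shape[::-1][:-1]))[::-1]
        return np.concatenate([strides, -strides])

    image = np.pad(np.random.rand(4, 5), 1)
    flat = image.ravel()
    offsets = raveled_neighbour_offsets(image.shape)  # [7, 1, -7, -1] for shape (6, 7)
    # flat[i + offsets] are the face-connected neighbours of pixel i.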
Food for thought.
Josh
On Wed, Aug 7, 2019 at 7:49 AM Mark Harfouche <mark.harfouche@gmail.com> wrote:
Thank you all for responding.
In general, I agree that the "easy ND contributions" should be addressed ASAP.
That said, it seems there is consensus that the ND conversation can be moved later in the PR cycle, once most of the other issues have been addressed: docstrings, testing for the 2D case, API choices, "fit", examples, gallery.
I think many of you rightfully pointed out that good code is often generalizable to ND quite readily. My hope is that by delaying the ND conversion, the PR will already be in a really good state, so that the conversion will be an "easy fix".
Lars, I will have to have a detailed look at your PR.
On Wed, Aug 7, 2019 at 9:48 AM Alexandre de Siqueira < alex.desiqueira@igdore.org> wrote:
Hey everyone,
Alexandre, something that I want to emphasise right now is that I would like to consider the next ~12-24 months a sprint to 1.0. Therefore, I want to avoid as much as possible taking on new technical debt and focus on fixing/cleaning everything we have so far, according to our (as yet not ratified) values. [1]
We're on the same page, Juan :) Thinking about that, it would be interesting to have something in our model close to Debian's release cycle. We'd work on and receive major contributions up to a certain point (let's say, the next 6-12 months). Then, we'd clean and debug everything we can, not accepting new major code for that release; it'd be something like Debian's freeze period. But I'm digressing.
Kind regards,
Alex
On Wed, 2019-08-07 at 00:24 -0500, Juan Nunez-Iglesias wrote:
Thanks Mark for raising this and Lars and Alex for weighing in.
On Mon, 5 Aug 2019, at 5:40 AM, Alexandre de Siqueira wrote:
I think what is important is to keep the API forward-looking for ND. We can do things like accept tuples for parameters, instead of "x_center", "y_center", or "r_center", "c_center". We can also just throw errors when people pass in higher-dimensional images, saying that it isn't implemented and contributions are welcome.
I think this is critical and really like this compromise. Don't do more work than is actually needed while not making future work harder. Equally critical to me is that whether a function supports ND or only 2D is clearly documented. Maybe after a parameter's type declaration in the docstring?
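A hypothetical sketch of what that might look like (the names and signature are illustrative, not an actual skimage function):

    # 2D-only keyword pairs lock the API into two dimensions:
    def draw_disk_2d(image, r_center, c_center, radius):
        ...

    # A coordinate tuple keeps the same API valid in nD, even if the
    # implementation only handles 2D for now:
    def draw_disk(image, center, radius):
        """Draw a disk.

        center : tuple of int
            Coordinates of the disk centre. Currently 2D only.
        """
        if image.ndim != 2:
            raise NotImplementedError(
                "draw_disk currently supports 2D images only; "
                "contributions welcome!"
            )
        ...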
I guess these are all good solutions for us right now. I'm sure that we'll break, fix and clean lots of stuff/code/API until we reach 1.0 (and that is okay™). These problems will be tackled naturally along the way.
I agree with Lars, keeping the *API* nD-compatible is already a very good step towards keeping the Guardian of the Dimensions happy. =P Great suggestion, Mark!
Alexandre, something that I want to emphasise right now is that I would like to consider the next ~12-24 months a sprint to 1.0. Therefore, I want to avoid as much as possible taking on new technical debt and focus on fixing/cleaning everything we have so far, according to our (as yet not ratified) values. [1]
Optimizing for the different use cases is just tricky, and won't be done correctly if contributors are asked to include something they have no immediate need for.
We can help in some of these cases. If we (core group) can't, we can: 1. discuss if the algorithm is ready as-is to be included and upgraded from there; 2. ask for external community help.
My main idea is that I am happy to jump in and make things nD at the PR stage. However, I don't have the bandwidth to do this for *all* contributions and help is always appreciated. This is why it's important to be on the same page.
Additionally, as I raised in the SEEDS PR [2], a contribution might be worthwhile in nD but not worthwhile in 2D, because 2D implementations in Python already exist. (And in this case they have a good chance of being faster than ours!) We are having enough trouble keeping up with the maintenance burden as it is, so the value of new contributions to the community should be weighed against the maintenance burden.
It might be time to revisit the idea of skimage-contrib, a repo in which functions that don't quite meet this benchmark can live, until they are ready to graduate.
In that vein, wasn't there a discussion about writing a guide on ND-algorithms (e.g. the raveled approach used in watershed, local_maxima, flood_fill)? Was there any progress on this? I think this could be really useful to the community and a good resource for new contributors regardless of how skimage's policy on this turns out.
+1. Or even just for contributors aiming to enhance our 3D/nD capabilities :)
Yes, I mentioned that I should write a guide and here we are, no further along. Apologies for that. I'll prioritise this work for the summer/fall. If we have an easy guide, including pointing to PRs or commits that convert code to nD, that will make things a lot easier.
Juan.
.. [1] https://github.com/scikit-image/scikit-image/pull/3585
.. [2] https://github.com/scikit-image/scikit-image/pull/1557
--
--------------------------------------------------
Dr. Alexandre de Siqueira
Berkeley Institute for Data Science - BIDS
190 Doe Library, University of California, Berkeley
Berkeley, CA 94720, United States
Lattes CV: 3936721630855880
ORCID: 0000-0003-1320-4347
Github: alexdesiqueira
--------------------------------------------------