I have actual work to do, but Josh activated this instinct within me. =P I also wouldn't be much of a Guardian of the Dimensions if I didn't respond to a full frontal assault on said dimensions. =P

I agree 100% that most data is either 2D or 3D. *Sometimes* we have 3D movies (3D + t), and if the time resolution is fine enough, then a 4D algorithm is advantageous. But that's rare. However, such 4D data is often kept in a 4D array (and sometimes even 5D, where the 5th dimension is samples), and it's easier to write `ndimage.gaussian_filter(arr5d, sigma=(0, 0, 1, 5, 5))` than to do two nested for-loops. This is doubly true now that we have dask arrays.

But anyway, I accept that 2D+3D is good enough, in terms of utility.

What I absolutely don't accept is that either maintainability or performance need to suffer to go nD.

My talk at SciPy about getting maximum performance out of Numba was all based on an nD implementation. All the same issues arise in 2D/3D as in nD. The main issue with Numba or Cython is dealing with complex data structures such as heaps, queues, priority queues, etc. Those issues are identical for 2D and nD code.

The padding of a volume to avoid neighbour issues was done for convenience. It can and should be removed, and checking for out-of-bounds access is a mod (%) calculation away.

Indeed, when it comes to performance, and optimising loop order for cache-efficient memory access, you can *only* do it correctly if you abstract away the axes. If you write `for x in range(im.shape[0]): for y in range(im.shape[1]): ...` and now want to adapt your code to a Fortran-ordered array, it's very hard. If your logic is instead something like `for ax in range(im.ndim): for x in range(im.shape[ax]): ...`, you can replace the `range(ndim)` with `sorted(range(ndim), key=lambda i: im.strides[i])`.

In short, nD opens up *more* avenues for optimisation, not fewer.

When it comes to maintainability:

1 - Branching code paths (2D/3D) are not more maintainable, they are less maintainable. If something needs to be changed in the future (say, adding tolerance to the neighbour check, or adding an arbitrary predicate for adding a neighbour), it now needs to be changed in several places.

2 - 2D code, and especially 3D code like the one you linked, is locally simple but globally complex. It's hard to know whether you've missed a direction. It's hard to adapt to different neighbourhoods.

3 - And it results in many more lines of code (you would have 26 if statements for full connectivity in 3D), among which errors can hide.

In many cases, changing the code to nD makes it *more* readable and maintainable, because magic values get replaced by a principled iteration over the possible axes/axis combinations. See for example my conversion of the Hessian filter to nD (which resulted in a nice speedup, btw!).

And who can forget Vighnesh's nD RAG code? Compare the 2D/3D approach to the nD one. I find it quite sobering to think that this code might never have existed if I hadn't prodded Vighnesh to go for nD.
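To make point 3 concrete: the full-connectivity neighbourhood can be generated for any number of dimensions instead of being hand-written as 8 (2D) or 26 (3D) if statements. A minimal sketch (the helper name is made up; this is not code from any of the PRs mentioned here):

```python
import itertools
import numpy as np

def neighbour_offsets(ndim, connectivity=None):
    """Offsets of all neighbours of a pixel in an ndim-dimensional image.

    connectivity follows the usual convention: 1 gives only the orthogonal
    neighbours, ndim gives the full 3**ndim - 1 neighbourhood.
    """
    if connectivity is None:
        connectivity = ndim
    offsets = []
    for offset in itertools.product((-1, 0, 1), repeat=ndim):
        # skip the centre pixel and offsets beyond the requested connectivity
        if any(offset) and sum(map(abs, offset)) <= connectivity:
            offsets.append(offset)
    return np.array(offsets)

# 8 neighbours in 2D, 26 in 3D, and the same loop keeps working in 4D+
assert len(neighbour_offsets(2)) == 8
assert len(neighbour_offsets(3)) == 26
```

The same few lines serve 2D, 3D, and the 5D dask-backed case above, which is exactly the maintainability argument.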
I argue that nD code is a bit like 0-indexing vs 1-indexing: at first 0-indexing is weirder, less intuitive, and you make more off-by-one errors because you're not used to thinking about it. But after just a bit of practice, it becomes intuitive, you make *fewer* errors because you are not adding and subtracting 1 all over the place, and the code itself is more readable, for the same reason.

I'll summarise my current position as follows:

- Yes, it can be a barrier to getting PRs merged, and we should think of ways to lower that barrier.
- Mark's suggestion to focus on the API is excellent and in many cases, that might mollify me. =)
- Josh's suggestion that 3D/2D is good enough for almost all uses is correct. My order of preference for code is nD + anisotropy > nD > 3D (and 2D with a flat, length-1 axis) >> 3D/2D (branching code paths) >> 2D.
- Although many PRs got stopped in their tracks by an nD requirement, many were improved dramatically by it, for example `register_translation`. We should exercise judgement when it comes to deciding just how much of a burden it is to convert code to nD, and we should always at least ask ourselves the question. I need to learn to adjust that estimate for the contributor rather than for myself. ;) But,
- We can also think of lowering the burden for the contributor. That is, absolutely I need to write a guide for working with nD code, based on all the 2D -> nD conversions we've had in the past.
- In many cases, I do think it's preferable to not have an algorithm in skimage than to have it be 2D-only. (I think this of SEEDS, but not of flood_fill.)

Juan.

PS: Could someone review and hopefully approve https://github.com/scikit-image/scikit-image/pull/2043? =P

On Wed, 7 Aug 2019, at 8:01 PM, Josh Warner wrote:

Often with these sorts of discussions I end up taking the opposite side, regardless of my actual position, to make sure we look at all of the important aspects critically. Historically I've been very positive on 3D+ support, even before Juan joined the party, but I'm going to examine this from a pragmatic angle.

As someone who works in medical imaging, I am generally very keen on having 3D implementations (and, in a perfect world, these would support anisotropic spacing along at least one dimension, because slice thicknesses are very often not the same as the in-plane spacing, and reslicing anisotropic volumes is expensive and tricky in terms of the interpolation used). Time isn't a true 4th dimension, and the vast majority of data above 3D is multispectral, rather than volumetric in higher dimensions. I think for the majority of users, 3D anisotropic implementations would be sufficient; how many people come to us with actual 4D+ data? When we're talking N-D from an algorithmic standpoint, the data needs to have N spatial dimensions for this abstraction to make real sense. I think this is actually quite rare above N=3, though I could be wrong (I'm interested in any examples you all are aware of which use 4D+ spatial data). The reason I bring this up is that if we hypothetically set the bar at some point, say 3D, we have a lot more options than for N-D in terms of clean, JIT-friendly, and maintainer-friendly code.

This is because there is actually a massive difference between an algorithm written with nested loops in 2D and 3D separately, versus a truly N-D version of the same algorithm. This is not a quick fix; it's usually an entirely different approach.
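To give a feel for that difference, here is a toy comparison (an illustrative sketch only, not code from either implementation discussed in this thread): a 3x3 local mean written with explicit 2D loops, next to a shape-agnostic version driven by offset tuples. Real nD rewrites, such as the raveled-index approach described below, usually diverge far more than this.

```python
import numpy as np
from itertools import product

def local_mean_2d(image):
    """3x3 local mean with explicit 2D loops: easy to read and to JIT,
    but locked to two dimensions."""
    out = np.zeros(image.shape, dtype=float)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            total = count = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += image[rr, cc]
                        count += 1
            out[r, c] = total / count
    return out

def local_mean_nd(image):
    """The same filter with no named axes: offsets become tuples and the
    bounds check loops over dimensions."""
    out = np.zeros(image.shape, dtype=float)
    offsets = list(product((-1, 0, 1), repeat=image.ndim))
    for idx in np.ndindex(*image.shape):
        total = count = 0
        for off in offsets:
            nbr = tuple(i + o for i, o in zip(idx, off))
            if all(0 <= n < s for n, s in zip(nbr, image.shape)):
                total += image[nbr]
                count += 1
        out[idx] = total / count
    return out
```

Even in this toy case the nD version forces a different way of thinking (offset tuples instead of named axes); for algorithms that also need queues, heaps, or region bookkeeping, the restructuring goes much deeper, which is the point being made here.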
That difference dramatically narrows the pool of potential contributors who could bring such an algorithm to us, and even asking the question will generally result in abandoned PRs and, like it or not, the perception that if a potential contributor can't (or doesn't think they can) manage things in N-D, they might as well not try. Do we find this acceptable?

The performance and future-support implications are also nontrivial. The theoretical nested-loop code I mention above is very easy to understand, for reviewers and future maintainers alike, and is already in an ideal form for Numba to JIT. As Numba continues to improve, the nested loops with branching paths for 2D and 3D might well beat a complex N-D implementation written in Cython (which may have no real pathway to JITting). At SciPy I heard Pandas is considering Numba as a hard dependency. We should probably be paying close attention to that.

An example from direct experience: I had flood fill written in Cython for 2D over 3 years ago (see https://gist.github.com/JDWarner/83808ce4c9374a242905). The 3D case is trivial to add as a separate path, but we wanted this to be N-D. I didn't see an ideal approach to do so, so it went on the back burner until I saw Lars' work on #3022. Borrowing and building on that framework, years later we got N-D flood fill in #3245. Along the way there were unintuitive pitfalls to the "ravel approach", as we're calling it in #4088, which were difficult for me to find and likely a big challenge for reviewers to catch. This approach also requires the image to be padded by 1 and then cropped back down, as a way to know where the borders are, since we're working on the raveled array. Copying the input and then making it even larger is bad practice for large images and may cause memory issues (nested-loop implementations, in contrast, require neither copies nor padding). So yes, flood fill is now N-D and there is one code path, but it took over 2 additional years... and it would probably be more performant in 2D/3D as nested loops, more maintainable, and better future-proofed (Numba). As nice as it is, I do not think this algorithm as written could be extracted from Cython and handed to Numba. Would it have been better to have robust 2D and 3D flood fills years earlier?

Food for thought.

Josh
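For readers who haven't met it, the "ravel approach" goes roughly like this: pad the image by one pixel, flatten it, and precompute the raveled index offsets of each neighbour, so the algorithm itself only ever walks a 1D array with no per-axis bounds checks. A much-simplified sketch of the idea, with made-up helper names and none of the care of the real #3245 implementation:

```python
import numpy as np
from itertools import product
from collections import deque

def _raveled_offsets(shape):
    """Raveled-index offsets of the full-connectivity neighbours for a
    C-ordered array of the given (already padded) shape."""
    # elements skipped per unit step along each axis, e.g. (n, 1) for (m, n)
    strides = np.cumprod((1,) + shape[:0:-1])[::-1]
    return [int(np.dot(off, strides))
            for off in product((-1, 0, 1), repeat=len(shape))
            if any(off)]

def flood_fill_nd(image, seed, new_value):
    """Toy nD flood fill on the raveled, padded image (sketch only)."""
    # pad with a sentinel the fill can never match, so neighbours of border
    # pixels are harmless without any per-axis bounds checks
    padded = np.pad(image.astype(float), 1, constant_values=np.inf)
    flat = padded.ravel()  # a view: writing to flat writes to padded
    seed_flat = np.ravel_multi_index(tuple(s + 1 for s in seed), padded.shape)
    target = flat[seed_flat]
    offsets = _raveled_offsets(padded.shape)

    queue = deque([seed_flat])
    visited = {seed_flat}
    while queue:
        idx = queue.popleft()
        flat[idx] = new_value
        for off in offsets:
            nbr = idx + off
            if nbr not in visited and flat[nbr] == target:
                visited.add(nbr)
                queue.append(nbr)
    # crop the padding back off
    return padded[(slice(1, -1),) * image.ndim].astype(image.dtype)
```

The `np.pad` copy at the top is exactly the extra memory cost described above; the payoff is that the traversal itself never mentions an axis.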
On Wed, Aug 7, 2019 at 7:49 AM Mark Harfouche <mark.harfouche@gmail.com> wrote:

Thank you all for responding.

In general, I agree that the "easy ND contributions" should be addressed ASAP.

That said, it seems that there is consensus that the ND conversation can be moved later in the PR cycle, when most of the other issues have been addressed: docstrings, testing for the 2D case, API choices, "fit", examples, gallery.

I think many of you rightfully pointed out that good code is often generalizable to ND quite readily. My hope is that by delaying the ND conversion, the PR will be in a really good state, so that it will be an "easy fix".

Lars, I will have to have a detailed look at your PR.

On Wed, Aug 7, 2019 at 9:48 AM Alexandre de Siqueira <alex.desiqueira@igdore.org> wrote:

Hey everyone,

> Alexandre, something that I want to emphasise right now is that I
> would like to consider the next ~12-24 months a sprint to 1.0.
> Therefore, I want to avoid as much as possible taking on new
> technical debt and focus on fixing/cleaning everything we have so
> far, according to our (as yet not ratified) values. [1]

We're on the same page, Juan :) Thinking about that, it would be interesting to have something in our model close to Debian Linux's release cycle. We'd work on and receive major contributions up to a certain point (let's say, the next 6-12 months). Then we clean and debug everything we can, not accepting new major code for that release; it'd be something like Debian's freeze cycle. But I'm digressing.

Kind regards,

Alex

On Wed, 2019-08-07 at 00:24 -0500, Juan Nunez-Iglesias wrote:
> Thanks Mark for raising this and Lars and Alex for weighing in.
>
> On Mon, 5 Aug 2019, at 5:40 AM, Alexandre de Siqueira wrote:
> > > > I think what is important is to keep the API forward-looking
> > > > for ND. We can do things like accept tuples for parameters,
> > > > instead of "x_center", "y_center", or "r_center", "c_center".
> > > > We can also just throw errors when people pass in
> > > > higher-dimensional images, saying that it isn't implemented
> > > > and contributions are welcome.
> > >
> > > I think this is critical and really like this compromise. Don't
> > > do more work than is actually needed while not making future
> > > work harder. Equally critical to me is that whether a function
> > > supports ND or only 2D is clearly documented. Maybe after a
> > > parameter's type declaration in the docstring?
> >
> > I guess these are all good solutions for us right now. I'm sure
> > that we'll break, fix and clean lots of stuff/code/API until we
> > reach 1.0 (and that is okay™). These problems will be tackled
> > naturally on the way.
>
> I agree with Lars, keeping the *API* nD-compatible is already a very
> good step towards keeping the Guardian of the Dimensions happy. =P
> Great suggestion, Mark!
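For concreteness, the kind of nD-forward signature and docstring being discussed in the quoted exchange above might look something like this (a hypothetical sketch; the function, parameter names, and behaviour are made up and are not an existing scikit-image API):

```python
import numpy as np

def detect_blobs(image, center=None, radius=5):
    """Hypothetical sketch of an nD-forward API.

    Parameters
    ----------
    image : (M, N[, P, ...]) ndarray
        Input image. Currently only 2D is implemented, but the API
        accepts any dimensionality.
    center : tuple of int, optional
        Seed coordinates, one value per axis of `image` (instead of
        separate `r_center`/`c_center` arguments). Defaults to the
        image centre.
    radius : int, optional
        Radius of the search window, in pixels.
    """
    if center is None:
        center = tuple(s // 2 for s in image.shape)
    if len(center) != image.ndim:
        raise ValueError(
            f"center must have one coordinate per axis ({image.ndim}), "
            f"got {len(center)}"
        )
    if image.ndim != 2:
        # fail loudly, but leave the door open for an nD implementation
        raise NotImplementedError(
            "detect_blobs is currently implemented for 2D images only; "
            "contributions extending it to nD are welcome!"
        )
    # ... 2D implementation goes here ...
```

The "(M, N[, P, ...]) ndarray" line also records which dimensionalities are supported right after the parameter's type, along the lines of the docstring suggestion quoted above.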
> Alexandre, something that I want to emphasise right now is that I
> would like to consider the next ~12-24 months a sprint to 1.0.
> Therefore, I want to avoid as much as possible taking on new
> technical debt and focus on fixing/cleaning everything we have so
> far, according to our (as yet not ratified) values. [1]
>
> > > > Optimizing for the different use cases is just tricky, and
> > > > won't be done correctly if contributors are asked to include
> > > > something they have no immediate need for.
> >
> > We can help in some of these cases. If we (core group) can't, we
> > can: 1. discuss if the algorithm is ready as-is to be included and
> > upgraded from there; 2. ask for external community help.
>
> My main idea is that I am happy to jump in and make things nD at the
> PR stage. However, I don't have the bandwidth to do this for *all*
> contributions and help is always appreciated. This is why it's
> important to be on the same page.
>
> Additionally, as I raised in the SEEDS PR [2], a contribution might
> be worthwhile in nD but not worthwhile in 2D, because 2D
> implementations in Python already exist. (And in this case they have
> a good chance of being faster than ours!) We are having enough
> trouble keeping up with the maintenance burden as it is, so the
> value of new contributions to the community should be weighed
> against the maintenance burden.
>
> It might be time to revisit the idea of skimage-contrib, a repo in
> which functions that don't quite meet this benchmark can live, until
> they are ready to graduate.
>
> > > In that vein, wasn't there a discussion about writing a guide on
> > > ND-algorithms (e.g. the raveled approach used in watershed,
> > > local_maxima, flood_fill)? Was there any progress on this? I
> > > think this could be really useful to the community and a good
> > > resource for new contributors regardless of how skimage's policy
> > > on this turns out.
> >
> > +1. Or even, contributors aiming to enhance our 3D/nD
> > possibilities, only :)
>
> Yes, I mentioned that I should write a guide and here we are, no
> further along. Apologies for that. I'll prioritise this work for the
> summer/fall. If we have an easy guide, including pointing to PRs or
> commits that convert code to nD, that will make things a lot easier.
>
> Juan.

----------------------------------------------------
Dr. Alexandre de Siqueira
Berkeley Institute for Data Science - BIDS
190 Doe Library
University of California, Berkeley
Berkeley, CA 94720
United States
Lattes CV: 3936721630855880
ORCID: 0000-0003-1320-4347
Github: alexdesiqueira
--------------------------------------------------

_______________________________________________
scikit-image mailing list -- scikit-image@python.org
To unsubscribe send an email to scikit-image-leave@python.org