Hey all,

Thanks for the nice discussion. As far as the nD discussion goes, it seems:

1. The inclusion of new algorithms should be weighed in terms of their availability in other popular image processing libraries. Can scikit-image provide any enhancement?
2. If so, a 2D-only algorithm should not necessarily block the contribution.
3. Cleanups and enhancements should proceed while motivating the need for an API that is compatible with nD extensions.
4. Rethink the nD-ification of the algorithm when most of the other issues have been addressed.

There seems to be another discussion brewing about how to experiment with different memory models and coding styles that might be (in)compatible with array types such as dask and cupy. Somebody should probably open a separate thread on why we should invest in numba.

Mark

On Thu, Aug 8, 2019 at 1:51 AM Juan Nunez-Iglesias <jni@fastmail.com> wrote:
I have actual work to do but Josh activated this instinct <https://xkcd.com/386/> within me. =P I also wouldn't be much of a Guardian of the Dimensions if I didn't respond to a full frontal assault on said dimensions. =P
I agree 100% that most data is either 2D or 3D. *Sometimes* we have 3D movies (3D + t), and if the time resolution is fine enough, then a 4D algorithm is advantageous. But that's rare. However, such 4D data is often kept in a 4D array (and sometimes even 5D, where the 5th dimension is samples), and it's easier to write
ndimage.gaussian_filter(arr5d, sigma=(0, 0, 1, 5, 5))
than to do two nested for-loops. This is doubly true now that we have dask arrays.
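For illustration, here's a rough sketch of the loops that one-liner replaces (the array shape and sigma values are made up, but the pattern is the point):

    import numpy as np
    from scipy import ndimage

    # Illustrative 5D array: (samples, time, z, rows, cols).
    arr5d = np.random.rand(2, 10, 30, 64, 64)

    # The one-liner above, versus the loops it replaces. The results match
    # because sigma=0 along the first two axes means no smoothing there.
    smoothed = np.empty_like(arr5d)
    for s in range(arr5d.shape[0]):
        for t in range(arr5d.shape[1]):
            smoothed[s, t] = ndimage.gaussian_filter(arr5d[s, t], sigma=(1, 5, 5))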
But anyway, I accept that 2D+3D is good enough, in terms of utility.
What I absolutely don't accept is that either maintainability or performance needs to suffer to go nD.
My talk at SciPy about getting maximum performance out of Numba was all based on an nD implementation. All the same issues arise in 2D/3D and nD. The main issue with Numba or Cython is dealing with complex data structures such as heaps, queues, priority queues, etc. Those issues are identical for 2D and nD code.
The padding of a volume to avoid neighbour issues was done for convenience. It can and should be removed, and checking for out-of-bounds access is a mod (%) calculation away. Indeed, when it comes to performance, and optimising loop order for cache-efficient memory access, you can *only* do it correctly if you abstract away the axes. If you write for x in range(im.shape[0]): for y in range(im.shape[1]): ..., and now you want to adapt your code to a Fortran-ordered array, it's very hard. If your logic is instead something like for ax in range(im.ndim): for x in range(im.shape[ax]): ..., you can replace the range(ndim) with sorted(range(ndim), key=lambda i: im.strides[i]).
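To make that concrete, here's a minimal sketch (the function name is illustrative, not from skimage) of picking a cache-friendly loop order from the strides:

    import numpy as np

    # Sort axes by stride, largest first, so the innermost loop walks the
    # most contiguous axis, regardless of C or Fortran memory order.
    def axes_by_stride(image):
        return sorted(range(image.ndim), key=lambda i: -image.strides[i])

    c_order = np.zeros((4, 5, 6))
    f_order = np.asfortranarray(c_order)
    print(axes_by_stride(c_order))  # [0, 1, 2] -> innermost loop over axis 2
    print(axes_by_stride(f_order))  # [2, 1, 0] -> innermost loop over axis 0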
In short, nD opens up *more* avenues for optimisation, not fewer.
When it comes to maintainability:
1. Branching code paths (2D/3D) are not more maintainable, they are less maintainable. If something needs to be changed in the future (say, adding tolerance to the neighbour check, or adding an arbitrary predicate for adding a neighbour), it now needs to be changed in several places.
2. 2D code, and especially 3D code like the one you linked, is locally simple but globally complex. It's hard to know if you've missed a direction. It's hard to adapt to different neighbourhoods.
3. It results in many more lines of code (you would have 26 if statements for full connectivity in 3D; see the sketch just below), among which errors can hide.
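For concreteness, a minimal sketch (not from any existing PR) of the nD alternative to those 26 if statements: one iteration over offset combinations covers any dimensionality.

    import itertools
    import numpy as np

    # All full-connectivity neighbour offsets around a pixel in ndim
    # dimensions: 8 in 2D, 26 in 3D, 3**ndim - 1 in general.
    def neighbour_offsets(ndim):
        offsets = [d for d in itertools.product((-1, 0, 1), repeat=ndim)
                   if any(d)]  # drop the all-zeros centre offset
        return np.array(offsets)

    print(len(neighbour_offsets(2)))  # 8
    print(len(neighbour_offsets(3)))  # 26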
In many cases, changing the code to nD makes it *more* readable and maintainable, because magic values get replaced by a principled iteration over the possible axes/axis combinations. See for example my conversion of Hessian filter to nD:
https://github.com/scikit-image/scikit-image/pull/2194/files#diff-70ef54b6b6...
(which resulted in a nice speedup btw!)
And who can forget Vighnesh's nD RAG code? Compare the 2D/3D approach to the nD one:
https://github.com/elegant-scipy/elegant-scipy/blob/master/markdown/ch3.mark...
I find it quite sobering to think that this code might never have existed if I hadn't prodded Vighnesh to go for nD.
I argue that nD code is a bit like 0-indexing vs 1-indexing: at first 0-indexing is weirder, less intuitive, and you make more off-by-one errors because you're not used to thinking about it. But after just a bit of practice, it becomes intuitive, you make *fewer* errors because you are not adding and subtracting 1 all over the place, and the code itself is more readable, for the same reason.
I'll summarise my current position as follows:
- Yes, it can be a barrier to getting PRs merged, and we should think of ways to lower that barrier.
- Mark's suggestion to focus on the API is excellent, and in many cases that might mollify me. =)
- Josh's suggestion that 3D/2D is good enough for almost all uses is correct. My order of preference for code is: nD + anisotropy > nD > 3D (and 2D with a flat, length-1 axis) >> 3D/2D (branching code paths) >> 2D.
- Although many PRs got stopped in their tracks by an nD requirement, many were improved dramatically by it, for example `register_translation`. We should exercise judgement when it comes to deciding just how much of a burden it is to convert code to nD, and we should always at least ask ourselves the question. I need to learn to adjust that estimate for the contributor rather than for myself. ;) But,
- We can also think of lowering the burden for the contributor. That is, I absolutely need to write a guide for working with nD code, based on all the 2D -> nD conversions we've had in the past.
- In many cases, I do think it's preferable to not have an algorithm in skimage than to have it be 2D-only. (I think this of SEEDS, but not of flood_fill.)
Juan.
PS: Could someone review and hopefully approve https://github.com/scikit-image/scikit-image/pull/2043? =P
On Wed, 7 Aug 2019, at 8:01 PM, Josh Warner wrote:
Often with these sorts of discussions I end up taking the opposite side, regardless of my actual position, to make sure we look at all of the important aspects critically. Historically I've been very positive on 3D+ support, even before Juan joined the party, but I'm going to examine this from a pragmatic angle.
As someone who works in medical imaging, I am generally very keen on having 3D implementations (and, in a perfect world, these would support anisotropic spacing along at least one dimension, because slice thicknesses are very often not the same as the in-plane spacing, and reslicing volumes to isotropic spacing is expensive and tricky in terms of the interpolation used). Time isn't a true 4th dimension, and the vast majority of data above 3D is multispectral, rather than volumetric in higher dimensions. I think for the majority of users, 3D anisotropic implementations would be sufficient; how many people come to us with actual 4D+ data?

When we're talking N-D from an algorithmic standpoint, the data needs to have N *spatial* dimensions for this abstraction to make real sense. I think this is actually quite rare above N=3, though I could be wrong (I'm interested in any examples you all are aware of which use 4D+ spatial data). The reason I bring this up is because if we hypothetically set the bar at some point, say 3D, we have a lot more options than for N-D in terms of clean, JIT- and maintainer-friendly code.
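As a sketch of what anisotropic support buys you (the spacing and sigma values here are made up for illustration):

    import numpy as np
    from scipy import ndimage

    # Hypothetical CT-like volume: slice thickness much larger than the
    # in-plane spacing.
    spacing_mm = np.array([5.0, 0.7, 0.7])
    volume = np.random.rand(40, 256, 256)

    # Convert a physical smoothing scale into per-axis sigmas in voxel
    # units, so the filter is isotropic in millimetres without having to
    # reslice the volume first.
    sigma_mm = 2.0
    smoothed = ndimage.gaussian_filter(volume, sigma=sigma_mm / spacing_mm)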
We have more options because there is actually a *massive* difference between an algorithm written with nested loops in 2D and 3D separately, versus a truly N-D version of the same algorithm. This is not a quick fix; it's usually an entirely different approach. The massive difference dramatically narrows the pool of potential contributors who could bring such an algorithm to us, and even asking the question will generally result in abandoned PRs and - like it or not - the perception that if a potential contributor can't (or doesn't think they can) manage things in N-D, they might as well not try. Do we find this acceptable?
The performance and future support implications are also nontrivial. The theoretical nested loop code I mention above is very easy to understand, for reviewers and future maintainers, and is already in an ideal form for Numba to JIT. As Numba continues to improve, the nested loops with branching paths for 2D and 3D might well beat a complex N-D implementation written in Cython (which may have no real pathway to JITting). At SciPy I heard Pandas is considering Numba as a hard dependency. We should probably be paying close attention to that.
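As a sketch of what I mean by "ideal form for Numba" (the function below is illustrative, not skimage code): plain nested loops with no fancy data structures compile to fast machine code with a one-line decorator.

    import numpy as np
    from numba import njit

    @njit
    def local_maxima_2d(image):
        # Mark interior pixels that are strictly greater than all 8 neighbours.
        out = np.zeros(image.shape, dtype=np.bool_)
        for r in range(1, image.shape[0] - 1):
            for c in range(1, image.shape[1] - 1):
                is_max = True
                for dr in range(-1, 2):
                    for dc in range(-1, 2):
                        if (dr != 0 or dc != 0) and image[r + dr, c + dc] >= image[r, c]:
                            is_max = False
                out[r, c] = is_max
        return out

    peaks = local_maxima_2d(np.random.rand(100, 100))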
An example from direct experience: I had flood fill written in Cython for 2D over 3 years ago (see https://gist.github.com/JDWarner/83808ce4c9374a242905). The 3D case is trivial to add as a separate path, but we wanted this to be N-D. I didn't see an ideal approach to do so, thus it went on the back burner until I saw Lars' work on #3022 <https://github.com/scikit-image/scikit-image/pull/3022>. Borrowing and building on that framework, years later we got N-D flood fill in #3245 <https://github.com/scikit-image/scikit-image/pull/3245>. Along the way there were unintuitive pitfalls in the "ravel approach", as we're calling it in #4088 <https://github.com/scikit-image/scikit-image/issues/4088>, which were difficult for me to find, and likely a big challenge for reviewers to catch. This approach also requires the image be padded by 1, then cropped back down, as a way to know where the borders are, since we're working on the raveled array. Copying the input and then making it even larger is bad practice for large images and may cause memory issues (nested loop implementations, in contrast, require neither copies nor padding). So yes, flood fill is now N-D and there is one code path, but it took over 2 additional years... and would probably be more performant in 2D/3D as nested loops, more maintainable, as well as better future-proofed (Numba). As nice as it is, I do not think this algorithm as written could be extracted from Cython and handed to Numba. Would it have been better to have robust 2D and 3D flood fills years earlier?
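For readers following along, here's a rough sketch of the core idea behind the ravel approach (a simplification, not the actual #3245 code):

    import numpy as np

    # On a raveled (flattened) array, the face-connected neighbours of
    # pixel i sit at fixed integer offsets, whatever the dimensionality.
    # The 1-pixel pad mentioned above is what keeps i + offset from
    # wrapping across image borders.
    def raveled_neighbour_offsets(shape):
        strides = np.cumprod((1,) + tuple(shape[::-1][:-1]))[::-1]
        return np.concatenate([strides, -strides])

    image = np.pad(np.random.rand(4, 5), 1)
    flat = image.ravel()
    offsets = raveled_neighbour_offsets(image.shape)  # [7, 1, -7, -1] for shape (6, 7)
    # flat[i + offsets] are the face-connected neighbours of pixel i.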
Food for thought.
Josh
On Wed, Aug 7, 2019 at 7:49 AM Mark Harfouche <mark.harfouche@gmail.com> wrote:
Thank you all for responding.
In general, I agree that the "easy ND contributions" should be addressed ASAP.
That said, it seems there is consensus that the ND conversation can be moved later in the PR cycle, once most of the other issues have been addressed: docstrings, testing for the 2D case, API choices, "fit", examples, gallery.
I think many of you rightfully pointed out that good code is often generalizable to ND quite readily. My hope is that by delaying the ND conversion, the PR will already be in a really good state, so that the conversion will be an "easy fix".
Lars, I will have to have a detailed look at your PR.
On Wed, Aug 7, 2019 at 9:48 AM Alexandre de Siqueira < alex.desiqueira@igdore.org> wrote:
Hey everyone,
Alexandre, something that I want to emphasise right now is that I would like to consider the next ~12-24 months a sprint to 1.0. Therefore, I want to avoid as much as possible taking on new technical debt and focus on fixing/cleaning everything we have so far, according to our (as yet not ratified) values. [1]
We're on the same page, Juan :) Thinking about that, it would be interesting to have something in our model close to Debian's release cycle. We'd work on and receive major contributions up to a certain point (let's say, the next 6-12 months). Then, we'd clean and debug everything we can, not accepting new major code for that release; it'd be something like Debian's freeze period. But I'm digressing.
Kind regards,
Alex
On Wed, 2019-08-07 at 00:24 -0500, Juan Nunez-Iglesias wrote:
Thanks Mark for raising this and Lars and Alex for weighing in.
On Mon, 5 Aug 2019, at 5:40 AM, Alexandre de Siqueira wrote:
I think what is important is to keep the API forward-looking for ND. We can do things like accept tuples for parameters, instead of "x_center", "y_center", or "r_center", "c_center". We can also just throw errors when people pass in higher-dimensional images, saying that it isn't implemented and contributions are welcome.
I think this is critical and really like this compromise. Don't do more work than is actually needed while not making future work harder. Equally critical to me is that whether a function supports ND or only 2D is clearly documented. Maybe after a parameter's type declaration in the docstring?
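A hypothetical sketch of what that might look like (the names and signature are illustrative, not an actual skimage function):

    # 2D-only keyword pairs lock the API into two dimensions:
    def draw_disk_2d(image, r_center, c_center, radius):
        ...

    # A coordinate tuple keeps the same API valid in nD, even if the
    # implementation only handles 2D for now:
    def draw_disk(image, center, radius):
        """Draw a disk.

        center : tuple of int
            Coordinates of the disk centre. Currently 2D only.
        """
        if image.ndim != 2:
            raise NotImplementedError(
                "draw_disk currently supports 2D images only; "
                "contributions welcome!"
            )
        ...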
I guess these are all good solutions for us right now. I'm sure that we'll break, fix and clean lots of stuff/code/API until we reach 1.0 (and that is okay™). These problems will be tackled naturally along the way.
I agree with Lars, keeping the *API* nD-compatible is already a very good step towards keeping the Guardian of the Dimensions happy. =P Great suggestion, Mark!
Alexandre, something that I want to emphasise right now is that I would like to consider the next ~12-24 months a sprint to 1.0. Therefore, I want to avoid as much as possible taking on new technical debt and focus on fixing/cleaning everything we have so far, according to our (as yet not ratified) values. [1]
Optimizing for the different use cases is just tricky, and won't be done correctly if contributors are asked to include something they have no immediate need for.
We can help in some of these cases. If we (core group) can't, we can: 1. discuss if the algorithm is ready as-is to be included and upgraded from there; 2. ask for external community help.
My main idea is that I am happy to jump in and make things nD at the PR stage. However, I don't have the bandwidth to do this for *all* contributions and help is always appreciated. This is why it's important to be on the same page.
Additionally, as I raised in the SEEDS PR [2], a contribution might be worthwhile in nD but not worthwhile in 2D, because 2D implementations in Python already exist. (And in this case they have a good chance of being faster than ours!) We are having enough trouble keeping up with the maintenance burden as it is, so the value of new contributions to the community should be weighed against the maintenance burden.
It might be time to revisit the idea of skimage-contrib, a repo in which functions that don't quite meet this benchmark can live, until they are ready to graduate.
In that vein, wasn't there a discussion about writing a guide on ND-algorithms (e.g. the raveled approach used in watershed, local_maxima, flood_fill)? Was there any progress on this? I think this could be really useful to the community and a good resource for new contributors regardless of how skimage's policy on this turns out.
+1. Or even just for contributors aiming to enhance our 3D/nD capabilities :)
Yes, I mentioned that I should write a guide and here we are, no further along. Apologies for that. I'll prioritise this work for the summer/fall. If we have an easy guide, including pointing to PRs or commits that convert code to nD, that will make things a lot easier.
Juan.
.. [1] https://github.com/scikit-image/scikit-image/pull/3585
.. [2] https://github.com/scikit-image/scikit-image/pull/1557
--
--------------------------------------------------
Dr. Alexandre de Siqueira
Berkeley Institute for Data Science - BIDS
190 Doe Library, University of California, Berkeley
Berkeley, CA 94720, United States
Lattes CV: 3936721630855880
ORCID: 0000-0003-1320-4347
Github: alexdesiqueira
--------------------------------------------------