Repository for GSoC project
Hi all,

This year I am mentoring Aman for one of the GSoC projects we have underway, "Rewriting ndimage in Cython." By its very nature it doesn't conform very well to the "many small pull requests" model: from the point of view of scipy, things are going to be broken up until almost the very last commit. I am not sure what the best way to set up a collaborative code development environment would be, and so am asking for the collective wisdom to help guide us.

Aman could simply create one ginormous pull request that will grow, and grow, and not be merged until everything was ready. I don't like this idea too much, as it is going to eventually be a confusing mess, and I think it would also make it difficult for others than Aman (that would mostly be me) to contribute code.

I think we could also use a branch, either on my fork of scipy or on Aman's, as the repository on which development would happen, and against which PRs would be created, and once completed send a single PR to the main scipy repo. This may work, but I don't like it much either.

What probably makes more sense is to create a new branch **in the main scipy repository**, and have PRs sent and merged against that branch, which would eventually be merged with master upon completion. NumPy seems to have a couple such experimental branches ('with_maskna' and 'enable_separate_by_default'), although there is none in SciPy that I see. This would also allow us to keep the project in a controlled environment, even if by the end of the summer not every single bit of ndimage has been ported.

If this third path is really the preferred way of doing things, I could probably set things up myself (Ralf gave me commit rights when I became a mentor for this project), but I'd like to hear what others think, before abusing my powers.

Thanks!

Jaime

--
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him with his plans for world domination.
On Mon, May 25, 2015 at 3:37 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
[...]
I don't see much difference between options two and three for working with github since it's easy to create and merge pull requests across forks.

One consideration is whether you want to trigger the TravisCI runs, which I guess would be automatic with a branch in the main scipy repo.

Another consideration is whether this triggers a large amount of notifications for scipy developers that are subscribed to changes, PRs and issues. (And if it's easy for me to filter those out visually in gmail.)

In statsmodels all the extra branches in the main repo are stale or stalled and are waiting for someone to pick up. Actual development is in developer forks.

(I'm not involved enough in scipy development to have an opinion.)

Josef
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-dev
I would strongly advocate not waiting for a full rewrite to try to integrate the code. I would strongly push to identify the easiest places to rewrite and to try to get code merged for these ASAP.

In my experience of project management, this is just a better way of doing things. It helps identify problems early. It helps break down the contribution into something reviewable. It helps make sure that everybody is on the same page.

My 2 euro cents,

Gaël

On Mon, May 25, 2015 at 04:10:21PM -0400, josef.pktd@gmail.com wrote:
[...]
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
On Mon, May 25, 2015 at 1:15 PM, Gael Varoquaux < gael.varoquaux@normalesup.org> wrote:
I would strongly advocate not waiting for a full rewrite to try to integrate the code. I would strongly push to identify the easiest places to rewrite and to try to get code merged for these ASAP.
That may be hard for ndimage, let me explain the rationale...

ndimage is arranged in three or four layers:

1. there's a Python layer,
2. which calls on Python functions written in C,
3. which calls on C functions,
4. which use a few low level C "objects", which also have some hierarchy among them.

We could simply start translating at level 2, and make our way down to level 4. Once completed, the stated goal of rewriting ndimage in Cython would have been achieved, but I am afraid that it wouldn't help at all with the goal of improving maintainability: levels 3 and 4 above are a mess of poorly documented code, with little regard for separation of concerns, and that is not going to be solved by translating the same code structure to another language.

The plan we are trying to follow is:

1. Start at level 4, rewriting the lowest of the lowest C "objects." This will include not just translating to Cython, but better encapsulation of internal functionality, and providing a cleaner API for upstream use.
2. As soon as enough work has been done at level 4 that it can be bubbled all the way up to level 1, go ahead with the bubbling up, for 1 or 2 level 1 functions at most.
3. Validate that things are working correctly, benchmark against the current implementation, adapt as needed.
4. Go back to step 1, and keep working on the level 4 rewrite, until there is a new chance to bubble the changes up.
5. Eventually, once level 4 has been completely rewritten, a more top-down approach can probably be followed.

So in a way we are indeed going to try to integrate things ASAP, only "soon" is going to be later than one may otherwise expect.

Now that I have explained it in writing, I guess we could try to follow this approach directly in the current ndimage repository, keeping the old code path alive until it was no longer needed...

Jaime
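A toy sketch of that bottom-up idea, written in plain Python/NumPy for brevity where the real work would be Cython. All names here (`LineIterator`, `shift1d`) are hypothetical illustrations, not actual ndimage internals: a small level-4 "object" with an encapsulated API, plus one level-1 function bubbled up on top of it.

```python
import numpy as np

class LineIterator:
    """Hypothetical level-4 object: yield the 1-D lines of an array along
    one axis, hiding the stride bookkeeping the current C layer does."""

    def __init__(self, array, axis):
        # Move the target axis last, then treat the data as a stack of lines.
        moved = np.moveaxis(np.asarray(array), axis, -1)
        self._lines = moved.reshape(-1, moved.shape[-1])

    def __iter__(self):
        return iter(self._lines)

def shift1d(array, offset, axis=0):
    """Hypothetical level-1 function bubbled up onto the new object:
    integer shift along `axis`, filling vacated positions with zeros.
    Assumes abs(offset) is smaller than the size along `axis`."""
    array = np.asarray(array)
    out_lines = []
    for line in LineIterator(array, axis):
        shifted = np.zeros_like(line)
        if offset >= 0:
            shifted[offset:] = line[:line.size - offset]
        else:
            shifted[:line.size + offset] = line[-offset:]
        out_lines.append(shifted)
    # Reassemble the processed lines into the original shape.
    moved_shape = np.moveaxis(array, axis, -1).shape
    out = np.asarray(out_lines).reshape(moved_shape)
    return np.moveaxis(out, -1, axis)
```

The point of the sketch is the separation of concerns: the iterator owns the traversal, and the public function only expresses the per-line operation.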
Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
ndimage is arranged in three or four layers:
1. there's a Python layer, 2. which calls on Python functions written in C, 3. which calls on C functions, 4. which use a few low level C "objects", which also have some hierarchy among them.
The plan we are trying to follow is:
1. Start at level 4, rewriting the lowest of the lowest C "objects." This will include not just translating to Cython, but better encapsulation of internal functionality, and providing a cleaner API for upstream use.
o_O

How shall I say this in a polite way?

"If this was an exam, I would have brought down the hammer and failed you."

But that is not very polite, so perhaps this:

"Please reconsider?"

Sturla
On Thu, May 28, 2015 at 1:27 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
[...]
You're only quoting part of the approach. The next sentence was "As soon as enough work has been done at level 4 that it can be bubbled all the way up to level 1, go ahead with the bubbling up, for 1 or 2 level 1 functions at most."

Building some of the needed low-level infrastructure and then going straight for re-implementing a first public function in Cython seems like the right approach. It's not like you can do much in ndimage without any of the existing iterators. So replacing those with usage of standard iterators in numpy makes sense.

Ralf
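For flavor, here is what leaning on numpy's standard iteration machinery can look like. The filter below is a made-up toy, not anything from ndimage, but `np.nditer` with the `multi_index` flag covers much of the index bookkeeping that ndimage's hand-rolled C iterators currently do:

```python
import numpy as np

def add_left_neighbor(a):
    """Toy filter: each output element is the input element plus its left
    neighbor along the last axis (out-of-bounds treated as zero)."""
    a = np.asarray(a)
    out = np.zeros_like(a)
    # nditer walks every element while tracking its full index for us.
    it = np.nditer(a, flags=['multi_index'])
    for x in it:
        idx = it.multi_index
        if idx[-1] > 0:
            left = a[idx[:-1] + (idx[-1] - 1,)]
        else:
            left = 0  # boundary handling, like mode='constant' in ndimage
        out[idx] = x + left
    return out
```

In the real rewrite the loop body would be compiled Cython, but the traversal itself would no longer be bespoke C.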
Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
I would strongly advocate not waiting for a full rewrite to try to integrate the code. I would strongly push to identify the easiest places to rewrite and to try to get code merged for these ASAP.
Yes, this.

Code that will not build until the very last commit stands a chance of never being completed. And even worse, there will be no feedback along the way.

Sturla
Hi all,

Long-time list lurker, thought I might chime in on this topic quickly --

On Mon, May 25, 2015 at 3:10 PM, <josef.pktd@gmail.com> wrote:
[...]
One consideration is whether you want to trigger the TravisCI runs, which I guess would be automatic with a branch in the main scipy repo.
I think involving CI is nice (who doesn't love seeing the green check mark?). Since scipy already has all of the Travis plumbing, Jaime or Aman could develop on a branch and push to their respective fork after having "turned on" (they use the term "flick on") that branch on Travis. This strategy would remove scipy list notifications while leveraging the usefulness of CI.

The only issue is that if the work is restricted to Aman/Jaime's forks, then PR reviewing would likely have fewer eyes until the final PR into scipy.

Just a thought,

Matt
--
Matthew Gidden, Ph.D.
Postdoctoral Associate, Nuclear Engineering
The University of Wisconsin -- Madison
Ph. 225.892.3192
On Mon, May 25, 2015 at 1:10 PM, <josef.pktd@gmail.com> wrote:
[...]
Another consideration is whether this triggers a large amount of notifications for scipy developers that are subscribed to changes, PRs and issues. (and if it's easy for me to filter those out visually in gmail)
I understand the concern: I am myself often baffled by the amount of e-mail from the scipy repo that I delete without reading. That said, what you sign up for when you sign up to the SciPy github repo is to receive all notifications related to that repo, which does include ndimage... And I don't think that the fact that these changes will be part of GSoC warrants treating them in a special way. Also, given the volume of mail that SciPy already generates, this will probably be lost like tears in the rain anyway...

As a more general thing, it may make sense to come up with three letter acronyms for each of the scipy submodules, and ask that PRs identify the submodule(s) they affect as part of the title. I certainly have no problem in requiring that every PR in this project has `[NDI]` for `ndimage` as part of the description, e.g. "[NDI] ENH: blah, blah, blah..."

Jaime
On Mon, May 25, 2015 at 11:10 PM, Jaime Fernández del Río < jaime.frio@gmail.com> wrote:
[...]
Aman could simply create one ginormous pull request that will grow, and grow, and not be merged until everything was ready. I don't like this idea too much, as it is going to eventually be a confusing mess, and I think it would also make it difficult for others than Aman (that would mostly be me) to contribute code.
This is the worst option I think; it's very hard to keep track of what's already reviewed in an open PR.
I think we could also use a branch, either on my fork of scipy or on Aman's, as the repository on which development would happen, and against which PRs would be created, and once completed send a single PR to the main scipy repo. This may work, but I don't like it much either.
What probably makes more sense is to create a new branch **in the main scipy repository**, and have PRs sent and merged against that branch, which would eventually be merged with master upon completion. NumPy seems to have a couple such experimental branches ('with_maskna' and 'enable_separate_by_default'), although there is none in SciPy that I see. This would also allow us to keep the project in a controlled environment, even if by the end of the summer not every single bit of ndimage has been ported.
I'm not a fan of topic branches in the scipy repo, but given the nature of the rewrite it's probably the best alternative. So fine with me.
If this third path is really the preferred way of doing things, I could probably set things up myself (Ralf gave me commit rights when I became a mentor for this project),
That's what I get for forgetting to send the announcement in time :) Not just me by the way - decided on by all active core devs.
As a more general thing, it may make sense to come up with three letter acronyms for each of the scipy submodules, and ask that PRs identify the submodule(s) they affect as part of the title. I certainly have no problem in requiring that every PR in this project has `[NDI]` for `ndimage` as part of the description, e.g. "[NDI] ENH: blah, blah, blah..."
No reason not to type the module name out in full, and I'd like to keep the standard prefixes at the front. So I'd be more in favor of "ENH: ndimage: ...". I actually already do that sometimes. Ralf
On Mon, May 25, 2015 at 12:37 PM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
This year I am mentoring Aman for one of the GSoC projects we have underway, "Rewriting ndimage in Cython." By its very nature it doesn't conform very well to the "many small pull requests" model: from the point of view of scipy, things are going to be broken up until almost the very last commit.
If it isn't viable to incrementally replace the current module in-place (which would normally be my suggestion if at all possible...), then one alternative approach might be to create a temporary module named _ndimage2 or something, with the idea that for now it does not provide everything that ndimage does, but whatever it does provide works. And that way one can review and merge complete, tested feature-addition PRs into _ndimage2 until it reaches feature parity with ndimage, at which point the last PR is just to swap the names.

(Of course this is orthogonal to the question of where exactly the branch lives.)

-n

--
Nathaniel J. Smith -- http://vorpus.org
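A minimal sketch of how that swap could be wired. The module objects and function names below are hypothetical stand-ins (plain namespaces instead of real submodules): the public namespace prefers a function from the rewrite whenever `_ndimage2` already provides it, and falls back to the old implementation otherwise.

```python
import types

# Stand-ins for the two implementations; in scipy these would be the
# existing C-backed module and the growing Cython rewrite.
_ndimage_old = types.SimpleNamespace(
    shift=lambda a: ('old', a),
    rotate=lambda a: ('old', a),
)
_ndimage2 = types.SimpleNamespace(
    shift=lambda a: ('new', a),  # only 'shift' has been rewritten so far
)

def _pick(name):
    """Prefer the rewritten function once _ndimage2 provides it."""
    impl = _ndimage2 if hasattr(_ndimage2, name) else _ndimage_old
    return getattr(impl, name)

# The public namespace resolves each function at import time.
shift = _pick('shift')    # served by the rewrite
rotate = _pick('rotate')  # still the old code path
```

Under this scheme the final name swap Nathaniel describes amounts to deleting the fallback once parity is reached.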
participants (7)

- Gael Varoquaux
- Jaime Fernández del Río
- josef.pktd@gmail.com
- Matthew Gidden
- Nathaniel Smith
- Ralf Gommers
- Sturla Molden