On bug-fix releases and maintenance branches
Hi all,

I wanted to stir some discussion on pandas' policy on bug-fix releases and upgrading pains. First some context:

*Context part 1*: Currently we do not use maintenance branches for bugfix releases, and we actually also do not *really* do bugfix releases. We just develop further on master, and try not to merge breaking changes in the first weeks/months, so we can do a minor kind of bug-fix release (but usually also with a lot of new features). But we don't, for example, backport fixes of regressions if they are fixed after master is pointing to the next major release.

*Context part 2*: pandas is not yet that stable, in the sense that there are still quite a few breaking changes in each release. I am not arguing against making these breaking changes, as some of them are really needed to clean up the API (although there are also arguments for that, but I think that is another discussion). The consequence is that updating your pandas version is not always that pleasant.

Sidenote: I do not have much experience with using pandas in a larger company or in larger codebases that need to be upgraded, only with my own code for my PhD. So it is difficult for me to judge how much of a problem this is, or whether bug-fix releases would help.

Questions:

- What are other people's experiences with upgrading pandas? And would more bug-fix releases actually ease the upgrading?
- Do we want to do more bug-fix releases?
- Having a maintenance branch and backporting fixes is extra work. Would we be able to handle this? Would it be worth the effort? (It has been mentioned before, but I think the main point raised was lack of manpower to maintain separate branches.)

To put it another way: in our whatsnew notice there is "*We recommend that all users upgrade to this version*", but I am actually not sure we should recommend that. I personally would not recommend upgrading no matter what, *without careful consideration*.

Regards,
Joris
On Feb 9, 2016, at 6:59 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Hi all,
I wanted to stir some discussion on pandas its policy on bug-fix releases and upgrading pains. First some context:
Context part 1: Currently we do not use maintenance branches for bugfix releases, and we actually also do not really do bugfix releases. We just develop further on master, and try to not merge breaking changes the first weeks/months, so we can do a minor kind of bug-fix release (but usually also with a lot of new features). But we don't, for example, backport fixes of regressions if they are fixed after master is pointing to the next major release.
Context part 2: pandas is not yet that stable, in the sense that there are still quite some breaking changes in each release. I am not arguing for not doing these breaking changes, as some of these changes are really needed to clean up the API (although there are also arguments for that, but I think that is another discussion). This has the consequence that updating your pandas version is not always that pleasant.
The third bit of context here is an eventual pandas 1.0. I could see us applying bug fixes to a pre-1.0 maintenance branch alongside the 1.x branch initially. Perhaps it’s worth practicing that policy a bit before we get to 1.0.
Sidenote: I have not that much experience with using pandas in a larger company or in larger codebases that need to be upgraded, rather with just my own code for my PhD. So it is difficult for me to judge on how much this is a problem or if bug-fix releases would help.
Questions:
• What are other people's experiences with upgrading pandas? And would more bug-fix releases actually ease the upgrading?
• Do we want to do more bug-fix releases?
• Having a maintenance branch and backporting fixes is extra work. Would we be able to handle this? Would it be worth the effort? (It has been mentioned before, but I think the main point raised was lack of manpower to maintain separate branches)
To put it another way. In our whatsnew notice there is "We recommend that all users upgrade to this version", but I am actually not sure we should recommend that. I personally do not always recommend that no matter what without careful consideration.
This <http://columbia-applied-data-science.github.io/pages/lowclass-python-style-g...> style guide was going around today, and it mentioned
The basic Pandas API is still changing. When possible, production code should use numpy or standard Python.
The copyright on that page is 2012 so it could be a bit dated (it’s also copyrighted to Chang She among others). I do think pandas is at the point in its development cycle where we should be more conservative. And I think we have been a bit, but perhaps we can advertise that more.
Regards, Joris
_______________________________________________
Pandas-dev mailing list
Pandas-dev@python.org
https://mail.python.org/mailman/listinfo/pandas-dev
- Tom
hi Joris,

I'm sorry it's taken a couple weeks to write a reply -- been really busy and wanted to put some thought into this.

This is a really important discussion given how important pandas has become to so many people, thank you for bringing it up.

On Tue, Feb 9, 2016 at 4:59 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Hi all,
I wanted to stir some discussion on pandas its policy on bug-fix releases and upgrading pains. First some context:
Context part 1: Currently we do not use maintenance branches for bugfix releases, and we actually also do not really do bugfix releases. We just develop further on master, and try to not merge breaking changes the first weeks/months, so we can do a minor kind of bug-fix release (but usually also with a lot of new features). But we don't, for example, backport fixes of regressions if they are fixed after master is pointing to the next major release.
I think in general it would be a good idea to tilt development away from new feature development and toward bug fixes and stability. Given that we are contemplating making some breaking changes in a 1.x development branch (like removing the Panel classes), we should decide at some point to create a 0.X.Y maintenance line where we can backport bug fixes only, so that "legacy pandas" users can have a "LTS" (in Ubuntu parlance) maintenance branch. This introduces some development overhead but it seems worth it.
Context part 2: pandas is not yet that stable, in the sense that there are still quite some breaking changes in each release. I am not arguing for not doing these breaking changes, as some of these changes are really needed to clean up the API (although there are also arguments for that, but I think that is another discussion). This has the consequence that updating your pandas version is not always that pleasant.
Over the years I've heard many horror stories from companies who have created and maintained internal 0.7.x, 0.8.x, or 0.9.x pandas forks because of the API breakage issues. This is definitely an anti-pattern that we should try to avoid happening in the future, but API breakages in many cases are the inevitable price of progress.

Some of the API breakage has resulted from experiences accumulated over a long period of time -- I made a lot of decisions early on in the project that ended up not being the right ones (e.g. resample default arguments changed at one point). There wasn't enough community engagement at that point to have a thorough design process to potentially come up with the "right" design first. In other cases, the "right" choice was perhaps more ambiguous.

API changes are most painful for users who do not write tests for their code that depends on pandas. That problem is probably not fixable =)

I think having stable releases with backports of serious correctness bugs helps mitigate this problem, whereas modest API changes between major releases. I would also be in favor of having point releases only contain bug fixes rather than the current system of point releases being a stable snapshot of trunk.

Since Jeff is the most affected by this on a day to day basis as de facto steward of the PR queue I would be curious what process he feels would be the most helpful.

- Wes
Thanks for bringing this up joris, here are some thoughts.

1) I agree that the next releases should probably focus on bug fixes. So this might mean we should shoot for 0.18.2....3 etc.

However, we do need a 0.19.0 in order to provide any big deprecations (Panel) and API changes that are needed.

2) I am a bit hesitant to even make a big break (1.0) because I have seen this just bifurcating people (e.g. do I upgrade now, what if I want compat). This just creates less community. So I think this should be a goal, that even though it's called 1.0 it is as back-compat as possible.

3) Releases can be big, and do fix lots of bugs, and usually introduce new ones. This is almost inevitable as we add new features, changes, and even bug fixes which occasionally have regressions (though test suite is pretty good, so hopefully not too often).

4) I don't relish backporting things. I think this could lead to lots of headaches and IMHO doesn't really buy much.

5) We don't want to just go into maintenance mode because we still have a fair amount of feature requests. (though these are often pretty targeted), but off of the top of my head, nothing really *new*, mainly some API changes to bring consistency. E.g. ``.agg`` on a DataFrame is a long-requested feature, which actually after 0.18.0 is quite trivial to do.

6) I think we telegraph any API changes and really really try to have back-compat, so people do have the ability to upgrade at their leisure.

API changes are most painful for users who do not write tests for their code that depends on pandas. That problem is probably not fixable =)

of course this is a telling point. pandas upgrades often expose bugs in user code. I view this as a good thing!

So given all of the somewhat contradictory points above, what do I really think we should do?

In order for pandas to be (even more) of a force in leading the scientific community, I think we have to grow. So having more contributors is a great thing. People do like / appreciate fixing bugs, but even more (IMHO), are performance enhancements and *some* new features.

I will probably try to do more bug-fixing (rather than large API's ish fixes) I think. There is quite a back-log. This should *slow* the issue of the BIG API changes.

So I am kind of -1 on backports for mostly 2), it seems to just slow things down, and 4) it can often lead to MORE things being inconsistent (you need machinery to ensure that what is backported is correct and is included). I can easily foresee that we decide to create 'stable' branches, which in fact are stable but might have inconsistent fixes, this is even more confusing in my view.

I think we have a fairly aggressive release cycle. We for sure don't want to debate everything. I am of the opinion that it is much better to put things out there quicker than to endlessly debate extremely minor points (not naming project names here :).

For the general user what we do w.r.t. release cycles probably doesn't matter, and for the corporate user, they almost always have a 'fixed' version anyhow (and then they do of course port the new ones, but then they have people upgrade it carefully). I am not so sure we should impose structure on this. We already have announced major releases and minor releases.

All for better 'language' in the minor releases.

Jeff
hey Jeff,

On Tue, Feb 23, 2016 at 12:11 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Thanks for bringing this up joris, here are some thoughts.
1) I agree that the next releases should probably focus on bug fixes. So this might mean we should shoot for 0.18.2....3 etc.
However, we do need a 0.19.0 in order to provide any big deprecations (Panel) and API changes that are needed.
2) I am a bit hesitant to even make a big break (1.0) because I have seen this just bifurcating people (e.g. do I upgrade now, what if I want compat). This just creates less community. So I think this should be a goal, that even though its called 1.0 it is as back-compat as possible.
Yeah, with more significant internal refactoring the goal would be to not break API compatibility unless absolutely necessary. However, fixing such horror shows as this

In [2]: import pandas as pd

In [3]: s = pd.Series([1,2,3])

In [4]: s
Out[4]:
0    1
1    2
2    3
dtype: int64

In [5]: import numpy as np

In [6]: s[1] = np.nan

In [7]: s
Out[7]:
0     1
1   NaN
2     3
dtype: float64

should be fair game.
3) Releases can be big, and do fix lots of bugs, and usually introduce new ones. This is almost inevitable as we add new features, changes, and even bug fixes which occasionally have regressions (though test suite is pretty good, so hopefully not too often).
4) I don't relish backporting things. I think this could lead to lots of headaches and IMHO doesn't really buy much.
I think what we are talking about is backporting bug fixes for major brokenness (e.g. serious correctness issues) or regressions that aren't caught by major release time. I think what's been happening in practice is that people are creating their own patched bugfix versions of releases to avoid the pain induced by API-breakage in major releases.

Obviously, we should continue to innovate and clean up the API (with judicious breakage where absolutely necessary -- I think the resampling cleanup is a good example where the net benefit in the long run will be high), but we have to take care of the user base, many of whom depend on pandas in production applications.

This is all made more difficult because there isn't any direct cash flow funding pandas development AFAICT. Where I work, for example, we have many employees who are responsible for creating patched builds and handling backports for otherwise API-stable branches of major Apache open source projects. But we can afford to do this because customers are paying for this (priority support and backports / patched builds).

So what I would suggest, in lieu of financial support for backports and maintenance builds, is that we consider maint-0.XX.X branches for backporting only the most serious of serious bug fixes ("Bad Bugs"). Major regressions and correctness issues should go into this bucket. Perhaps we can start doing this with 0.18.x -- as a matter of process if any PR appears to fix a Bad Bug it should be brought up here on the mailing list so we can decide whether it should be backported.
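As a rough illustration of the maint-0.XX.X idea above, here is a minimal Python sketch. It assumes plain 0.X.Y version strings as used in this thread; the helper names (`parse_version`, `maintenance_branch`, `is_bugfix_upgrade`) and the exact branch-name format are hypothetical and not part of pandas or its release tooling.

```python
import re

# Matches plain three-part versions such as "0.18.1" (no rc/dev suffixes).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(version):
    """Split a 'major.minor.micro' string into a tuple of ints."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError("expected a version like '0.18.1', got %r" % version)
    return tuple(int(part) for part in match.groups())


def maintenance_branch(version):
    """Name of the (hypothetical) maintenance branch a release belongs to."""
    major, minor, _ = parse_version(version)
    return "maint-%d.%d.x" % (major, minor)


def is_bugfix_upgrade(current, target):
    """True if `target` is a later release on the same maintenance line."""
    cur, tgt = parse_version(current), parse_version(target)
    return cur[:2] == tgt[:2] and tgt[2] > cur[2]
```

Under this assumed scheme, going from 0.18.0 to 0.18.1 stays on one maintenance line and would only pick up backported fixes, while going from 0.18.1 to 0.19.0 crosses to a new line where API changes may appear.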
5) We don't want to just go into maintenance mode because we still have a fair amount of feature requests. (though these are often pretty targeted), but off of the top of my head, nothing really *new*, mainly some API changes to bring consistency. E.g. ``.agg`` on a DataFrame is a long-requested feature, which actually after 0.18.0 is quite trivial to do.
Yeah, I think we should try to stick with https://en.wikipedia.org/wiki/Open/closed_principle -- so conveniences, extensions to existing APIs, and other helpful new features are fair game, but breaking API changes should be
6) I think we telegraph any API changes and really really try to have back-compat, so people do have the ability to upgrade at their leisure.
API changes are most painful for users who do not write tests for their code that depends on pandas. That problem is probably not fixable =)
of course this is a telling point. pandas upgrades often expose bugs in user code. I view this as a good thing!
So given all of the somewhat contradictory points above, what do I really think we should do?
In order for pandas to be (even more) of a force in leading the scientific community. I think we have to grow. So having more contributors is a great thing. People do like / appreciate fixing bugs, but even more (IMHO), are performance enhancements and *some* new features.
I will probably try to do more bug-fixing (rather than large API's ish fixes) I think. There is quite a back-log. This should *slow* the issue of the BIG API changes.
So I am kind of -1 on backports for mostly 2), it seems to just slow things down, and 4) it can often lead to MORE things being inconsistent (you need machinery to ensure that what is backported is correct and is included). I can easily foresee that we decide to create 'stable' branches, which in fact are stable but might have inconsistent fixes, this is even more confusing in my view.
Let me know what you think about my Bad Bug = backport policy. This is mostly about communication and keeping track of serious issues that should necessitate upgrading.

I also think we should try to keep minor releases API stable from here on out; so this may result in our version numbers increasing more quickly but that's OK for the improved communication about "what is a minor release (major release plus bug fix backports)".

- Wes
2016-03-08 23:48 GMT+01:00 Wes McKinney <wesmckinn@gmail.com>:
hey Jeff,
On Tue, Feb 23, 2016 at 12:11 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Thanks for bringing this up joris, here are some thoughts.
1) I agree that the next releases should probably focus on bug fixes. So this might mean we should shoot for 0.18.2....3 etc.
However, we do need a 0.19.0 in order to provide any big deprecations (Panel) and API changes that are needed.
2) I am a bit hesitant to even make a big break (1.0) because I have seen this just bifurcating people (e.g. do I upgrade now, what if I want compat). This just creates less community. So I think this should be a goal, that even though its called 1.0 it is as back-compat as possible.
Yeah, with more significant internal refactoring the goal would be to not break API compatibility unless absolutely necessary.
Jeff, to be clear, my initial mail was not meant to discuss whether to do a major breaking release or not, or whether to go into general maintenance mode or not (that's certainly an interesting discussion, but another one I think). The fact is that we are still cleaning things up and so we do make breaking changes in 0.X releases (like the resample changes now), and that won't stop right away. But given that context, we can think about how to do 0.X.X releases that help users upgrade as smoothly as possible. We now put quite a lot into micro 0.X.X bug-fix releases (including new features), which can have the consequence of introducing new bugs.
3) Releases can be big, and do fix lots of bugs, and usually introduce new ones. This is almost inevitable as we add new features, changes, and even bug fixes which occasionally have regressions (though test suite is pretty good, so hopefully not too often).
4) I don't relish backporting things. I think this could lead to lots of headaches and IMHO doesn't really buy much.
I think what we are talking about is backporting bug fixes for major brokenness (e.g. serious correctness issues) or regressions that aren't caught by major release time. I think what's been happening in practice is that people are creating their own patched bugfix versions of releases to avoid the pain induced by API-breakage in major releases.
Obviously, continuing to innovate and clean up the API (with judicious breakage where absolutely necessary -- I think the resampling cleanup is a good example where the net benefit in the long run will be high) but we have to take care of the user base, many of whom depend on pandas in production applications.
This is all made more difficult because there isn't any direct cash flow funding pandas development AFAICT. Where I work, for example, we have many employees who are responsible for creating patched builds and handling backports for otherwise API-stable branches of major Apache open source projects. But we can afford to do this because customers are paying for this (priority support and backports / patched builds).
So what I would suggest, in lieu of financial support for backports and maintenance builds, is that we consider maint-0.XX.X branches for backporting only the most serious of serious bug fixes ("Bad Bugs"). Major regressions and correctness issues should go into this bucket. Perhaps we can start doing this with 0.18.x -- as a matter of process if any PR appears to fix a Bad Bug it should be brought up here on the mailing list so we can decide whether it should be backported.
With regard to the possible concern of "this is too much work": I don't think many bug fixes would need to be backported. For example, the last micro release, 0.17.1, had quite a lot of new features and the whatsnew notes listed 50 bug fixes. But a lot of these bug fixes were not regressions, but bugs that were also in the previous releases. So if we restrict the 0.xx.x release to only regressions, it would be a much smaller set of maybe 10 to 15 bug fixes (rough estimate, didn't look into detail). In any case I think this would be a rather manageable amount.

So that would make our bug-fix releases smaller, and we also don't have to hold up master with breaking changes/larger new features until one or two bug-fix releases are released.

For me, the fixes that could go in such a bug-fix release:

- bug fixes or clean-up of rough edges of major new features in the 0.X release (for example for 0.18.1 possible changes to the newly introduced RangeIndex)
- regressions, issues that were not present in the previous 0.X release, and could therefore make it more difficult to upgrade + the correctness issues that Wes mentioned.
5) We don't want to just go into maintenance mode because we still have a fair amount of feature requests (though these are often pretty targeted). But off the top of my head, nothing really *new*, mainly some API changes to bring consistency. E.g. ``.agg`` on a DataFrame is a long-requested feature, which actually after 0.18.0 is quite trivial to do.
Yeah, I think we should try to stick with https://en.wikipedia.org/wiki/Open/closed_principle -- so conveniences, extensions to existing APIs, and other helpful new features are fair game, but breaking API changes should be
6) I think we telegraph any API changes and really really try to have back-compat, so people do have the ability to upgrade at their leisure.
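As an illustration of what "telegraphing" an API change can look like in Python, here is a minimal sketch of a deprecation shim. The function `aggregate` and its `how`/`func` keywords are hypothetical names chosen for the example, not an actual pandas signature; the only real API used is the standard library `warnings` module.

```python
import warnings


def aggregate(values, how=None, func=None):
    """Hypothetical API: `how` is the old keyword, `func` is its replacement."""
    if how is not None:
        # Keep the old spelling working for now, but announce the change.
        warnings.warn(
            "the 'how' keyword is deprecated, use 'func' instead",
            FutureWarning,
            stacklevel=2,
        )
        if func is None:
            func = how
    if func is None:
        func = sum  # default behaviour when no function is given
    return func(values)
```

Code that still passes `how=` keeps working and sees a `FutureWarning`, so users can upgrade first and adapt their code afterwards.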
API changes are most painful for users who do not write tests for their code that depends on pandas. That problem is probably not fixable =)
Of course this is a telling point. pandas upgrades often expose bugs in user code. I view this as a good thing!
So given all of the somewhat contradictory points above, what do I really think we should do?
In order for pandas to be (even more) of a force in leading the scientific community, I think we have to grow. So having more contributors is a great thing. People do like / appreciate fixing bugs, but even more so (IMHO) performance enhancements and *some* new features.
I will probably try to do more bug-fixing (rather than large API-ish fixes) I think. There is quite a back-log. This should *slow* the issue of the BIG API changes.
So I am kind of -1 on backports, mostly for 2), it seems to just slow things down, and 4), it can often lead to MORE things being inconsistent (you need machinery to ensure that what is backported is correct and is included). I can easily foresee that we decide to create 'stable' branches, which in fact are stable but might have inconsistent fixes; this is even more confusing in my view.
Let me know what you think about my Bad Bug = backport policy. This is mostly about communication and keeping track of serious issues that should necessitate upgrading.
I also think we should try to keep minor releases API stable from here on out; so this may result in our version numbers increasing more quickly but that's OK for the improved communication about "what is a minor release (major release plus bug fix backports)"
Just for clarity, with minor release, do you mean the 0.X releases? (because 0.X.X matches more the 'major release plus bug fix backports' description)
Joris
- Wes
I think we have a fairly aggressive release cycle. We for sure don't want to debate everything. I am of the opinion that it is much better to put things out there quicker than to endlessly debate extremely minor points (not naming project names here :).
For the general user what we do w.r.t. release cycles probably doesn't matter, and for the corporate user, they almost always have a 'fixed' version anyhow (and then they do of course port the new ones, but then they have people upgrade it carefully). I am not so sure we should impose structure on this. We already have announced major releases and minor releases.
All for better 'language' in the minor releases.
Jeff
On Tue, Feb 23, 2016 at 2:21 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
hi Joris,
I'm sorry it's taken a couple weeks to write a reply -- been really busy and wanted to put some thought into this.
This is a really important discussion given how important pandas has become to so many people, thank you for bringing it up.
On Tue, Feb 9, 2016 at 4:59 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
Hi all,
I wanted to stir some discussion on pandas' policy on bug-fix releases and upgrading pains. First some context:
Context part 1: Currently we do not use maintenance branches for bugfix releases, and we actually also do not really do bugfix releases. We just develop further on master, and try to not merge breaking changes the first weeks/months, so we can do a minor kind of bug-fix release (but usually also with a lot of new features). But we don't, for example, backport fixes of regressions if they are fixed after master is pointing to the next major release.
I think in general it would be a good idea to tilt development away from new feature development and toward bug fixes and stability. Given that we are contemplating making some breaking changes in a 1.x development branch (like removing the Panel classes), we should decide at some point to create a 0.X.Y maintenance line where we can backport bug fixes only, so that "legacy pandas" users can have a "LTS" (in Ubuntu parlance) maintenance branch. This introduces some development overhead but it seems worth it.
Context part 2: pandas is not yet that stable, in the sense that there are still quite some breaking changes in each release. I am not arguing for not doing these breaking changes, as some of these changes are really needed to clean up the API (although there are also arguments for that, but I think that is another discussion). This has the consequence that updating your pandas version is not always that pleasant.
Over the years I've heard many horror stories from companies who have created and maintained internal 0.7.x, 0.8.x, or 0.9.x pandas forks because of the API breakage issues. This is definitely an anti-pattern that we should try to avoid happening in the future, but API breakages in many cases are the inevitable price of progress.
Some of the API breakage has resulted from experiences accumulated over a long period of time -- I made a lot of decisions early on in the project that ended up not being the right ones (e.g. resample default arguments changed at one point). There wasn't enough community engagement at that point to have a thorough design process to potentially come up with the "right" design first. In other cases, the "right" choice was perhaps more ambiguous.
API changes are most painful for users who do not write tests for their code that depends on pandas. That problem is probably not fixable =)
I think having stable releases with backports of fixes for serious correctness bugs helps mitigate this problem, while keeping API changes between major releases modest. I would also be in favor of having point releases only contain bug fixes rather than the current system of point releases being a stable snapshot of trunk.
Since Jeff is the most affected by this on a day to day basis as de facto steward of the PR queue I would be curious what process he feels would be the most helpful.
- Wes
Sidenote: I have not that much experience with using pandas in a larger company or in larger codebases that need to be upgraded, rather with just my own code for my PhD. So it is difficult for me to judge on how much this is a problem or if bug-fix releases would help.
Questions:
- What are other people's experiences with upgrading pandas? And would more bug-fix releases actually ease the upgrading?
- Do we want to do more bug-fix releases?
- Having a maintenance branch and backporting fixes is extra work. Would we be able to handle this? Would it be worth the effort? (It has been mentioned before, but I think the main point raised was lack of manpower to maintain separate branches)
To put it another way. In our whatsnew notice there is "We recommend that all users upgrade to this version", but I am actually not sure we should recommend that. I personally do not always recommend that no matter what without careful consideration.
Regards, Joris
_______________________________________________ Pandas-dev mailing list Pandas-dev@python.org https://mail.python.org/mailman/listinfo/pandas-dev
On Tue, Mar 8, 2016 at 5:23 PM, Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:
2016-03-08 23:48 GMT+01:00 Wes McKinney <wesmckinn@gmail.com>:
hey Jeff,
On Tue, Feb 23, 2016 at 12:11 PM, Jeff Reback <jeffreback@gmail.com> wrote:
Thanks for bringing this up joris, here are some thoughts.
1) I agree that the next releases should probably focus on bug fixes. So this might mean we should shoot for 0.18.2....3 etc.
However, we do need a 0.19.0 in order to provide any big deprecations (Panel) and API changes that are needed.
2) I am a bit hesitant to even make a big break (1.0) because I have seen this just bifurcating people (e.g. do I upgrade now, what if I want compat). This just creates less community. So I think this should be a goal, that even though it's called 1.0 it is as back-compat as possible.
Yeah, with more significant internal refactoring the goal would be to not break API compatibility unless absolutely necessary.
Jeff, to be clear, my initial mail was not to discuss the issue whether to do a major breaking release or not, or going into general maintenance mode or not (that's certainly an interesting discussion, but another one I think). The fact is that we are still cleaning up things and so still introduce breakages in 0.X releases (like the resample now), and that won't directly stop.
But given that context, we can think about how to do 0.X.X releases that help users as much as possible to upgrade smoothly. We now put quite a lot in the micro 0.X.X bug-fix releases (including new features), which can have the consequence that they introduce new bugs.
3) Releases can be big, and do fix lots of bugs, and usually introduce new ones. This is almost inevitable as we add new features, changes, and even bug fixes which occasionally have regressions (though test suite is pretty good, so hopefully not too often).
4) I don't relish backporting things. I think this could lead to lots of headaches and IMHO doesn't really buy much.
I think what we are talking about is backporting bug fixes for major brokenness (e.g. serious correctness issues) or regressions that aren't caught by major release time. I think what's been happening in practice is that people are creating their own patched bugfix versions of releases to avoid the pain induced by API-breakage in major releases.
Obviously, we should continue to innovate and clean up the API (with judicious breakage where absolutely necessary -- I think the resampling cleanup is a good example where the net benefit in the long run will be high), but we have to take care of the user base, many of whom depend on pandas in production applications.
Just for clarity, with minor release, do you mean the 0.X releases? (because 0.X.X matches more the 'major release plus bug fix backports' description)
Sorry, I meant that 0.X.Y should be API stable with all other 0.X versions
Joris
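The clarification that 0.X.Y should be API stable with all other 0.X versions can be expressed as a small check. This is a sketch under that stated policy only; `release_series` and `same_series` are hypothetical helper names, and the parsing assumes plain dotted version strings such as "0.18.1".

```python
def release_series(version: str) -> tuple:
    """Return the (major, minor) series of a dotted version, e.g. '0.18.1' -> (0, 18)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_series(installed: str, candidate: str) -> bool:
    """Under the proposed policy, versions in the same 0.X series share an API."""
    return release_series(installed) == release_series(candidate)


# Moving within the 0.18 series should not require code changes under this
# policy, while moving to 0.19.0 may.
within_series = same_series("0.18.0", "0.18.1")
across_series = same_series("0.18.1", "0.19.0")
```

On the user side, the same idea corresponds to a requirement specifier such as `pandas>=0.18.0,<0.19`, which accepts bug-fix releases but not the next 0.X release.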
participants (4)
- Jeff Reback
- Joris Van den Bossche
- tom
- Wes McKinney