
On Wed, 2020-12-30 at 11:43 -0600, Sebastian Berg wrote:
On Wed, 2020-12-30 at 16:27 +0100, Ralf Gommers wrote: <snip>
That's very hard to describe, since it relies so much on previous experience and qualitative judgements. That's the main reason why I had more examples before, but they just led to more discussion about those examples - so that didn't quite have the intended effect.
<snip> I only took a short course and used this very little. I am sure there are many here with industry experience where the use of QA is everyday work.
One concept from there is to create a risk/danger and probability assessment, which can be ad hoc for your product. An example, just to make something up:
I am not sure anyone finds this interesting or whether it fits the NEP specifically [1], but I truly think it can be useful (although maybe it doesn't need to be formalized). So I fleshed it out: https://hackmd.io/WuS1rCzrTYOTgzUfRJUOnw (also pasted below)

My reasoning for suggesting it is that a process/formalism (no matter how ridiculous it may seem at first) for how to assess the impact of a backward incompatible change can be helpful by: conceptualizing the problem, clearly separating the backward incompatible impact assessment from the benefits assessment, making it easier to follow a decision/thought process, and allowing some nuance [2].

I actually believe that it can help with difficult decisions, even if only applied occasionally, and that it is not a burden because it provides fairly clear steps. Will it be useful often? Maybe not. But every time there is a proposal and we pause and hesitate because it is unclear whether it is worth the backcompat impact, I think this can provide a way to discuss it and come to a decision as objectively as possible. (And no, I do not think that any of the categories or mitigation strategies are an exact science.)

Cheers,

Sebastian

[1] This is additional to the proposed promises, such as two releases of deprecation warnings and discussing most/all deprecations on the mailing list, which are unrelated. It is rather to provide a formalism where currently only the examples give points of reference.

[2] There is a reason that also the Python version is short and intentionally fuzzy: https://www.python.org/dev/peps/pep-0387/ and https://discuss.python.org/t/pep-387-backwards-compatibilty-policy/4421 There are just few definite rules that can be formalized, so a framework for diligent assessment seems the best we can do (if we want to).


Assessing impact

Here "impact" means how unmodified code may be negatively affected by a change, ignoring any deprecation period. To get an idea about how much impact a change has, try to list all potential impacts. This will often be just a single item (the user of function x has to replace it with y), but it could be multiple different ones. After listing all potential impacts, rank them on the following two scales (do not yet think about how to make the transition easier):

1. Severity (How bad is the impact for an affected user?)
   * Minor: A performance regression or a change in (undocumented) warning/error category will fall here. This type of change would normally not require a deprecation cycle or special consideration.
   * Typical: Code must be updated to avoid an error; the update is simple to do in a way that works on both existing and future NumPy versions (see the sketch after these lists).
   * Severe: Code will error or crash, and there is no simple work-around or fix.
   * Critical: Code returns incorrect results. A change requiring massive effort may fall here. A hard crash (e.g. segfault) in itself is typically not critical.

2. Likelihood (How many users does the change affect?)
   * Rare: The change has very few impacted users (or even no known users after a code search). The normal assumption is that there is always someone affected, but a rarely used keyword argument of an already rarely used function will fall here.
   * Limited: The change is in a rarely used function or function argument. Another possibility is that it affects only a small group of very advanced users.
   * Common: The change affects a bigger audience or multiple large downstream libraries.
   * Ubiquitous: The change affects a large fraction of NumPy users.

The categories will not always be perfectly clear. That is OK.
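To make the "typical" severity a bit more concrete, here is a minimal sketch of the kind of user-side update meant above. The deprecated keyword spelling is made up (it is not an actual NumPy deprecation); the point is only that the fix is a small edit that keeps working on existing and future NumPy versions alike:

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)

# Hypothetical "typical" change: a keyword spelling is deprecated.
# Before the (made-up) deprecation a user might have written:
#   total = data.sum(axis=0, keep_dims=True)   # errors once removed
# The fix is a one-line rename that is valid on old and new releases:
total = data.sum(axis=0, keepdims=True)
print(total)
```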
Rather than establishing precise guidelines, the purpose is a structured process that can be reviewed. When the impact is exceptionally difficult to assess, it is often feasible to try a change on the development branch while signalling willingness to revert it. Downstream libraries test against it (and the release candidate), which gives a chance to correct an originally optimistic assessment.

After assessing each impact, it will fall somewhere on the following table:

| Severity \ Likelihood | Rare | Limited | Common | Ubiquitous |
|-----------------------|------|---------|--------|------------|
| Minor                 | ok   | ok      | ok     | ?          |
| Typical               | ok   | ?       |        | no?        |
| Severe                |      |         | no?    | no         |
| Critical              | no?  | no      | no     | no         |

Note that all changes should normally follow the two-release deprecation warning policy (except "minor" ones). The "no" fields mean a change is clearly unacceptable, although a NEP can always overrule it. This table only assesses the "impact". It does not assess how the impact compares to the benefits of the proposed change. That comparison must be favourable no matter how small the impact is. However, by assessing the impact, it will be easier to weigh it against the benefit. (Note that the table is not symmetric. An impact with "critical" severity is unlikely to be considered even when no known users are impacted.)


Mitigation and arguing of benefits

Any change falling outside the "ok" fields requires careful consideration. When an impact is larger, you can try to mitigate it and "move" on the table. Some possible ways to do this are:

* An avoidable warning for at least two releases (the policy for any change that modifies behaviour) reduces a change by one category (usually from "typical" to "minor" severity).
* The severity category may be reduced by creating an easy work-around (i.e. moving it from "severe" to "typical").
* Sometimes a change may break working code but also fix existing bugs; this can offset the severity. In extreme cases, this may warrant classifying a change as a bug-fix.
* For particularly noisy changes (i.e. the "ubiquitous" category), consider fixing downstream packages, delaying the warning, or using a PendingDeprecationWarning. Simply prolonging the deprecation period is also an option. This reduces how many users struggle with the change and smooths the transition.
* Exceptionally clear documentation and communication can be used to make the impact more acceptable. This may not be enough to move a category by itself, but it helps.

After mitigation, the benefits can be assessed:

* Any benefit of the change can be argued to "offset" the impact. If this is necessary, a broad community discussion on the mailing list is required. It should be clear that this does not actually "mitigate" the impact but rather argues that the benefit outweighs it.

These are not a fixed set of rules, but they provide a framework to assess and then try to mitigate the impact of a proposed change to an acceptable level. Arguing that a benefit can overcome multiple "impact" categories will require exceptionally large benefits, and most likely a NEP. For example, a change with an initial impact classification of "severe" and "ubiquitous" is unlikely to even be considered unless the severity can be reduced. Many deprecations will fall somewhere below or equal to a "typical and limited" impact (i.e. removal of an uncommon function argument). They receive a deprecation warning (sketched below) to make the impact acceptable, with a brief discussion that the change itself is worthwhile (i.e. the API is much cleaner afterwards). Any more disruptive change requires broad community discussion. This needs at least a discussion on the NumPy mailing list, and it is likely that the person proposing it will be asked to write a NEP.
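As an illustration of the warning-based mitigations above, here is a minimal sketch of the usual pattern; the function and argument names are made up, not an actual NumPy API:

```python
import warnings

def old_function(x, badly_named_arg=None):
    # Made-up example of the standard two-release deprecation pattern.
    if badly_named_arg is not None:
        # Emitted for at least two releases before removal; stacklevel=2
        # points the warning at the caller's code rather than this function.
        warnings.warn(
            "`badly_named_arg` is deprecated; use `well_named_arg` instead.",
            DeprecationWarning, stacklevel=2)
    return x

def very_widely_used_function(x):
    # For a particularly noisy ("ubiquitous") change, the warning can start
    # out as a PendingDeprecationWarning, which is hidden by default, and be
    # upgraded to a DeprecationWarning in a later release.
    warnings.warn(
        "this behaviour is scheduled to change in a future release",
        PendingDeprecationWarning, stacklevel=2)
    return x
```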
Summary and reasoning for this process

The aim of this process and table is to provide a loose formalism with the following goals:

* Diligence: Following this process ensures a detailed assessment of a change's impact without being distracted by the benefits. This is achieved by following well defined steps:
  1. Listing each potential impact (usually one).
  2. Assessing the severity.
  3. Assessing the likelihood.
  4. Discussing what steps are/can be taken to lower the impact, ignoring any benefits.
  5. If the impact is not low at this point, this should prompt considering and listing alternatives.
  6. Arguing that the benefits outweigh the remaining impact. (This is a distinct step: the original impact assessment stands as it was.)
* Transparency: Using this process for difficult decisions makes it easier for the reviewer and the community to follow how a decision was made and to criticize it.
* Nuance: When it is clear that an impact is larger than typical, this will prompt more care and thought. In some cases it may also clarify that a change is lower impact than it appears at first sight.
* Experience: Using a similar formalism for many changes makes it easier to learn from past decisions by providing an approach to compare and conceptualize them.

We aim to follow these steps in the future for difficult decisions. In general, any reviewer and community member may ask for this process to be followed for a proposed change. If the change is difficult, it will be worth the effort; if it is very low impact, it will be quick to clarify why.

NOTE: At this time the process is new and is expected to require clarification.


Examples

It should be stressed again that the categories will rarely be clear-cut, and they are intentionally categorized with some uncertainty below. Even unclear categories can help in forming a clearer idea of a change.

Histogram

The "histogram" example doesn't really add much with respect to this process. But noting the duplicated effort/impact would probably move it into a more severe category than most deprecations. That makes it a more difficult decision and indicates that careful thought should be spent on alternatives.

Integer indexing requirement

* Severity: Typical-Severe (each fix is fairly easy, but users often had to make many changes)
* Likelihood: Ubiquitous

How ubiquitous it really was probably only became clear after the (rc?) release. The change would now probably go through a NEP, as it initially falls into the lower right part of the table. To get into the "acceptable" part of the table we note that:

1. Real bugs were caught in the process (argued to reduce the severity).
2. The deprecation was delayed and longer than normal (argued to mitigate the number of affected users by giving much more time).

Even with these considerations, it still has a larger impact and clearly requires careful thought and community discussion about the benefits.
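For the integer indexing change, the user-side fix (as I understand that change) typically looked something like the sketch below; the exact deprecation timeline is not important here, only the shape of the fix:

```python
import numpy as np

a = np.arange(10)
i = len(a) / 2        # division yields a float (5.0), easy to produce by accident

# Older NumPy silently truncated such float indices, so `a[i]` "worked".
# Once indices were required to be integers this raises an error, and the
# fix is an explicit cast (or integer division), valid on old and new versions:
print(a[int(i)])       # -> 5
print(a[len(a) // 2])  # same result using floor division
```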
Removing financial functions

* Severity: Severe (on the high end)
* Likelihood: Limited (maybe common)

While not used by a large user base (limited), the removal is disruptive (severe). The change ultimately required a NEP, since it is not easy to weigh the maintenance advantage of removing the functions against the impact on their users. The NEP reduced the severity by providing a work-around: a pip-installable package as a drop-in replacement. For heavy users of these functions this will still be more severe than most deprecations, but it lowered the impact assessment enough to consider that the benefit of removal outweighs the impact.
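For completeness, a sketch of what that drop-in replacement looks like from the user side; the replacement package is numpy-financial (the one NEP 32 points users to), and the particular call is only an illustration:

```python
# Requires the replacement package:  pip install numpy-financial
import numpy_financial as npf

# Old code (the functions used to live in the main namespace):
#   import numpy as np
#   payment = np.pmt(0.05 / 12, 10 * 12, 100_000)
# New code -- effectively a drop-in replacement, only the namespace changes:
payment = npf.pmt(0.05 / 12, 10 * 12, 100_000)
print(payment)   # monthly payment for a 100k loan at 5% over 10 years
```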