
On Wed, 2020-12-30 at 11:43 -0600, Sebastian Berg wrote:
On Wed, 2020-12-30 at 16:27 +0100, Ralf Gommers wrote: <snip>
That's very hard to describe, since it relies so much on previous experience and qualitative judgements. That's the main reason why I had more examples before, but they just led to more discussion about those examples - so that didn't quite have the intended effect.
<snip> I only took a short course and used this very little. I am sure there are many here with industry experience where the use of QA is everyday work.
One concept from there is to create a risk/danger and probability assessment, which can be ad hoc for your product. An example, just to make something up:
I am not sure anyone finds this interesting or whether it fits the NEP specifically [1], but I truly think it can be useful (although maybe it doesn't need to be formalized). So I fleshed it out: https://hackmd.io/WuS1rCzrTYOTgzUfRJUOnw (also pasted below)

My reasoning for suggesting it is that a process/formalism (no matter how ridiculous it may seem at first) for how to assess the impact of a backward incompatible change can be helpful by: conceptualizing the problem, clearly separating the backward incompatible impact assessment from the benefits assessment, making it easier to follow a decision/thought process, and allowing some nuance [2].

I actually believe that it can help with difficult decisions, even if only applied occasionally, and that it is not a burden because it provides fairly clear steps. Will it be useful often? Maybe not. But every time there is a proposal and we pause and hesitate because it is unclear whether it is worth the backcompat impact, I think this can provide a way to discuss it and come to a decision as objectively as possible. (And no, I do not think that any of the categories or mitigation strategies are an exact science.)

Cheers,

Sebastian

[1] This is additional to the proposed promises, such as two releases of deprecation warnings and discussing most/all deprecations on the mailing list, which are unrelated. It is rather to provide a formalism where currently only the examples give points of reference.

[2] There is a reason that also the Python version is short and intentionally fuzzy: https://www.python.org/dev/peps/pep-0387/ and https://discuss.python.org/t/pep-387-backwards-compatibilty-policy/4421 There are just few definite rules that can be formalized, so a framework for diligent assessment seems the best we can do (if we want to).


Assessing impact

Here "impact" means how unmodified code may be negatively affected by a change, ignoring any deprecation period. To get an idea about how much impact a change has, try to list all potential impacts. This will often be just a single item (the user of function x has to replace it with y), but it could be multiple different ones. After listing all potential impacts, rank them on the following two scales (do not yet think about how to make the transition easier):

1. Severity (How bad is the impact for an affected user?)
   * Minor: A performance regression or a change in (undocumented) warning/error category will fall here. This type of change would normally not require a deprecation cycle or special consideration.
   * Typical: Code must be updated to avoid an error; the update is simple to do in a way that works on both existing and future NumPy versions (see the sketch after these lists).
   * Severe: Code will error or crash, and there is no simple work-around or fix.
   * Critical: Code returns incorrect results. A change requiring massive effort may fall here. A hard crash (e.g. segfault) in itself is typically not critical.

2. Likelihood (How many users does the change affect?)
   * Rare: The change has very few impacted users (or even no known users after a code search). The normal assumption is that there is always someone affected, but a rarely used keyword argument of an already rarely used function will fall here.
   * Limited: The change is in a rarely used function or function argument. Another possibility is that it affects only a small group of very advanced users.
   * Common: The change affects a bigger audience or multiple large downstream libraries.
   * Ubiquitous: The change affects a large fraction of NumPy users.

The categories will not always be perfectly clear. That is OK.
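To make the "typical" severity a bit more concrete, here is a minimal sketch of the kind of user-side update meant above. The deprecated keyword spelling is made up (it is not an actual NumPy deprecation); the point is only that the fix is a small edit that keeps working on existing and future NumPy versions alike:

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)

# Hypothetical "typical" change: a keyword spelling is deprecated.
# Before the (made-up) deprecation a user might have written:
#   total = data.sum(axis=0, keep_dims=True)   # errors once removed
# The fix is a one-line rename that is valid on old and new releases:
total = data.sum(axis=0, keepdims=True)
print(total)
```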
Rather than establishing precise guidelines, the purpose is a structured process that can be reviewed. When the impact is exceptionally difficult to assess, it is often feasible to try a change on the development branch while signalling willingness to revert it. Downstream libraries test against it (and the release candidate), which gives a chance to correct an originally optimistic assessment.

After assessing each impact, it will fall somewhere on the following table:

| Severity \ Likelihood | Rare | Limited | Common | Ubiquitous |
|-----------------------|------|---------|--------|------------|
| Minor                 | ok   | ok      | ok     | ?          |
| Typical               | ok   | ?       |        | no?        |
| Severe                |      |         | no?    | no         |
| Critical              | no?  | no      | no     | no         |

Note that all changes should normally follow the two-release deprecation warning policy (except "minor" ones). The "no" fields mean a change is clearly unacceptable, although a NEP can always overrule it. This table only assesses the "impact". It does not assess how the impact compares to the benefits of the proposed change. That comparison must be favourable no matter how small the impact is. However, by assessing the impact, it will be easier to weigh it against the benefit. (Note that the table is not symmetric. An impact with "critical" severity is unlikely to be considered even when no known users are impacted.)


Mitigation and arguing of benefits

Any change falling outside the "ok" fields requires careful consideration. When an impact is larger, you can try to mitigate it and "move" on the table. Some possible ways to do this are:

* An avoidable warning for at least two releases (the policy for any change that modifies behaviour) reduces a change by one category (usually from "typical" to "minor" severity).
* The severity category may be reduced by creating an easy work-around (i.e. moving it from "severe" to "typical").
* Sometimes a change may break working code but also fix existing bugs; this can offset the severity. In extreme cases, this may warrant classifying a change as a bug-fix.
* For particularly noisy changes (i.e. the "ubiquitous" category), consider fixing downstream packages, delaying the warning, or using a PendingDeprecationWarning. Simply prolonging the deprecation period is also an option. This reduces how many users struggle with the change and smooths the transition.
* Exceptionally clear documentation and communication can be used to make the impact more acceptable. This may not be enough to move a category by itself, but it helps.

After mitigation, the benefits can be assessed:

* Any benefit of the change can be argued to "offset" the impact. If this is necessary, a broad community discussion on the mailing list is required. It should be clear that this does not actually "mitigate" the impact but rather argues that the benefit outweighs it.

These are not a fixed set of rules, but they provide a framework to assess and then try to mitigate the impact of a proposed change to an acceptable level. Arguing that a benefit can overcome multiple "impact" categories will require exceptionally large benefits, and most likely a NEP. For example, a change with an initial impact classification of "severe" and "ubiquitous" is unlikely to even be considered unless the severity can be reduced. Many deprecations will fall somewhere below or equal to a "typical and limited" impact (i.e. removal of an uncommon function argument). They receive a deprecation warning (sketched below) to make the impact acceptable, with a brief discussion that the change itself is worthwhile (i.e. the API is much cleaner afterwards). Any more disruptive change requires broad community discussion. This needs at least a discussion on the NumPy mailing list, and it is likely that the person proposing it will be asked to write a NEP.
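As an illustration of the warning-based mitigations above, here is a minimal sketch of the usual pattern; the function and argument names are made up, not an actual NumPy API:

```python
import warnings

def old_function(x, badly_named_arg=None):
    # Made-up example of the standard two-release deprecation pattern.
    if badly_named_arg is not None:
        # Emitted for at least two releases before removal; stacklevel=2
        # points the warning at the caller's code rather than this function.
        warnings.warn(
            "`badly_named_arg` is deprecated; use `well_named_arg` instead.",
            DeprecationWarning, stacklevel=2)
    return x

def very_widely_used_function(x):
    # For a particularly noisy ("ubiquitous") change, the warning can start
    # out as a PendingDeprecationWarning, which is hidden by default, and be
    # upgraded to a DeprecationWarning in a later release.
    warnings.warn(
        "this behaviour is scheduled to change in a future release",
        PendingDeprecationWarning, stacklevel=2)
    return x
```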
Summary and reasoning for this process

The aim of this process and table is to provide a loose formalism with the following goals:

* Diligence: Following this process ensures a detailed assessment of a change's impact without being distracted by the benefits. This is achieved by following well defined steps:
  1. Listing each potential impact (usually one).
  2. Assessing the severity.
  3. Assessing the likelihood.
  4. Discussing what steps are/can be taken to lower the impact, ignoring any benefits.
  5. If the impact is not low at this point, this should prompt considering and listing alternatives.
  6. Arguing that the benefits outweigh the remaining impact. (This is a distinct step: the original impact assessment stands as it was.)
* Transparency: Using this process for difficult decisions makes it easier for the reviewer and the community to follow how a decision was made and to criticize it.
* Nuance: When it is clear that an impact is larger than typical, this will prompt more care and thought. In some cases it may also clarify that a change is lower impact than it appears at first sight.
* Experience: Using a similar formalism for many changes makes it easier to learn from past decisions by providing an approach to compare and conceptualize them.

We aim to follow these steps in the future for difficult decisions. In general, any reviewer and community member may ask for this process to be followed for a proposed change. If the change is difficult, it will be worth the effort; if it is very low impact, it will be quick to clarify why.

NOTE: At this time the process is new and is expected to require clarification.


Examples

It should be stressed again that the categories will rarely be clear-cut, and they are intentionally categorized with some uncertainty below. Even unclear categories can help in forming a clearer idea of a change.

Histogram

The "histogram" example doesn't really add much with respect to this process. But noting the duplicated effort/impact would probably move it into a more severe category than most deprecations. That makes it a more difficult decision and indicates that careful thought should be spent on alternatives.

Integer indexing requirement

* Severity: Typical-Severe (each fix is fairly easy, but users often had to make many changes)
* Likelihood: Ubiquitous

How ubiquitous it really was probably only became clear after the (rc?) release. The change would now probably go through a NEP, as it initially falls into the lower right part of the table. To get into the "acceptable" part of the table we note that:

1. Real bugs were caught in the process (argued to reduce the severity).
2. The deprecation was delayed and longer than normal (argued to mitigate the number of affected users by giving much more time).

Even with these considerations, it still has a larger impact and clearly requires careful thought and community discussion about the benefits.
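For the integer indexing change, the user-side fix (as I understand that change) typically looked something like the sketch below; the exact deprecation timeline is not important here, only the shape of the fix:

```python
import numpy as np

a = np.arange(10)
i = len(a) / 2        # division yields a float (5.0), easy to produce by accident

# Older NumPy silently truncated such float indices, so `a[i]` "worked".
# Once indices were required to be integers this raises an error, and the
# fix is an explicit cast (or integer division), valid on old and new versions:
print(a[int(i)])       # -> 5
print(a[len(a) // 2])  # same result using floor division
```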
Removing financial functions

* Severity: Severe (on the high end)
* Likelihood: Limited (maybe common)

While not used by a large user base (limited), the removal is disruptive (severe). The change ultimately required a NEP, since it is not easy to weigh the maintenance advantage of removing the functions against the impact on their users. The NEP reduced the severity by providing a work-around: a pip-installable package as a drop-in replacement. For heavy users of these functions this will still be more severe than most deprecations, but it lowered the impact assessment enough to consider that the benefit of removal outweighs the impact.
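For completeness, a sketch of what that drop-in replacement looks like from the user side; the replacement package is numpy-financial (the one NEP 32 points users to), and the particular call is only an illustration:

```python
# Requires the replacement package:  pip install numpy-financial
import numpy_financial as npf

# Old code (the functions used to live in the main namespace):
#   import numpy as np
#   payment = np.pmt(0.05 / 12, 10 * 12, 100_000)
# New code -- effectively a drop-in replacement, only the namespace changes:
payment = npf.pmt(0.05 / 12, 10 * 12, 100_000)
print(payment)   # monthly payment for a 100k loan at 5% over 10 years
```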