
On Fri, Aug 26, 2022 at 4:15 PM Glyph <glyph@twistedmatrix.com> wrote:
On Aug 26, 2022, at 7:52 AM, Jean-Paul Calderone <exarkun@twistedmatrix.com> wrote:
I think property testing is a valuable tool in the testing toolbox and Twisted should consider adopting Hypothesis to employ it for some tests.
Thoughts?
I am somewhere in the neighborhood of +0 on this. In principle, I love Hypothesis and I feel like it can potentially reveal a lot of correctness issues, particularly in things like protocol parsing (hey, that's a thing we do!). In practice, my one experience using it in depth is with Hyperlink and Klein, where the main effect of its deployment seems to have been uninteresting test-strategy bugs provoking spurious, flaky test failures downstream.
I think this is a great point. Thanks for bringing it up. The Klein case is an interesting one, but I think it also represents a class of failures that we can easily set aside for Twisted. Let me check my understanding of the situation first: hyperlink exposes some Hypothesis strategies as part of its public API for other libraries to use to help them write tests more easily. Klein uses one of these strategies, and sometimes that strategy itself fails while building a value.

If that's correct, the simplest solution for Twisted is to say that we won't expose any Hypothesis strategies as part of the public API to be consumed by other projects this way. Of course this decision could be revisited later, after we convince ourselves we have a plan that would prevent this failure mode - but for Hypothesis use that's purely internal to Twisted, does it seem correct that this wouldn't be a problem? Only Twisted's own test suite would be subject to such failures (more on this below).

I spent a lot of words exploring the other examples you mentioned, but I've also summarized at the bottom, in case anyone wants to save themselves 10 or so minutes.
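(For anyone who hasn't used Hypothesis this way, the strategy-sharing pattern at issue looks roughly like this - the names are purely illustrative, not hyperlink's or Klein's actual API:

    from hypothesis import given, strategies as st

    # What the upstream library might export as public API:
    def urls():
        return st.builds(
            "{}://{}/".format,
            st.sampled_from(["http", "https"]),
            st.text(alphabet="abcdefghijklmnopqrstuvwxyz", min_size=1, max_size=10),
        )

    # ... and what a downstream project's test suite does with it.  If urls()
    # itself misbehaves while building a value, it is this downstream test
    # that fails, even though nothing downstream is wrong.
    @given(url=urls())
    def test_downstream_handles_any_url(url):
        assert url.startswith(("http://", "https://"))

The proposal above is just that Twisted would keep any such strategies internal to its own test suite rather than exporting them like this.)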
Here's the Klein issue: https://github.com/twisted/klein/issues/561
One interesting detail of this particular bug is that Hypothesis *has* spotted a real issue here, but not in the right way: in its public "generate some valid URL representations for testing" strategy, it has instead accidentally produced invalid garbage that Hyperlink should arguably deal with somehow but which it should never produce as output because it's garbage. It also flagged the issue in the wrong project, because we saw the failure in Klein and not in Hyperlink itself.
This is a bit off the point, but it looks like this particular instance is only so troublesome because it has gone unfixed in hyperlink for so long? The hyperlink issue doesn't seem to have gotten any attention from the maintainers ... maybe along with all of the rest of hyperlink :/ Now I'm a bit sad to see it isn't more actively maintained.
I didn't do this integration, so I may have missed its value. Perhaps on our way to this point, it revealed and fixed several important parsing bugs in Hyperlink or data-handling issues in Klein, but I'm not aware of any.
So the devil's in the details here. I don't know what configuration is required to achieve this, but Hypothesis should be able to generate new bug reports while *not* producing novel blocking failures on other (presumably unrelated) PRs. It seems like everyone agrees this is how it's *supposed* to be used, but in practice nobody bothers to figure out all the relevant configuration details and it becomes a source of never-ending suite flakiness that interrupts other work. I haven't fully read through all of these, but a quick search on GitHub reveals thousands and thousands of issues like these, where other projects' test suites take a reliability hit due to Hypothesis's stochastic nature:
- https://github.com/anachronauts/jeff65/issues/41
- https://github.com/Scille/parsec-cloud/issues/2864
- https://github.com/Scille/parsec-cloud/issues/2867
- https://github.com/astropy/astropy/issues/10724
- https://github.com/pytorch/pytorch/issues/9832
And I don't even know *how* to properly set up Hypothesis so this doesn't happen.
It's great to have these examples to consider. I see a few different kinds of failures here.

- jeff65#41 and astropy#10724 are running into something that I basically consider to be a misfeature of Hypothesis. By default, Hypothesis imposes a 200ms wallclock deadline on each example it runs. *Especially* on CI this is absolutely a recipe for irrelevant intermittent failures. The deadline is configurable <https://hypothesis.readthedocs.io/en/latest/settings.html#hypothesis.setting...> and it can also just be turned off, which is what I do at least for CI on all projects where I use Hypothesis (and sometimes in my regular dev environment too). I see the motivation for this feature - it's not that hard to write strategies that absolutely explode in terms of the complexity of the values being built, and this comes with a corresponding explosion in their runtime. I think this is the kind of thing you want to track with something like speed.twistedmatrix.com (no longer operational, I guess) - like any other performance concern.

- pytorch#9832 *appears* to just be a bug in pytorch - but I admit I didn't dig very deep into this one. The Hypothesis exception is saying that the code under test is itself non-deterministic. There's not much Hypothesis can do if you use it to say you want "f(x) == y" and then half the time f(x) == z (for y != z) instead. By *not* using Hypothesis, you might get a test suite that doesn't have this failure because you never notice the value of x that triggers the problem, or you might get a test suite that has the failure but around 200x less often (because with the default settings Hypothesis is going to try to run your test function 200 times on each test suite run) - so the problem is just more noticeable once you're using Hypothesis. Hypothesis helpfully reports the specific case that's a problem (in the case of this pytorch failure, it says it called the test function with batch_size=3, stride=1, pad=2, kernel=5, dilation=1, size=9, channels=7, order=u'NCHW', gc=, dc=[], and the first time it raised AssertionError: Gradient check failed for input X but on a subsequent call it did not). Since the test appears to involve floating point math ... I'm not surprised, probably? pytorch could probably use the example feature to turn this from a non-deterministic failure into a deterministic failure, if they wanted. There is another configurable feature related to this: you can "derandomize" <https://hypothesis.readthedocs.io/en/latest/settings.html#hypothesis.setting...> runs so that, while Hypothesis still calls your test with semi-random values from your strategies a bunch of times, it gives it the same sequence of values on every test run - so if the test passes once in your dev env you can be pretty sure it will pass on CI. Of course, this reduces the amount of testing you're letting Hypothesis do for you. But still, flaky CI is annoying, so maybe this is a reasonable way to configure CI. The Hypothesis docs themselves suggest running most CI derandomized, with a separate dedicated job that runs randomized - not for each PR/commit - to discover new bugs but keep them out of the usual development workflow.

- The parsec-cloud failures ... are tricky. parsec-cloud is using a very cool, advanced feature of Hypothesis for testing state machines, where Hypothesis essentially builds a path through the state space of a state machine as the "given". I've written a few of these and they're extremely powerful in terms of size of test code versus amount of coverage you get.
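To make that concrete, here's a minimal sketch of the kind of stateful test I mean - a made-up counter, nothing from parsec-cloud:

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

    class Counter:
        """Toy implementation under test."""
        def __init__(self):
            self.value = 0

        def add(self, amount):
            self.value += amount

        def reset(self):
            self.value = 0

    class CounterMachine(RuleBasedStateMachine):
        """Hypothesis drives a random sequence of rules, checks the
        invariant after every step, and shrinks any failing sequence."""

        def __init__(self):
            super().__init__()
            self.real = Counter()
            self.model = 0          # the expected state, tracked by hand

        @rule(amount=st.integers(min_value=0, max_value=100))
        def add(self, amount):
            self.real.add(amount)
            self.model += amount

        @rule()
        def reset(self):
            self.real.reset()
            self.model = 0

        @invariant()
        def agrees_with_model(self):
            assert self.real.value == self.model

    # The generated TestCase runs under any unittest-compatible runner.
    TestCounter = CounterMachine.TestCase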
But parsec-cloud is also using the "Bundle" feature from this part of Hypothesis which I've never really understand, *and *some kind of Hypothesis/trio integration that I'm not familiar with, *and* it seems to be exercising an *asynchronous* coroutine based state machine, *and* all of the output is filtered through pytest so it's not in the shape I'm used to ... Overall this really just means I don't know what's going on here. Hypothesis *says* that one of the *strategies* is behaving non-deterministically but honestly I'm not even 100% I see what strategies this test code is using. So, maybe this is an application bug, maybe it's a test bug, maybe it's a Hypothesis bug ... I dunno. I'd be happy to say that Twisted should steer clear of any asynchronous state machine testing with Hypothesis for the near future too. Maybe even any synchronous state machine testing (but automat should *totally* be leveraging that, by the way). Offhand I can think of a few areas of Twisted where the state machine testing might be handy but the regular stateless version is going to be useful much, much more often. There is one other area worth mentioning - and that's trial integration. Hypothesis basically works with trial with no extra work but it's not quite as polished as it could be. In particular, you can't use the given decorator with a test method that returns a Deferred and sometimes trial's stdout handling frustrates Hypothesis' attempts to report details about failures. I don't think either of these are very deep (and the former doesn't really bother me personally much since I prefer not to write Deferred-returning tests anymore) but they probably represent a bit more development work that *should* happen if we do start using Hypothesis widely in Twisted's test suite. Of course, doing this work will also benefit all downstream trial users who want to use Hypothesis.
So, in summary: I'd love to see Hypothesis help us discover and fix bugs, but if it is a choice between occasionally helping us find those bugs while constantly blocking important work with difficult-to-debug failures in unrelated parsing code that is now covered by Hypothesis, and "nothing", I'd go with "nothing".
So, I think the following steps/policies would address all of the concerns described here:

- Restrict Hypothesis usage to Twisted's own test suite (i.e., it is an implementation detail of a private part of Twisted, not the basis of any public interface)
- Turn off Hypothesis' deadline feature, probably everywhere
- At least for normal CI, and maybe for development environments, turn off Hypothesis' randomization feature. Perhaps set up a scheduled CI job that runs with randomization enabled.
- At least initially - and maybe indefinitely - avoid the stateful testing functionality.

Jean-Paul