Hello all,

I'd like to start a discussion (but hopefully a brief one!) about using property testing techniques in Twisted.

For those who aren't familiar with them, a property test is a kind of unit test written such that it can assert a certain invariant holds over a range of inputs. This is in contrast with more conventional unit tests in which a single test asserts a certain invariant holds for a single input and typically many such tests are written to cover the necessary range of inputs.

Twisted mostly uses conventional unit tests though it does include a few property tests written in an ad hoc way. If you search for loops in the test suite you quickly come across them. For example, twisted.test.test_amp.ParsingTests includes several. test_booleanValues does not use a loop but it is a property test. test_ParsingRoundTrip does use a loop and is also a property test. In these cases, the developer knew they wanted to exercise the implementation against more than one input but they were limited in how exhaustive they could be by the tools available.

There is a popular library for property testing in Python called Hypothesis. I've pushed some changes to *a branch* <https://github.com/twisted/twisted/compare/26c3c04e55d0b25532709e1e1dc5d1433dde80f3..exarkun:twisted:hypothesis-examples?expand=1> to demonstrate usage (note: *not* currently for review) of some of the basic APIs and the tests that result:

- The given decorator marks a test as a property test. The consequence of this is usually that the test method runs more than once. The "strategies" passed to given determine how it is called.
- Strategies like tuples, lists, integers specify what values are passed to the decorated function. Each will correspond to a parameter the test method accepts. A simple strategy like integers will result in int values being passed. Many strategies can be composed. A strategy like tuples will result in tuple values being passed, with the elements of the value determined by the strategies passed to the tuples strategy.
- Helper functions like assume support dealing with corner cases related to invalid inputs which are tricky to avoid in a strategy. If it is hard to constructively define exactly the right set of inputs, you can define a strategy that finds "too many" and filter out the bad ones afterwards.

The test is then written to make an assertion that is correct with respect to the specific values the test is called with. For complex code, it can take a while to get used to writing tests in this style but for simpler code it's often only a matter of using the strategy-provided values instead of some hard-coded values.

Once using given, the test method actually runs many times, each time with different values "drawn" from the given strategies. When the test fails, Hypothesis tries to "shrink" the values - it simplifies them as far as it can while still causing the test to fail. When it has simplified them as much as it can without making the test pass, those simpler values are reported. This means that Hypothesis can find complex inputs that break the code under test but when *simple* values will do, those are the ones a developer gets to work with.

There are many other benefits to this style of testing, such as reduced effort required to write comprehensive tests and improved code re-use in the test suite. I won't describe them further here but you can learn more at the Hypothesis website - https://hypothesis.works/.
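To make the above concrete, here is a minimal sketch of what such a test looks like. It is not taken from the linked branch; encode and decode are hypothetical stand-ins for whatever round-trip code would actually be under test.

    # A minimal sketch; encode/decode are hypothetical stand-ins, not
    # functions from Twisted or from the branch linked above.
    from hypothesis import assume, given
    from hypothesis import strategies as st

    from twisted.trial.unittest import SynchronousTestCase


    def encode(value: int) -> bytes:
        return b"%d" % (value,)


    def decode(data: bytes) -> int:
        return int(data)


    class RoundTripTests(SynchronousTestCase):
        @given(st.integers())
        def test_roundTrip(self, value):
            # The property: decoding an encoded value gives back the
            # original, for every int Hypothesis draws from the strategy.
            self.assertEqual(decode(encode(value)), value)

        @given(st.tuples(st.integers(), st.integers()))
        def test_distinctPair(self, pair):
            # assume() discards draws that don't satisfy a precondition,
            # instead of encoding the constraint into the strategy itself.
            low, high = pair
            assume(low != high)
            self.assertNotEqual(min(pair), max(pair))

When test_roundTrip fails, Hypothesis shrinks the failing integer toward a minimal counterexample before reporting it.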
I think property testing is a valuable tool in the testing toolbox and Twisted should consider adopting Hypothesis to employ it for some tests. Thoughts? Jean-Paul
On Fri, 26 Aug 2022 at 15:55, Jean-Paul Calderone <exarkun@twistedmatrix.com> wrote:
Hello all,
I'd like to start a discussion (but hopefully a brief one!) about using property testing techniques in Twisted.
For those who aren't familiar with them, a property test is a kind of unit test written such that it can assert a certain invariant holds over a range of inputs. This is in contrast with more conventional unit tests in which a single test asserts a certain invariant holds for a single input and typically many such tests are written to cover the necessary range of inputs.
Twisted mostly uses conventional unit tests though it does include a few property tests written in an ad hoc way. If you search for loops in the test suite you quickly come across them. For example, twisted.test.test_amp.ParsingTests includes several. test_booleanValues does not use a loop but it is a property test. test_ParsingRoundTrip does use a loop and is also a property test. In these cases, the developer knew they wanted to exercise the implementation against more than one input but they were limited in how exhaustive they could be by the tools available.
There is a popular library for property testing in Python called Hypothesis. I've pushed some changes to a branch to demonstrate usage (note: not currently for review) of some of the basic APIs and the tests that result:
- The given decorator marks a test as a property test. The consequence of this is usually that the test method runs more than once. The "strategies" passed to given determine how it is called.
- Strategies like tuples, lists, integers specify what values are passed to the decorated function. Each will correspond to a parameter the test method accepts. A simple strategy like integers will result in int values being passed. Many strategies can be composed. A strategy like tuples will result in tuple values being passed, with the elements of the value determined by the strategies passed to the tuples strategy.
- Helper functions like assume support dealing with corner cases related to invalid inputs which are tricky to avoid in a strategy. If it is hard to constructively define exactly the right set of inputs, you can define a strategy that finds "too many" and filter out the bad ones afterwards.
The test is then written to make an assertion that is correct with respect to the specific values the test is called with. For complex code, it can take a while to get used to writing tests in this style but for simpler code it's often only a matter of using the strategy-provided values instead of some hard-coded values.
Once using given, the test method actually runs many times, each time with different values "drawn" from the given strategies. When the test fails, Hypothesis tries to "shrink" the values - it simplifies them as far as it can while still causing the test to fail. When it has simplified them as much as it can without making the test pass, those simpler values are reported. This means that Hypothesis can find complex inputs that break the code under test but when simple values will do, those are the ones a developer gets to work with.
There are many other benefits to this style of testing, such as reduced effort required to write comprehensive tests and improved code re-use in the test suite. I won't describe them further here but you can learn more at the Hypothesis website - https://hypothesis.works/.
I think property testing is a valuable tool in the testing toolbox and Twisted should consider adopting Hypothesis to employ it for some tests.
Thoughts?
Jean-Paul
Hi Jean-Paul,

Thanks for the info. I am +1 for using hypothesis.

My initial reaction to hypothesis and hamcrest and attrs was -1 :) The reason was that they require extra knowledge to contribute to Twisted. At the same time, I think that hypothesis and attrs should be considered kind of "stdlib" or best practice. So if you are a Python dev, you should learn them :)

--
Adi Roiban
On Aug 26, 2022, at 7:52 AM, Jean-Paul Calderone <exarkun@twistedmatrix.com> wrote:
I think property testing is a valuable tool in the testing toolbox and Twisted should consider adopting Hypothesis to employ it for some tests.
Thoughts?
I am somewhere in the neighborhood of +0 on this.

In principle, I love hypothesis and I feel like it can potentially reveal a lot of correctness issues, particularly in things like protocol parsing (hey, that's a thing we do!).

In practice, my one experience using it in depth is with Hyperlink and Klein, where the main effect of its deployment seems to have been to have uninteresting test strategy bugs provoking spurious flaky test failures downstream. Here's the Klein issue: https://github.com/twisted/klein/issues/561

One interesting detail of this particular bug is that Hypothesis *has* spotted a real issue here, but not in the right way: in its public "generate some valid URL representations for testing" strategy, it has instead accidentally produced invalid garbage that Hyperlink should arguably deal with somehow but which it should never produce as output because it's garbage. It also flagged the issue in the wrong project, because we saw the failure in Klein and not in Hyperlink itself.

I didn't do this integration, so I may have missed its value. Perhaps on our way to this point, it revealed and fixed several important parsing bugs in Hyperlink or data-handling issues in Klein, but I'm not aware of any.

So the devil's in the details here. I don't know what configuration is required to achieve this, but Hypothesis should be able to generate new bug reports, but *not* novel blocking failures on other (presumably unrelated) PRs. It seems like everyone agrees this is how it's *supposed* to be used, but in practice nobody bothers to figure out all the relevant configuration details and it becomes a source of never-ending suite flakiness that interrupts other work.

I haven't fully read through all of these, but a quick search on GitHub reveals thousands and thousands of issues like these, where other projects' test suites take a reliability hit due to Hypothesis's stochastic nature:

- https://github.com/anachronauts/jeff65/issues/41
- https://github.com/Scille/parsec-cloud/issues/2864
- https://github.com/Scille/parsec-cloud/issues/2867
- https://github.com/astropy/astropy/issues/10724
- https://github.com/pytorch/pytorch/issues/9832

And I don't even know *how* to properly set up Hypothesis so this doesn't happen.

So, in summary: I'd love to see Hypothesis help us discover and fix bugs, but if it is a choice between occasionally helping us find those bugs but constantly blocking important work with difficult-to-debug failures in unrelated parsing code that is now covered by hypothesis, or "nothing", I'd go with "nothing".

-g
On Fri, Aug 26, 2022 at 4:15 PM Glyph <glyph@twistedmatrix.com> wrote:
On Aug 26, 2022, at 7:52 AM, Jean-Paul Calderone < exarkun@twistedmatrix.com> wrote:
I think property testing is a valuable tool in the testing toolbox and Twisted should consider adopting Hypothesis to employ it for some tests.
Thoughts?
I am somewhere in the neighborhood of +0 on this. In principle, I love hypothesis and I feel like it can potentially reveal a lot of correctness issues, particularly in things like protocol parsing (hey, that's a thing we do!). In practice, my one experience using it in depth is with Hyperlink and Klein, where the main effect of its deployment seems to have been to have uninteresting test strategy bugs provoking spurious flaky test failures downstream.
I think this is a great point. Thanks for bringing it up.

The Klein case is an interesting one but I think it also represents a class of failures that we can easily set aside for Twisted. Let me check my understanding of the situation first: hyperlink exposes some Hypothesis strategies as part of its public API for other libraries to use to help them write tests more easily. Klein uses one of these strategies and sometimes that strategy itself fails while building a value.

If that's correct, the simplest solution for Twisted is to say that we won't expose any Hypothesis strategies as part of the public API to be consumed by other projects this way. Of course this decision could be revisited later after we convince ourselves we have a plan that would prevent this failure mode - but for Hypothesis use that's purely internal to Twisted, does it seem correct that this wouldn't be a problem? Only Twisted's own test suite would be subject to such failures (more on this below).

I spent a lot of words exploring the other examples you mentioned but I've also summarized at the bottom, in case anyone wants to save themselves 10 or so minutes.
Here's the Klein issue: https://github.com/twisted/klein/issues/561
One interesting detail of this particular bug is that Hypothesis *has* spotted a real issue here, but not in the right way: in its public "generate some valid URL representations for testing" strategy, it has instead accidentally produced invalid garbage that Hyperlink should arguably deal with somehow but which it should never produce as output because it's garbage. It also flagged the issue in the wrong project, because we saw the failure in Klein and not in Hyperlink itself.
This is a bit off the point, but it looks like this particular instance is only so troublesome because it has gone unfixed in hyperlink for so long? The hyperlink issue doesn't seem to have gotten any attention from the maintainers ... maybe along with all of the rest of hyperlink :/ Now I'm a bit sad to see it isn't more actively maintained.
I didn't do this integration, so I may have missed its value. Perhaps on our way to this point, it revealed and fixed several important parsing bugs in Hyperlink or data-handling issues in Klein, but I'm not aware of any.
So the devil's in the details here. I don't know what configuration is required to achieve this, but Hypothesis should be able to generate new bug reports, but *not* novel blocking failures on other (presumably unrelated) PRs. It seems like everyone agrees this is how it's *supposed* to be used, but in practice nobody bothers to figure out all the relevant configuration details and it becomes a source of never-ending suite flakiness that interrupts other work. I haven't fully read through all of these, but a quick search on Github reveals thousands and thousands of issues like these, where others projects' test suites take a reliability hit due to Hypothesis's stochastic nature:
- https://github.com/anachronauts/jeff65/issues/41
- https://github.com/Scille/parsec-cloud/issues/2864
- https://github.com/Scille/parsec-cloud/issues/2867
- https://github.com/astropy/astropy/issues/10724
- https://github.com/pytorch/pytorch/issues/9832
And I don't even know *how* to properly set up Hypothesis so this doesn't happen.
It's great to have these examples to consider. I see a few different kinds of failures here.

- jeff65#41 and astropy#10724 are running into something that I basically consider to be a misfeature of Hypothesis. By default Hypothesis imposes a 200ms wallclock deadline on each example. *Especially* on CI this is absolutely a recipe for irrelevant intermittent failures. The deadline is configurable <https://hypothesis.readthedocs.io/en/latest/settings.html#hypothesis.settings.deadline> and it can also just be turned off, which is what I do at least for CI for all projects where I use Hypothesis (and sometimes in my regular dev environment too). I see the motivation for this feature - it's not that hard to write strategies that absolutely explode in terms of complexity of the values being built, and this comes with a corresponding explosion in their runtime. I think this is the kind of thing you want to track with something like speed.twistedmatrix.com (no longer operational, I guess) - like any other performance concern.

- pytorch#9832 *appears* to just be a bug in pytorch - but I admit I didn't dig very deep into this one. The Hypothesis exception is saying that the code under test is itself non-deterministic. There's not much Hypothesis can do if you use it to say you want "f(x) == y" and then half the time f(x) == z (for y != z) instead. By *not* using Hypothesis, you might get a test suite that doesn't have this failure because you never notice the value of x that triggers the problem, or you might get a test suite that has the failure but around 200x less often (because in the default settings Hypothesis is going to try to run your test function 200 times on each test suite run) and so it's just more noticeable once you're using Hypothesis. Hypothesis helpfully reports the specific case that's a problem (in the case of this pytorch failure, it says it called the test function with batch_size=3, stride=1, pad=2, kernel=5, dilation=1, size=9, channels=7, order=u'NCHW', gc=, dc=[]) and the first time it raised AssertionError: Gradient check failed for input X but on a subsequent call it did not. Since the test appears to involve floating point math ... I'm not surprised, probably? pytorch could probably use the example feature to turn this from a non-deterministic failure into a deterministic failure, if they wanted. There is another configurable feature related to this. You can "derandomize <https://hypothesis.readthedocs.io/en/latest/settings.html#hypothesis.settings.derandomize>" runs so that, while Hypothesis still calls your test with semi-random values from your strategies a bunch of times, it gives it the same sequence of values on every test run - so if the test passes once in your dev env you can be pretty sure it will pass on CI. Of course, this is because you're reducing the amount of testing you're letting Hypothesis do for you. But still, flaky CI is annoying so maybe this is a reasonable way to configure CI. The Hypothesis docs themselves suggest you could have most CI run derandomized, plus a separate dedicated job that runs randomized - not for each PR/commit - to discover new bugs but keep them out of the usual development workflow.

- The parsec-cloud failures ... are tricky. parsec-cloud is using a very cool, advanced feature of Hypothesis for testing state machines where Hypothesis essentially builds a path through the state space of a state machine as the "given".
I've written a few of these and they're extremely powerful in terms of size of test code versus amount of coverage you get (a toy sketch of this stateful style follows below). But parsec-cloud is also using the "Bundle" feature from this part of Hypothesis which I've never really understood, *and* some kind of Hypothesis/trio integration that I'm not familiar with, *and* it seems to be exercising an *asynchronous* coroutine based state machine, *and* all of the output is filtered through pytest so it's not in the shape I'm used to ... Overall this really just means I don't know what's going on here. Hypothesis *says* that one of the *strategies* is behaving non-deterministically but honestly I'm not even 100% sure I see what strategies this test code is using. So, maybe this is an application bug, maybe it's a test bug, maybe it's a Hypothesis bug ... I dunno.

I'd be happy to say that Twisted should steer clear of any asynchronous state machine testing with Hypothesis for the near future too. Maybe even any synchronous state machine testing (but automat should *totally* be leveraging that, by the way). Offhand I can think of a few areas of Twisted where the state machine testing might be handy but the regular stateless version is going to be useful much, much more often.

There is one other area worth mentioning - and that's trial integration. Hypothesis basically works with trial with no extra work but it's not quite as polished as it could be. In particular, you can't use the given decorator with a test method that returns a Deferred, and sometimes trial's stdout handling frustrates Hypothesis' attempts to report details about failures. I don't think either of these is very deep (and the former doesn't really bother me personally much since I prefer not to write Deferred-returning tests anymore) but they probably represent a bit more development work that *should* happen if we do start using Hypothesis widely in Twisted's test suite. Of course, doing this work will also benefit all downstream trial users who want to use Hypothesis.
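For readers who haven't seen the stateful style, here is a toy sketch (a made-up counter example, not Twisted or parsec-cloud code): Hypothesis chains @rule-decorated steps into random paths through the state space and shrinks any failing sequence of steps.

    # Toy example of hypothesis.stateful; the counter is a stand-in for
    # whatever real stateful object would be under test.
    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, invariant, rule


    class CounterMachine(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.count = 0  # the "real" object under test (a stand-in)
            self.model = 0  # a simple model to compare against

        @rule(n=st.integers(min_value=0, max_value=100))
        def increment(self, n):
            self.count += n
            self.model += n

        @rule()
        def reset(self):
            self.count = 0
            self.model = 0

        @invariant()
        def agrees_with_model(self):
            # Checked after every step of every generated path.
            assert self.count == self.model


    # trial/unittest discovers this generated TestCase like any other.
    TestCounter = CounterMachine.TestCase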
So, in summary: I'd love to see Hypothesis help us discover and fix bugs but if it is a choice between occasionally helping us find those bugs but constantly blocking important work with difficult-to-debug failures in unrelated parsing code that is now covered by hypothesis, or "nothing", I'd go with "nothing".
So, I think the following steps/policies would address all of the concerns described here:

- Restrict Hypothesis usage to Twisted's own test suite (i.e., it is an implementation detail of a private part of Twisted, not the basis of any public interface)
- Turn off Hypothesis' deadline feature, probably everywhere
- At least for normal CI, and maybe for development environments, turn off Hypothesis' randomization feature. Perhaps set up a scheduled CI job that runs with randomization enabled.
- At least initially - and maybe indefinitely - avoid the stateful testing functionality.

Jean-Paul
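For concreteness, one way the deadline and randomization policies above might be wired up is with Hypothesis settings profiles. This is only a sketch; the profile names and the environment variable used to select one are made up for illustration, not an existing Twisted convention.

    import os

    from hypothesis import settings

    # Disable the wall-clock deadline everywhere.
    settings.register_profile("default", deadline=None)

    # On normal CI, also derandomize so a run that passes keeps passing.
    settings.register_profile("ci", deadline=None, derandomize=True)

    # A scheduled job could stay fully randomized to keep hunting for new bugs.
    settings.register_profile("scheduled", deadline=None, derandomize=False)

    settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "default"))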
Hi,

I had this in my drafts for some time... so I am sending what I have, as I don't know when I will have time to write more.

On Sat, 27 Aug 2022 at 00:40, Jean-Paul Calderone <exarkun@twistedmatrix.com> wrote:
On Fri, Aug 26, 2022 at 4:15 PM Glyph <glyph@twistedmatrix.com> wrote:
[snip]
So, in summary: I'd love to see Hypothesis help us discover and fix bugs but if it is a choice between occasionally helping us find those bugs but constantly blocking important work with difficult-to-debug failures in unrelated parsing code that is now covered by hypothesis, or "nothing", I'd go with "nothing".
I have been using something similar (but much simpler) to hypothesis.strategies for a long time in my project. For example, instead of using

    received_data = b'binary-data'

I have something like this, where 'mk' is a singleton with all kinds of data generators:

    received_data = mk.binary()

The `binary` generator will make sure to return a value that will fail UTF-8 decoding. This helped me a lot to detect accidental handling of binary data as text.
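A minimal sketch of what such a helper might look like ('mk' and 'binary' are the names from the description above; the implementation here is a guess, not the actual code):

    import os


    class _Maker:
        def binary(self, size=8):
            # 0xFF can never appear in valid UTF-8, so prepending it
            # guarantees the returned bytes fail UTF-8 decoding.
            return b"\xff" + os.urandom(size)


    mk = _Maker()

    received_data = mk.binary()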
So, I think the following steps/policies would address all of the concerns described here:
Restrict Hypothesis usage to Twisted's own test suite (i.e., it is an implementation detail of a private part of Twisted, not the basis of any public interface)
+1
Turn off Hypothesis' deadline feature, probably everywhere
Maybe put a timeout of 15 seconds. If it takes more than 15 seconds to generate a test value, then something might be wrong with the generation code. This should help detect cases in which pathologically slow value-generation code has been written.
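If such a cap went through the same Hypothesis settings mechanism, it could look something like this (a sketch only; the profile name is made up):

    from datetime import timedelta

    from hypothesis import settings

    # Raise the per-example deadline to 15 seconds instead of disabling it.
    settings.register_profile("generous-deadline", deadline=timedelta(seconds=15))
    settings.load_profile("generous-deadline")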
At least for normal CI, and maybe for development environments, turn off Hypothesis' randomization feature.
Agree.
Perhaps set up a scheduled CI job that runs with randomization enabled.
We can have a scheduled test run on trunk with randomization... but at this point I don't know who will have the time to follow the results of those tests. A scheduled test run on trunk can also help detect test failures that are generated by external dependencies.
At least initially - and maybe indefinitely - avoid the stateful testing functionality.
Agree. We can take it easy at first :)

---------------
A
Jean-Paul
-- Adi Roiban
On Fri, Aug 26, 2022 at 10:52 AM Jean-Paul Calderone < exarkun@twistedmatrix.com> wrote:
Hello all,
I'd like to start a discussion (but hopefully a brief one!) about using property testing techniques in Twisted.
The discussion seems to have petered out. :)

On the basis of at least a lukewarm reception to this idea, I've gone ahead and merged a branch that adds a Hypothesis dependency to the Trial test suite. Hypothesis is used to test some code that itself supports the testing of Trial - so this is very internal to Twisted; if we need to replace the tests with "normal" tests then I don't see how there could be any impact on any public Twisted APIs.

I also filed https://github.com/twisted/twisted/issues/11649 for the follow-up work to configure Hypothesis on CI to avoid non-deterministic failures in case there are any bugs in the code being tested by Hypothesis (or in those tests themselves).

Jean-Paul
On Wed, Sep 7, 2022 at 7:37 PM Jean-Paul Calderone < exarkun@twistedmatrix.com> wrote:
On Fri, Aug 26, 2022 at 10:52 AM Jean-Paul Calderone < exarkun@twistedmatrix.com> wrote:
Hello all,
I'd like to start a discussion (but hopefully a brief one!) about using property testing techniques in Twisted.
The discussion seems to have petered out. :)

On the basis of at least a lukewarm reception to this idea, I've gone ahead and merged a branch that adds a Hypothesis dependency to the Trial test suite. Hypothesis is used to test some code that itself supports the testing of Trial - so this is very internal to Twisted; if we need to replace the tests with "normal" tests then I don't see how there could be any impact on any public Twisted APIs.
I also filed https://github.com/twisted/twisted/issues/11649 for the follow-up work to configure Hypothesis on CI to avoid non-deterministic failures in case there are any bugs in the code being tested by Hypothesis (or in those tests themselves).
Just a note that there is now a PR implementing this follow-up work waiting for review: https://github.com/twisted/twisted/pull/11672 Jean-Paul