# Probabilistic unit tests?

duncan smith buzzard at invalid.invalid
Fri Jan 11 19:05:05 CET 2013

```
On 11/01/13 01:59, Nick Mellor wrote:
> Hi,
>
> I've got a unit test that will usually succeed but sometimes fails. An occasional failure is expected and fine. It's failing all the time I want to test for.
>
> What I want to test is "on average, there are the same number of males and females in a sample, give or take 2%."
>
> Here's the unit test code:
> import unittest
> from collections import Counter
>
> sex_count = Counter()
> for contact in range(self.binary_check_sample_size):
>      p = get_record_as_dict()
>      sex_count[p['Sex']] += 1
> self.assertAlmostEqual(sex_count['male'],
>                         sex_count['female'],
>                         delta=self.binary_check_sample_size * 2.0 / 100.0)
>
> My question is: how would you run an identical test 5 times and pass the group *as a whole* if only one or two iterations passed the test? Something like:
>
>      for n in range(5):
>          # self.assertAlmostEqual(...)
>          # if test passed: break
>      else:
>          self.fail()
>
> (except that would create 5+1 tests as written!)
>
> Thanks for any thoughts,
>
> Best wishes,
>
> Nick
>

The appropriateness of "give or take 2%" will depend on sample size.
For example, if the proportion of males should be 0.5 and your sample
size is small enough, this will fail most of the time regardless of
whether the proportion is 0.5.

What you could do is perform a statistical test. Generally this involves
generating a p-value and rejecting the null hypothesis if the p-value is
below some chosen threshold (Type I error rate), often taken to be 0.05.
Here the null hypothesis would be that the underlying proportion of
males is 0.5.

A statistical test will incorrectly reject a true null in a proportion
of cases equal to the chosen Type I error rate. A test will also fail to
reject false nulls a certain proportion of the time (the Type II error
rate). The Type II error rate can be reduced by using larger samples. I
prefer to generate several samples and test whether the proportion of
failures is about equal to the error rate.

The above implies that p-values are uniformly distributed on [0, 1]
if the null hypothesis is true. So alternatively you could generate many
samples / p-values and test the p-values for uniformity. That is what I
generally do:

p_values = []
for _ in range(numtests):
    values = data generated from code to be tested
    p_values.append(stat_test(values))
test p_values for uniformity

The result is still a test that will fail a given proportion of the
time. You just have to live with that. Run your test suite several times
and check that no one test is "failing" too regularly (more often than
the chosen Type I error rate for the test of uniformity). My experience
is that any issues generally result in the test of uniformity being
consistently rejected (which is why I do that rather than just
performing a single test on a single generated data set).

In your case you're testing a Binomial proportion and as long as you're
generating enough data (you need to take into account any test
assumptions / approximations) the observed proportions will be
approximately normally distributed. Samples of e.g. 100 would be fine.
P-values can be generated from the appropriate normal
(http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval),
and uniformity can be tested using e.g. the Kolmogorov-Smirnov or
Anderson-Darling test
(http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).

I'd have thought that something like this also exists somewhere. How do
people usually test e.g. functions that generate random variates, or
other cases where deterministic tests don't cut it?

Duncan

```
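
The approach described in the message (generate many samples, turn each into a p-value, then test the p-values for uniformity) might be sketched as follows. This is only an illustration under stated assumptions, not the poster's actual code: the function names are hypothetical, the lambda passed in stands in for whatever wraps `get_record_as_dict()`, and the uniformity test is a hand-rolled Kolmogorov-Smirnov test using the asymptotic Kolmogorov series with a common small-sample correction, so it stays within the standard library. Because binomial counts are discrete, the p-values are only approximately uniform, which is why a fairly large sample size is assumed here.

```python
import math
import random


def two_sided_p_value(successes, n, p0=0.5):
    """Normal-approximation p-value for H0: true proportion == p0."""
    std_err = math.sqrt(p0 * (1.0 - p0) / n)
    z = (successes / n - p0) / std_err
    # 2 * (1 - Phi(|z|)), written via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


def ks_uniform_p_value(p_values):
    """Approximate one-sample KS test of p_values against Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    # Largest gap between the empirical CDF and the Uniform(0, 1) CDF.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    # Small-sample correction, then the asymptotic Kolmogorov series.
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series is numerically unstable here; p is ~1
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def uniformity_p_value(draw_is_male, sample_size=2000, num_tests=100):
    """Run num_tests binomial tests and test their p-values for uniformity."""
    p_values = []
    for _ in range(num_tests):
        males = sum(draw_is_male() for _ in range(sample_size))
        p_values.append(two_sided_p_value(males, sample_size))
    return ks_uniform_p_value(p_values)


# Hypothetical stand-ins for the code under test, seeded for repeatability.
rng = random.Random(2013)
fair = uniformity_p_value(lambda: rng.random() < 0.5)
biased = uniformity_p_value(lambda: rng.random() < 0.6)
```

In a unit test one would then fail only when the uniformity p-value drops below a chosen threshold, accepting that a correct generator will still trip it at roughly that rate. If SciPy is available, `scipy.stats.kstest(p_values, 'uniform')` could replace the hand-written KS function.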