how to test when correct output is "random"?

Steven Taschuk staschuk at telusplanet.net
Sat Apr 5 13:52:02 EST 2003


Quoth Jeff Epler:
  [...]
> But other parts are based on randomness.  "Alternation rules" are the
> foundation on which this is built.  An alternation rule might read
> like this:
>     S: "a" | "b" | "c";
> Then, str(S) would yield one of the alternatives.  Over many runs, it
> should yield each choice about an equal number of times.  So how do
> I test this?  Run a large number of trials and do a statistical test?
> This doesn't seem to fit in with the philosophy of software testing,
> which stresses exact repeatability.

I've encountered this problem too, trying to write unit tests for
genetic algorithms.  I haven't found an entirely satisfactory
solution, but here are my thoughts on it so far.

1. Separate the deterministic behaviour from the random behaviour.

I really like this approach, where possible.

For example, above you could have
	>>> possibilitiesFor('S') # deterministic
	['a', 'b', 'c']
	>>> expand('S') # random
	'c'
	>>> expand('S')
	'c'
	>>> expand('S')
	'a'
Testing possibilitiesFor is straightforward.  And in this case,
expand is extremely simple:
	def expand(symbol):
		return random.choice(possibilitiesFor(symbol))
This is, imho, too simple to break; no tests required.  If you
*did* write a test for it, you'd really be testing random.choice,
which is not a worthwhile activity.
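As a sketch of that split (the GRAMMAR dict and function names here
are my assumptions, mirroring the interactive session above, not
necessarily the actual program):

```python
import random

GRAMMAR = {'S': ['a', 'b', 'c']}   # hypothetical grammar table

def possibilitiesFor(symbol):
    """Deterministic half: list the alternatives for a symbol."""
    return GRAMMAR[symbol]

def expand(symbol):
    """Random half: pick one alternative."""
    return random.choice(possibilitiesFor(symbol))

# The deterministic half admits exact, repeatable assertions:
assert possibilitiesFor('S') == ['a', 'b', 'c']
# The random half gets, at most, a weak membership check:
assert expand('S') in possibilitiesFor('S')
```

All the interesting logic (the grammar lookup) lives in the
deterministic half, where ordinary testing works.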

2. Mock the pseudo-random number generator.

I see that others have suggested this, both in the form of
supplying your own PRNG, and in the form of calling
random.setstate.  I don't much like this approach; in any but the
simplest cases (which generally don't need to be tested), it
couples the unit test to the exact use of random numbers in the
code under test.  That is, such tests test not only that the code
is functionally correct, but also that it uses the random numbers
in a specific way.  But why would you want to test that?

Consider two implementations of this function:
	def permute(seq):
		"""Returns a copy of seq, randomly permuted."""
		# ...
One implementation builds up the result by randomly selecting,
without replacement, elements from seq, until seq is exhausted.
The other implementation picks a single random integer in
[0..len(seq)!) and applies the permutation corresponding to that
number (in some ordering of permutations).  These implementations
are functionally equivalent, but they make very different use of
the PRNG.  If I mock the PRNG, changing from one implementation to
the other will probably break my unit tests.
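To make the coupling concrete, here is one sketch of the two
implementations (using nothing beyond the standard random module;
the decoding uses the factorial number system).  A test that scripts
the PRNG's return values for one implementation is meaningless for
the other:

```python
import math
import random

def permute_by_selection(seq):
    """Build the result by repeatedly removing a random element."""
    pool = list(seq)
    result = []
    while pool:
        result.append(pool.pop(random.randrange(len(pool))))
    return result

def decode_permutation(n, seq):
    """Map an integer n in [0, len(seq)!) to a permutation of seq."""
    pool = list(seq)
    result = []
    while pool:
        n, i = divmod(n, len(pool))
        result.append(pool.pop(i))
    return result

def permute_by_index(seq):
    """Draw one random integer and decode it as a permutation."""
    return decode_permutation(random.randrange(math.factorial(len(seq))), seq)

# A mocked-PRNG test written against the first implementation:
scripted = iter([2, 0, 0])
real_randrange, random.randrange = random.randrange, lambda stop: next(scripted)
try:
    assert permute_by_selection('abc') == ['c', 'a', 'b']
finally:
    random.randrange = real_randrange
# The same script makes no sense for permute_by_index, which calls
# randrange exactly once, with stop == 6 -- the test would break.
```

Both functions produce uniformly distributed permutations, yet the
mocked test passes for only one of them.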

3. Do statistical tests on the output.

If you're worried that your function's output is not "random
enough", then do statistical tests.  I don't think there's any way
around this, short of proving your algorithm correct; the concern
in such cases must be that the function makes such complicated use
of the PRNG that it is difficult to be sure its output has a
desirable distribution.  (If it makes simple use of the PRNG,
testing the output distribution is testing the underlying PRNG,
which is, as noted above, undesirable.)

The annoying thing about statistical tests is that they can
generate false negatives: a correct implementation will
occasionally fail the test purely by chance.  But if you know more
about statistics than I do, you can choose a significance level
that makes this possibility extremely unlikely.
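For the alternation rule above, a chi-squared goodness-of-fit test
is one such sketch (expand here is a hypothetical stand-in for the
real function; 13.82 is the chi-squared critical value for 2
degrees of freedom at the 0.001 level, so a uniform expand fails
only about once in a thousand test runs):

```python
import random

GRAMMAR = {'S': ['a', 'b', 'c']}   # hypothetical stand-in grammar

def expand(symbol):
    """Stand-in for the program's random expand()."""
    return random.choice(GRAMMAR[symbol])

trials = 3000
counts = {}
for _ in range(trials):
    result = expand('S')
    counts[result] = counts.get(result, 0) + 1

expected = trials / 3.0
chi2 = sum((counts.get(x, 0) - expected) ** 2 / expected for x in 'abc')

# Fails about 0.1% of the time even for a correct implementation;
# report the counts so a failure can be investigated.
assert chi2 < 13.82, (chi2, counts)
```

Loosening the critical value trades statistical power for fewer
spurious failures; that trade-off is unavoidable here.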

4. Do sanity checks on the output.

Run the random function n times and do sanity checks on the
output.  For the permute function, for example, you could verify
that its output actually is a permutation of its input, say, by
sorting both sequences and comparing them.  For your Dadaist
program, you could verify that the output actually matches the
production.

(Under this approach, it is desirable to record the state of the
PRNG before running the test, and to report that state as part of
the assertion-failed message.  Thus the troubleshooter can
duplicate failures deterministically.)
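A sketch of such a sanity-check loop for permute (with
random.shuffle standing in for the implementation under test),
recording the PRNG state so a failure can be replayed:

```python
import random

def permute(seq):
    """Stand-in for the implementation under test."""
    result = list(seq)
    random.shuffle(result)
    return result

data = [3, 1, 4, 1, 5, 9]
for trial in range(100):
    state = random.getstate()   # record so a failure can be replayed
    out = permute(data)
    # Sanity check: the output is some permutation of the input.
    # On failure the recorded state is reported; calling
    # random.setstate(state) then reproduces the failing run.
    assert sorted(out) == sorted(data), (out, state)
```

The check is weak -- it cannot detect a biased shuffle -- but it is
exact, and each run that passes adds a little confidence.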

Such tests are not strictly speaking reliable, in the sense that
they can generate false positives: a buggy implementation can
still pass them.  But over time, with many successful runs, they
increase your confidence in the correctness of the program, which
is the point of testing in the first place.

-- 
Steven Taschuk                           staschuk at telusplanet.net
"I'm always serious, never more so than when I'm being flippant."
                            -- _Look to Windward_, Iain M. Banks
