"Strong typing vs. strong testing"

Peter Keller psilord at cs.wisc.edu
Thu Sep 30 12:14:01 EDT 2010


In comp.lang.lisp TheFlyingDutchman <zzbbaadd at aol.com> wrote:
> 
>>
>> More specifically, the claim made above:
>>
>> > in C I can have a function maximum(int a, int b) that will always
>> > work. Never blow up, and never give an invalid answer.
>>
>> is false. And it is not necessary to invoke the vagaries of run-time
>> input to demonstrate that it is false.
>>
> I don't think you demonstrated it is false. Any values larger than an
> int get truncated before they ever get to maximum. The problem does
> not lie with the maximum function. It correctly returns the maximum of
> whatever two integers it is provided. Calling it with values that are
> larger than an int, that get converted to an int _before_ maximum is
> called, is an issue outside of maximum.

After thinking for a bit, I believe I can demonstrate a situation where
maximum could indeed return the wrong answer, and not because it was
passed incorrect input.
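
For concreteness, the function being argued about is presumably something
along these lines (my own sketch; the exact definition isn't quoted in this
message). I've written it with an explicit branch so there is a visible gap
between the comparison and the return of the chosen value:

    int maximum(int a, int b)
    {
        if (a > b)
            return a;
        return b;
    }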

Suppose that, inside maximum, after the comparison has been made but
before the chosen value is returned, a signal handler gets invoked, walks
the stack, swaps the two values for a and b, and then returns back into
maximum. The function will then return the smaller of its two arguments.
Since control flow stayed entirely within maximum's control flow graph,
this would classify as a failure given your strict view. (As an aside,
one can do the same thing with a debugger.)
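
Here is a contrived, compilable sketch of that scenario (my own
construction, not anything from the thread). The globals pa and pb stand
in for a handler that walks the stack to find a and b, and the raise()
call stands in for an asynchronous signal (from a profiler, a
checkpointer, whatever) landing at exactly the wrong instruction:

    #include <signal.h>
    #include <stdio.h>

    static int *pa, *pb;          /* stand-in for walking maximum's frame */

    static void swap_handler(int sig)
    {
        (void)sig;
        int tmp = *pa;            /* swap a and b behind maximum's back */
        *pa = *pb;
        *pb = tmp;
    }

    int maximum(int a, int b)
    {
        pa = &a;
        pb = &b;
        if (a > b) {
            raise(SIGUSR1);       /* "signal arrives right here" */
            return a;             /* a now holds the smaller value */
        }
        return b;
    }

    int main(void)
    {
        signal(SIGUSR1, swap_handler);
        printf("maximum(5, 3) = %d\n", maximum(5, 3));  /* prints 3 */
        return 0;
    }

Strip out the scaffolding and the function body is exactly the trivial
maximum from above; nothing about its own logic changed.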

Blocking signals around the comparison and the assignment of the result
to a temporary variable that you will return won't fix it either. This is
because (in C) there must be a sequence point after the call that
unblocks the signals and before the temporary holding the result is
placed into the return register, and in that window another signal could
arrive and corrupt the result again. Depending upon the compiler's
optimization settings, it may or may not reorder the placement of the
value into the return register relative to the call that unblocks the
signals. Placing a result into a return register is not something C
defines at all and can happen anywhere; only the C statements you wrote
to express it must adhere to sequence-point ordering.
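
To make that concrete, here is roughly what the attempted fix looks like
using POSIX sigprocmask() (again, my own sketch). The hole is the window
between re-enabling signals and the moment the result actually lands in
the return register:

    #include <signal.h>

    int maximum_blocked(int a, int b)
    {
        sigset_t all, old;
        int result;

        sigfillset(&all);
        sigprocmask(SIG_BLOCK, &all, &old);    /* async signals held here... */

        result = (a > b) ? a : b;

        sigprocmask(SIG_SETMASK, &old, NULL);  /* ...until here */

        /* A signal delivered after the unblock but before 'result' is
         * moved into the return register can still scribble on the frame
         * that holds it. */
        return result;
    }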

Since the signal handler could do anything, including completely
replacing the text segments and/or loaded libraries of the code, or
moving the PC to an arbitrary place, I don't think you can "fix" this
problem. Ever.

If you think this is a pedantic case which never happens in practice:
I'm the maintainer of a well-known user-space checkpointing system, and
these types of problems have to be thought about deeply there because
they do happen.

In addition, there are other modes of error injection: in compute
clusters with very high-density memory that is not ECC-protected, you can
actually calculate the probability that a bit at some address in memory
will flip due to cosmic rays. That probability is disturbingly high.

Just an idle search online produced this article:

http://news.cnet.com/8301-30685_3-10370026-264.html

which mentions some statistics. Think 1 billion hours is a lot and
"it'll never happen"?

There are 8760 hours in a year. So, you'd only need 114,156 computers
in a cluster running for one year before amassing 1 billion hours
of computation. That isn't a big number of machines for large financial
companies, Google, and the like to own.

As a fun statistic, the BlueGene/P supercomputer can have 884,736
processors with associated memory modules. According to the math
in the article, one BlueGene/P should see a max of ~600,000 memory
errors per year.

Sure, you might not think any of this is a problem, because your
home desktop always produces the right answer when balancing your
checkbook, but it is a matter of perception of scale. Lots of large
clusters and data-movement houses go to great lengths to ensure data
integrity. Injecting wrong data 4 hours into a 6-month-long workflow
running on thousands of computers really upsets the hell out of people.

I've run into physicists who simply run their buggy software over
and over and over again on the same data and do statistical analysis
on the results. They've come to the realization that they can't
find/fix/verify all the bugs in their code, so they assume the bugs are
there and write systems which try to be mathematically robust to the
nature of the beast. It is cheaper to wait for 1000 runs of a program
to finish on a cluster than to spend human time debugging a difficult
bug in the code.
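
As a toy illustration of that strategy (my own sketch, nothing like their
real pipelines): run the same flaky computation several times and keep
whichever answer the majority of runs agree on.

    #include <stdio.h>

    /* Stand-in for a computation that is usually right but is
     * occasionally corrupted by a transient fault. */
    static int flaky_compute(int trial)
    {
        int correct = 42;
        return (trial == 3) ? correct + 1 : correct;   /* one bad run */
    }

    int main(void)
    {
        enum { RUNS = 5 };
        int results[RUNS];

        for (int i = 0; i < RUNS; i++)
            results[i] = flaky_compute(i);

        /* Majority vote: pick the value that appears most often. */
        int best = results[0], best_count = 0;
        for (int i = 0; i < RUNS; i++) {
            int count = 0;
            for (int j = 0; j < RUNS; j++)
                if (results[j] == results[i])
                    count++;
            if (count > best_count) {
                best_count = count;
                best = results[i];
            }
        }

        printf("majority answer: %d (%d of %d runs agree)\n",
               best, best_count, RUNS);
        return 0;
    }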

So: mathematically, maximum can't fail inside of itself; realistically,
while executing on a physical machine, you bet it'll fail. :)

-pete