status of Programming by Contract (PEP 316)?

Alex Martelli aleax at mac.com
Fri Aug 31 11:56:19 EDT 2007


Paul Rubin <http://phr.cx@NOSPAM.invalid> wrote:
   ...
> Hi Alex, I'm a little confused: does Production Systems mean stuff
> like the Google search engine, which (as you described further up in
> your message) achieves its reliability at least partly by massive
> redundancy and failover when something breaks?

The infrastructure supporting that engine (and other things), yes.

>  In that case why is it
> so important that the software be highly reliable?  Is a software

Think "common-mode failures": if a program has a bug, so do all
identical copies of that program.  Redundancy works for cheap hardware
because the latter's "unreliability" is essentially free of common-mode
failures (when properly deployed): it wouldn't work against a design
mistake in the hardware units.  Think of the famous Pentium division
bug: no matter how many redundant but identical such buggy CPUs you
place in parallel to compute divisions, in the error cases they'll all
produce the same wrong results.  Software bugs generally work (or,
rather, fail to work;-) similarly to hardware design bugs.
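
To make the common-mode point concrete, here's a tiny sketch (purely
illustrative, nothing to do with Google's actual code): a toy "divide"
unit with a deterministic design bug, versus units that glitch
independently at random -- majority voting over redundant copies masks
only the latter kind of fault.

    import random
    from collections import Counter

    def buggy_divide(a, b):
        # stand-in for a unit with a deterministic design bug (FDIV-style)
        if (a, b) == (10, 2):
            return 4          # the "bug": one specific input is mishandled
        return a // b

    def flaky_divide(a, b):
        # stand-in for cheap hardware: occasionally wrong, but independently so
        if random.random() < 0.01:
            return a // b + 1
        return a // b

    def vote(units, a, b):
        # run every redundant unit and return the majority answer
        return Counter(u(a, b) for u in units).most_common(1)[0][0]

    print(vote([buggy_divide] * 3, 10, 2))   # always 4: redundancy can't help
    print(vote([flaky_divide] * 3, 10, 2))   # 5 with very high probability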

There are (for both hw and sw) also classes of mistakes that don't quite
behave that way -- "occasional glitches" that are not necessarily
repeatable and are heavily state-dependent ("race conditions" in buggy
multitasking SW, for example; and many more examples for HW, where flaky
behavior may be triggered by, say, temperature situations).  Here, from
a systems viewpoint, you might have a system that _usually_ says that
10/2 is 5, but once in a while says it's 4 instead (as opposed to the
"Pentium division bug" case where it would always say 4) -- this is much
more likely to be caused by flaky HW, but might possibly be caused by
the SW running on it (or the microcode in between -- not that it makes
much of a difference one way or another from a systems viewpoint).
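
For a concrete (and entirely hypothetical) example of such a
state-dependent glitch on the SW side, consider the classic
unsynchronized-counter race below: the result varies from run to run
depending on thread scheduling, unlike the always-wrong deterministic
bug sketched above.

    import threading

    counter = 0

    def worker(n):
        global counter
        for _ in range(n):
            current = counter        # read...
            counter = current + 1    # ...then write; a thread switch in
                                     # between silently loses updates

    threads = [threading.Thread(target=worker, args=(100000,))
               for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()

    # expected 400000; often less, and by a different amount each run
    print(counter)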

Catching such issues can, again, benefit from redundancy (and
monitoring, "watchdog" systems, health and sanity checks running in the
background, &c).  "Quis custodiet custodes" is an interesting problem
here, since bugs or flakiness in the monitoring/watchdog infrastructure
have the potential to do substantial global harm; one approach is to
invest in giving that infrastructure an order of magnitude more
reliability than the systems it's overseeing (for example by using more
massive and *simple* redundancy, and extremely straightforward
architectures).  There's ample literature on the matter, but it
absolutely needs a *systems* approach: focusing just on the HW, just on
the SW, or just on the microcode in-between;-), just can't help much.
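
As an illustration only (names, paths and thresholds are made up, and
real monitoring infrastructure is of course far more elaborate), here's
roughly what such a deliberately simple watchdog might look like: it
restarts a worker whenever the worker dies or its heartbeat goes stale.

    import subprocess
    import time

    HEARTBEAT_FILE = "/tmp/worker.heartbeat"  # worker periodically writes
                                              # time.time() here (hypothetical)
    STALE_AFTER = 30                          # seconds without a heartbeat

    def heartbeat_age():
        try:
            return time.time() - float(open(HEARTBEAT_FILE).read())
        except (OSError, ValueError):
            return float("inf")               # missing/garbled counts as stale

    def supervise(cmd):
        worker = subprocess.Popen(cmd)
        while True:
            time.sleep(5)
            if worker.poll() is not None or heartbeat_age() > STALE_AFTER:
                worker.kill()
                worker.wait()
                worker = subprocess.Popen(cmd)   # restart, keep watching

    # supervise(["python", "worker.py"])

The whole point being that the supervisor stays an order of magnitude
simpler than whatever it supervises.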

> some good hits they should display) but the server is never actually
> down, can you still claim 100% uptime?

I've claimed nothing (since all such measurements and methodologies
would no doubt be considered confidential unless and until cleared for
publication -- this has been done for a few whitepapers about some
aspects of Google's systems, but never to the best of my knowledge for
the "metasystem" as a whole), but rather pointed to
<http://uptime.pingdom.com/site/month_summary/site_name/www.google.com>,
a publicly available site which does publish its methodology (at
<http://uptime.pingdom.com/general/methodology>); summarizing, as they
have no way to check that the results are "right" for the many sites
they keep an eye on, they rely on the HTTP result codes (as well as
validity of HTTP headers returned, and of course whether the site does
return a response at all).
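
In other words, such a monitor can only do a black-box probe along
these lines (a sketch in modern-Python spelling, not pingdom's actual
code): check that a response comes back at all, with a plausible status
code and headers, with no way at all to judge whether the search
results themselves are any good.

    import urllib.request
    import urllib.error

    def probe(url, timeout=10):
        try:
            resp = urllib.request.urlopen(url, timeout=timeout)
        except urllib.error.HTTPError as e:
            return False, "HTTP %d" % e.code      # answered, but with an error status
        except (urllib.error.URLError, OSError) as e:
            return False, "no response (%s)" % e  # no usable answer at all
        ok = (200 <= resp.status < 400
              and resp.headers.get("Content-Type") is not None)
        return ok, "HTTP %d" % resp.status

    print(probe("http://www.google.com/"))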

> problem.  Of course then there's a second level system to manage the
> restarts that has to be very reliable, but it doesn't have to deal
> with much weird concocted input the way that a public-facing internet
> application has to.

Indeed, Production Systems' software does *not* "have to deal" with
input from the general public -- it's infrastructure, not user-facing
applications (except in as much as the "users" are Google engineers or
operators, say).  IOW, it's *exactly* the code that "has to be very
reliable" (nice to see that we agree on this;-), and therefore, if, as
you then said, "Russ's point stands", it would NOT be in Python -- but it
is.  So, I disagree about the "standing" status of his so-called "point".
 
> Therefore I think Russ's point stands, that we're talking about a
> different sort of reliability in these highly redundant systems, than
> in the systems Russ is describing.

Russ specifically mentioned *mission-critical applications* as being
outside of Python's possibilities; yet search IS mission critical to
Google.  Yes, reliability is obtained via a "systems approach",
considering HW, microcode, SW, and yet other issues such as power
supplies, cooling units, network cables, etc, not as a single opaque big
box but as an articulated, extremely complex and large system that needs
testing, monitoring, watchdogging, etc, at many levels -- there is no
other real way to make systems reliable (you can't do it by just looking
at components in isolation).  Note that this does have costs and
therefore it needs to be deployed selectively, and the whole panoply may
be chosen to be arrayed only for what an organization considers to be its
truly mission-critical applications -- Google's "pillars" used to be
search and ads, but have more recently been officially declared to be
"Search, Ads and Apps" (cfr
<http://searchengineland.com/070511-092730.php> and links therefrom for
some discussion about this), which may have implications... but Python
remains at the core of many aspects of our reliability strategy.


Alex


