[Python-Dev] Signals, threads, blocking C functions

Nick Maclaren nmm1 at cus.cam.ac.uk
Tue Sep 5 11:07:12 CEST 2006


"Adam Olsen" <rhamph at gmail.com> wrote:
> On 9/4/06, Gustavo Carneiro <gjcarneiro at gmail.com> wrote:
> 
> >   Now, we've had this API for a long time already (at least 2.5
> > years).  I'm pretty sure it works well enough on most *nix systems.
> > Even if it works 99% of the time, it's way better than *failing*
> > *100%* of the time, which is what happens now with Python.
> 
> Failing 99% of the time is as bad as failing 100% of the time, if your
> goal is to eliminate the short timeout on poll().  1% is quite a lot,
> and it would probably have an annoying tendency to trigger repeatedly
> when the user does certain things (not reproducible by you of course).

That can make it a lot WORSE than repeated failure.  At least with hard
failures, you have some hope of tracking them down in a reasonable time.
The problem with exception handling code that goes off very rarely,
under non-reproducible circumstances, is that it is almost untestable
and that bugs in it are positive nightmares.  I have had quite a large
number inflicted on me in my time, and have a fairly good success
rate, but the number of people who know the tricks is decreasing.

Consider the (real) case where an unpredictable process on a large
server (64 CPUs) was failing about twice a week (detectably), with
no indication of how many failures were giving wrong answers.  We
replaced dozens of DIMMs, took days of down time and got nowhere;
it then went hard (i.e. one failure a day).  After a week's total
down time, with me spending 100% of my time on it and the vendor
allocating an expert at high priority, we cracked it.  We were very
lucky to find it so fast.

I could give you other examples that were, or still are, there years or
decades later, because the pain threshold never got high enough to dedicate
the time (and the VERY few people with experience).  I know of at
least one such problem in generic TCP/IP (i.e. on Linux, IRIX,
AIX and possibly Solaris) that has been there for decades and causes
occasional failure in most networked applications/protocols.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679

