[Python-Dev] Signals, threads, blocking C functions
Nick Maclaren
nmm1 at cus.cam.ac.uk
Tue Sep 5 11:07:12 CEST 2006
"Adam Olsen" <rhamph at gmail.com> wrote:
> On 9/4/06, Gustavo Carneiro <gjcarneiro at gmail.com> wrote:
>
> > Now, we've had this API for a long time already (at least 2.5
> > years). I'm pretty sure it works well enough on most *nix systems.
> > Even if it works 99% of the times, it's way better than *failing*
> > *100%* of the times, which is what happens now with Python.
>
> Failing 99% of the time is as bad as failing 100% of the time, if your
> goal is to eliminate the short timeout on poll(). 1% is quite a lot,
> and it would probably have an annoying tendency to trigger repeatedly
> when the user does certain things (not reproducible by you of course).

That can make it a lot WORSE than repeated failure. At least with hard
failures, you have some hope of tracking them down in a reasonable time.

The problem with exception handling code that goes off very rarely,
under non-reproducible circumstances, is that it is almost untestable
and that bugs in it are positive nightmares. I have been inflicted
with quite a large number in my time, and have a fairly good success
rate, but the number of people who know the tricks is decreasing.

Consider the (real) case where an unpredictable process on a large
server (64 CPUs) was failing about twice a week (detectably), with
no indication of how many failures were giving wrong answers. We
replaced dozens of DIMMs, took days of down time and got nowhere;
it then went hard (i.e. one failure a day). After a week's total
down time, with me spending 100% of my time on it and the vendor
allocating an expert at high priority, we cracked it. We were very
lucky to find it so fast.

I could give you other examples that were/are there years and decades
later, because the pain threshold never got high enough to dedicate
the time (and the VERY few people with experience). I know of at
least one such problem in generic TCP/IP (i.e. on Linux, IRIX,
AIX and possibly Solaris) that has been there for decades and causes
occasional failure in most networked applications/protocols.

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email: nmm1 at cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679