broken pipes, import errors (LONG, INVOLVED)

Fri Mar 8 19:13:54 EST 2002

At work, I'm seeing an "interesting" problem in my production
environment, or maybe two problems. It's one of those problems that
crops up only irregularly, and when it does, its symptoms are somewhat
changeable. I'm hoping someone might recognize the symptoms and be
able to provide a clue about what we're seeing, or at least advice
toward resolution.

An Admission of Sorts
---------------------
  I should confess near the start that we're using Python 1.5.2 (slightly
  patched in the direction of 1.6) and a version of Zope that's so old that
  most people don't even like to think about it any more (2.1.3 for those
  that do). These run on top of security-hardened RedHat 6.2 boxes. So I
  recognize that the systems involved are not exactly up to date.

  On the other hand, the barriers to upgrading any of these are pretty
  substantial at a technical level. Consequently, upgrading will be pretty
  costly in terms of pure labor, regression testing, etc., and it's been
  slated for an off-peak season. Without a more substantial business case
  for why we need to upgrade than "because it's likely to be more stable",
  we need to muddle through as best we can.

  So with the confession out of the way, let me describe the situation.

Strange OSError
---------------
  Occassionally, the Medusa core of zope hiccups. The way it looks to me,
  the FCGI server has one end of a "trigger" object that contains a pipe.
  The asynccore engine behind Medusa has the other end of that pipe, and
  when the FCGI server "pulls the trigger", Medusa will wake up out of its
  select loop and deal with the data in its inbox.

  Well, every once in a while, when the FCGI server goes to "pull the
  trigger", an OSError: [Errno 32] Broken pipe. Now I think that means that
  the asynccore end of the pipe has gone and closed the pipe on us, but
  none of us can figure out why it would have done something that cruel.

  After this occurs, Zope is pretty much unusable, and must be restarted.
  This would be bad enough for most people, but it gets worse.

Bizarre Import Behavior
-----------------------
  Sometimes, when we get the strange OSError, even more Bad Things happen
  after we restart Zope. In particular, we have what we call the "import
  bug" (you know it's bad when you've got a name for it).

  We're pretty sure -- as in, all existing records indicate -- that the
  "import bug" only shows up on the heels of one of the broken pipes. That
  is, we get a broken pipe error from the FCGI server, we restart Zope,
  then we get this.

  "This" is extremely curious. In one of our Zope products, we have a
  sequence of logic that looks something like this:

  > statement = "import module.submodule.SubSubModule"
  > object_nm = "module.submodule.SubSubModule.Object"
  >
  > exec statement
  > object = eval(object_nm)

  Obviously, the names have been changed to protect the innocent and reduce
  the complications.

  When Zope starts back up, the app will execute this block of statements
  one or more times for several different values of "statement" and
  "object_nm".

  Sometimes, though, it fails. The way it fails is both unusual and a
  little unnerving.

  The first three lines execute flawlessly. No problem assigning local
  variables, no problem running the exec. But when we get to the fourth
  line, the eval fails with an AttributeError: SubSubModule not found.

  Now, this is a little strange, because we've just exec'ed the statement
  that imports it not a split second before. Further, we ARE finding
  "submodule", which would not be imported if it weren't for the exec'ing
  of statement.

  So we tried to do some tests to see what was really happening.

Things We've Tried
------------------
  It was clear to me pretty early on that we could reproduce the perceived
  behavior by running a script like the following:

  > import module.submodule.SubSubModule
  >
  > delattr(module.submodule, "SubSubModule")
  >
  > # this does nothing, really
  > import module.submodule.SubSubModule
  >
  > print eval("module.submodule.SubSubModule.Object")

  So the question became: who's importing and deleting our module?

  Well, it's not that simple. It turns out that three things that work in
  this scenario, don't in ours. Those three things are:

  - If you look in os.modules before the second import (or any subsequent
    imports), you'll see the cached module. Well, our exec'ed import was
    the first import of every module that failed.

  - If you wrap the "eval" in a try block, and reload() the module on
    failure, everything gets put back in place. In our case, though, a
    reload() appeared to work, just like the original import, but the
    module contained only a fraction of the names it was supposed to.

  - If you tear it down and stepwise import a piece at a time from the
    original files, it should work just like a reload(). For us, though,
    it neither works, nor fails silently like the reload() -- now it fails
    to import a name (call it AnotherObject) from yet another module
    (module.anothersub.YASSModule). The failure occurs during a
    load_source() call.

So. Has anyone seen anything like this before? Anyone have any ideas?

-- 
Geoff Gerrietts <geoff at gerrietts dot net>
-rw-rw-rw-:         permissions of the beast