[spambayes-dev] Strange performance dip and DBRunRecoveryError retreat
Tim Peters
tim.one at comcast.net
Thu Jan 1 20:18:39 EST 2004
[Richie Hindle]
> As part of trying to reproduce the DBRunRecoveryError problems (a task
> that I'm giving up on for now - see below) I've written a script to
> hammer the core SpamBayes code, repeatedly training and classifying
> using faked-up messages. It manages about 40 train-and-classify
> loops per second on my 2.4GHz P4, *except* between about 100 and 400
> messages, when the performance drops to about a tenth of that and
> then recovers.
>
> I've done enough investigation to know that the time is being spent
> in the core SpamBayes code and not my script,
Is that a true dichotomy? That is, do you know, for example, that the time
is being spent in the core spambayes code as distinct from the Berkeley
database library, or distinct from random network traffic other programs are
engaging in? Or is it that you just know it's not in your script, and you
divide the universe into "my script" and "the core SpamBayes code" here?
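A profile would settle it. Something along these lines, say (just a sketch;
hammer_once here is a made-up stand-in for one train-and-classify iteration
of hammer.py):

    import cProfile, pstats

    def hammer_once():
        # stand-in for one train-and-classify iteration from
        # testtools/hammer.py; substitute the real thing
        pass

    prof = cProfile.Profile()
    for i in range(500):
        prof.runcall(hammer_once)

    # sorting by cumulative time separates time spent inside the
    # bsddb layer from time spent in the classifier proper
    pstats.Stats(prof).sort_stats("cumulative").print_stats(20)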
> that it's only the occasional message that takes a long time (around
> a second in a few cases) and that it can be either training or
> classifying that slows down.
>
> I've committed the script as testtools/hammer.py, and I offer this as
> a curiosity to anyone interested. I'm not going to pursue this myself
> because I've never seen a similar complaint about real-world SpamBayes
> use.
Well, Python certainly doesn't make any real-time guarantees, and I doubt
Sleepycat, or even your OS, do either. So long as it recovers, I don't
think there's anything worth investigating. It could be Python resizing a
large dict, or an all-generations garbage collection cycle, or Sleepycat
rearranging its memory allocation, or the OS rearranging swap space, ...,
there's just no limit on what it *could* be.
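If anyone cares to chase it anyway, the cheapest experiment is to see whether
the stalls line up with cyclic-gc runs. A sketch (hammer_once again standing
in for one loop iteration):

    import gc, time

    gc.set_debug(gc.DEBUG_STATS)   # report a summary at each collection

    def hammer_once():
        pass                       # stand-in for one train-and-classify round

    for i in range(1000):
        start = time.time()
        hammer_once()
        elapsed = time.time() - start
        if elapsed > 0.1:          # flag the suspiciously slow iterations
            print(i, elapsed)

    # or just turn the collector off and see whether the dip goes away:
    # gc.disable()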
> ....
> I don't think the script is going to be a lot of use in tracking down
> DBRunRecoveryErrors - it *will* reproduce them as it is, but only by
> mimicking a bug that was fixed in 1.0a6, and people have still been
> complaining about DBRunRecoveryErrors in 1.0a6 and 1.0a7.
Thanks for the effort! Maybe somebody else can complicate it now in a way
that does provoke DBRunRecoveryErrors. It's never what you expect <wink>.
> Having read up on full-mode bsddb, and bsddb-backed ZODB (including
> the phrases "The underlying Berkeley database technology requires
> maintenance, careful system resource planning, and tuning for
> performance." and "BerkeleyDB never deletes "old" log files.
> Eventually, if you do not maintain your Berkeley database by deleting
> "old" log files, you will run out of disk space") I've given up - for
> the moment at least - on trying to use full-mode bsddb (with or
> without ZODB).
That's par for the course for "a real" database. Even plain
FileStorage-backed ZODB requires ongoing maintenance, including periodic
"packing" to prevent unbounded growth, and religiously observed backups.
It's all this extra hair that makes a real database robust against most of
the things that can go wrong. But also for that reason, it's unusual to see
"a real database" solution in consumer-grade applications.
We could write our own database specialized to our project's specific needs,
and probably get that working faster and better than any general-purpose
beast. But my interest in that was fully satisfied by pickling a giant dict
<wink>.
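And "pickling a giant dict" really is about the whole story. The shape of it
(a sketch, not the actual sb_server code; the dict and filename are invented):

    import pickle

    # stand-in for the trained classifier's state
    wordinfo = {"viagra": (0, 42), "python": (17, 0)}

    # dump the whole thing in one gulp...
    with open("hammie.db", "wb") as f:
        pickle.dump(wordinfo, f, pickle.HIGHEST_PROTOCOL)

    # ...and load it all back at startup
    with open("hammie.db", "rb") as f:
        wordinfo = pickle.load(f)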
> sb_server users should use a pickle and be done with it.
I've been saying that for a decade <wink>. Before you get too sick of it,
you might also want to investigate Neil Schemenauer's adaptation of
spambayes for cdb. cdb is an efficient and essentially worry-free
disk-based database. It buys this at the cost of *not* being incrementally
updatable: you can replace the whole thing atomically, in one giant gulp,
but that's it.
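The "one giant gulp" update is also easy to fake without cdb itself: build
the replacement database in a scratch file, then rename it over the old one
(atomic on POSIX filesystems). Roughly (this isn't the real cdb bindings,
whose API I won't guess at here):

    import os, pickle

    def rebuild(path, data):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            # write the complete replacement database first...
            pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())
        # ...then swap it into place; readers see the old file or the
        # new one, never a half-written hybrid (POSIX rename is atomic)
        os.rename(tmp, path)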
If you don't need incremental training with instantly-visible effects, I bet
it's an excellent approach. There are no worries about synchronizing
concurrent reads and writes, simply because there are no writes. Looks like
there *are* worries about concurrent reads, though:
http://cr.yp.to/cdb/reading.html
Beware that these functions may rely on non-atomic operations
on the fd ofile, such as seeking to a particular position and
then reading. Do not attempt two simultaneous database reads
using a single ofile.
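The standard dodge for that is to give every reader its own open file, so no
two threads share a seek position; for example (the plain file object here is
just standing in for a real cdb reader):

    import threading

    _per_thread = threading.local()

    def get_reader(path="bayes.cdb"):
        # one private file object (hence one private seek position)
        # per thread, so simultaneous reads can't collide
        if not hasattr(_per_thread, "db"):
            _per_thread.db = open(path, "rb")
        return _per_thread.db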
Robust all-purpose database implementation is damned hard.
> Maybe we should change the default. Maybe it's five to two and I
> should be in bed.
It's a pit, isn't it? If it's any consolation, even Unix mboxes get
corrupted, and nothing is simpler than "append at the end".