[spambayes-dev] Stats are very slow

Kenny Pitt kenny.pitt at gmail.com
Wed Dec 20 21:32:22 CET 2006


On 12/20/06, skip at pobox.com <skip at pobox.com> wrote:
>    Mark> Specifically, this line in addin.py is the culprit:
>
>    Mark>         self.stats = bayes_stats.Stats(bayes_options,
>    Mark>                                        self.classifier_data.message_db)
[...]
> It might be worth deferring that call until it's really needed (say, in
> GetStats()).

The Stats object tracks two types of statistics: statistics for the
current session and total statistics across all Outlook sessions. The
total statistics are calculated as the value of the persistent
statistics plus the accumulated statistics for the current session.
The persistent statistics need to be totalled up before we start
accumulating anything into the session statistics using the
RecordClassification or RecordTraining methods. Otherwise, session
stats accumulated up to the point where the persistent stats are
calculated will be counted twice: once in the totals read from the
message db and once in the session stats.

We can probably still defer the call if we are smart about the
relationship between the persistent stats and session stats. At
whatever point we actually calculate the value of the persistent
stats, we need to be aware that the session statistics accumulated up
to that point are already included in the message db and subtract
those values from the persistent statistics values.
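
A minimal sketch of that adjustment, not the actual bayes_stats code.
LazyStats, record_classification and get_totals are hypothetical
names, and message_db is assumed to be a plain list holding one
classification string per stored message:

```python
KINDS = ("ham", "spam", "unsure")


class LazyStats:
    """Hypothetical stand-in for bayes_stats.Stats with a deferred scan."""

    def __init__(self, message_db):
        # Assumption: message_db is a list of classification strings.
        self.message_db = message_db
        self.session = dict.fromkeys(KINDS, 0)
        self.persistent = None  # not calculated until first needed

    def record_classification(self, kind):
        # Assumption: the caller has already stored the message in
        # message_db, so it will show up in any later scan.
        self.session[kind] += 1

    def _load_persistent(self):
        totals = dict.fromkeys(KINDS, 0)
        for kind in self.message_db:
            totals[kind] += 1
        # Session counts recorded so far are already in the db, so
        # subtract them to avoid counting those messages twice.
        self.persistent = {k: totals[k] - self.session[k] for k in KINDS}

    def get_totals(self):
        if self.persistent is None:
            self._load_persistent()  # the expensive scan happens here
        return {k: self.persistent[k] + self.session[k] for k in KINDS}


db = ["spam", "ham"]            # two messages from earlier sessions
stats = LazyStats(db)           # cheap: nothing is scanned yet
db.append("spam")               # a message classified in this session
stats.record_classification("spam")
print(stats.get_totals())       # spam is counted once per message
```

Without the subtraction in _load_persistent, the new spam message
would be counted once from the db scan and once from the session
counts.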

Of course, this only solves part of the problem, because we would
still take the same hit the first time the statistics are displayed.
It might be worth
considering an optimization to store the actual statistics values
instead of calculating them at the start of every Outlook session. The
reason the stats are calculated from the message db is so that the
user can reset the starting date for the statistics and still get
accurate results. We could recalculate the persistent statistics only
when the user changes the start date for the statistics, and store the
summary values as a separate record in the message db or in a separate
statistics db file.
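
One way the stored summary could work, as a rough sketch under stated
assumptions. SummaryCache and its methods are hypothetical, store is
any dict-like persistent object (a shelve file would do), and
message_db is assumed to be a list of (received_date, kind) pairs:

```python
import datetime

KINDS = ("ham", "spam", "unsure")


class SummaryCache:
    """Hypothetical cache of summary totals, keyed by the start date."""

    def __init__(self, store, message_db):
        self.store = store            # assumption: dict-like, persistent
        self.message_db = message_db  # assumption: list of (date, kind)
        self.scans = 0                # counts full scans, for illustration

    def _scan(self, start_date):
        self.scans += 1
        totals = dict.fromkeys(KINDS, 0)
        for received, kind in self.message_db:
            if received >= start_date:
                totals[kind] += 1
        return totals

    def totals(self, start_date):
        cached = self.store.get("summary")
        # Rescan only if nothing is stored or the start date changed.
        if cached is None or cached["start_date"] != start_date:
            cached = {"start_date": start_date,
                      "totals": self._scan(start_date)}
            self.store["summary"] = cached
        return dict(cached["totals"])

    def record(self, received, kind):
        self.message_db.append((received, kind))
        cached = self.store.get("summary")
        if cached is not None and received >= cached["start_date"]:
            cached["totals"][kind] += 1
            self.store["summary"] = cached  # write back for shelve-like stores


db = [(datetime.date(2006, 12, 1), "spam"),
      (datetime.date(2006, 12, 10), "ham")]
cache = SummaryCache({}, db)
start = datetime.date(2006, 12, 1)
cache.totals(start)   # first call scans the message db
cache.totals(start)   # second call is served from the stored summary
print(cache.scans)
```

The stored totals have to be kept current as new messages are
recorded, which is what record does here; otherwise the summary would
go stale between rescans.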

I've been incredibly swamped lately with the work that pays the bills,
but I'll try to find some time over the holidays to take a look at
this.

-- 
Kenny Pitt

