Hi guys,

I'm playing a little more with integrating the outlook addin with Skip's ocrad code and making some progress. However, I noticed that while spambayes was taking much less than a second to load my (bsddb) databases, it takes nearly 30 seconds to load the stats! Specifically, this line in addin.py is the culprit:

    self.stats = bayes_stats.Stats(bayes_options,
                                   self.classifier_data.message_db)

I've not even looked inside that module yet, but that seems quite extreme, to the point I'm not sure the feature is worth that cost... I guess the code is reading each record of my message DB (which is 85MB) - but does anyone have any insights?

Cheers,

Mark
Mark> Specifically, this line in addin.py is the culprit:

Mark>   self.stats = bayes_stats.Stats(bayes_options,
Mark>                                  self.classifier_data.message_db)

Mark> I've not even looked inside that module yet, but that seems quite
Mark> extreme, to the point I'm not sure the feature is worth that
Mark> cost... I guess the code is reading each record of my message DB
Mark> (which is 85MB) - but does anyone have any insights?

Yes, it appears to be doing just that. At the end of __init__ it calls self.CalculatePersistentStats() which loops over all the keys in the message_db. The author anticipated this in the docstring:

    Calculate the statistics totals (i.e. not this session).

    This is done by running through the messageinfo database and
    adding up the various information. This could get quite time
    consuming if the messageinfo database gets very large, so
    some consideration should perhaps be made about what to do
    then.

It might be worth deferring that call until it's really needed (say, in GetStats()).

Skip
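The deferral suggested above could look roughly like the following. This is only a minimal sketch of lazy calculation, not the actual bayes_stats code: the class `LazyStats`, its method names, the `scans` counter, and the plain dict standing in for the message database are all assumptions made for illustration.

```python
class LazyStats:
    """Hypothetical stand-in for bayes_stats.Stats; not the real class."""

    def __init__(self, message_db):
        self.message_db = message_db
        self._persistent = None  # totals are not computed at construction
        self.scans = 0           # counts full passes, for illustration only

    def _calculate_persistent_stats(self):
        # The expensive part: one pass over every record in the database.
        totals = {"ham": 0, "spam": 0, "unsure": 0}
        for key in self.message_db:
            totals[self.message_db[key]] += 1
        self.scans += 1
        return totals

    def get_stats(self):
        # Pay for the scan only the first time the stats are requested.
        if self._persistent is None:
            self._persistent = self._calculate_persistent_stats()
        return dict(self._persistent)


# A plain dict plays the role of the message database here.
db = {"msg1": "ham", "msg2": "spam", "msg3": "ham"}
stats = LazyStats(db)        # construction touches no records
first = stats.get_stats()    # the first request triggers the only scan
second = stats.get_stats()   # later requests reuse the computed totals
```

With this shape, startup cost no longer depends on the size of the database; the cost moves to the first time someone actually asks for the statistics.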
On 12/20/06, skip@pobox.com <skip@pobox.com> wrote:
Mark> Specifically, this line in addin.py is the culprit:
Mark>   self.stats = bayes_stats.Stats(bayes_options,
Mark>                                  self.classifier_data.message_db)

[...]

It might be worth deferring that call until it's really needed (say, in GetStats()).
The Stats object tracks two types of statistics: statistics for the current session and total statistics across all Outlook sessions. The total statistics are calculated as the value of the persistent statistics plus the accumulated statistics for the current session.

The persistent statistics need to be totalled up before we start accumulating anything into the session statistics using the RecordClassification or RecordTraining methods. Otherwise, session stats accumulated up to the point where the persistent stats are calculated will be included twice.

We can probably still defer the call if we are smart about the relationship between the persistent stats and session stats. At whatever point we actually calculate the value of the persistent stats, we need to be aware that the session statistics accumulated up to that point are already included in the message db and subtract those values from the persistent statistics values.

Of course, this only solves part of the problem because we would still take a huge hit when displaying the statistics. It might be worth considering an optimization to store the actual statistics values instead of calculating them at the start of every Outlook session. The reason the stats are calculated from the message db is so that the user can reset the starting date for the statistics and still get accurate results. We could recalculate the persistent statistics only when the user changes the start date for the statistics, and store the summary values as a separate record in the message db or in a separate statistics db file.

I've been incredibly swamped lately with the work that pays the bills, but I'll try to find some time over the holidays to take a look at this.

--
Kenny Pitt
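To make the double-counting concern concrete, here is a small sketch of the "subtract what the session already recorded" idea. It is a simplified model, not SpamBayes code: `DeferredStats`, `record_classification`, and `get_totals` are hypothetical names, the message db is modelled as a dict from message id to label, and each message is assumed to be recorded exactly once.

```python
from collections import Counter


class DeferredStats:
    """Hypothetical model of the session/persistent split; not SpamBayes code."""

    def __init__(self, message_db):
        self.message_db = message_db   # maps message id -> classification
        self.session = Counter()       # counts recorded during this session
        self._persistent = None        # totals from before this session

    def record_classification(self, msg_id, label):
        # Each classification lands in the database and in the session counts.
        self.message_db[msg_id] = label
        self.session[label] += 1

    def _persistent_stats(self):
        if self._persistent is None:
            scanned = Counter(self.message_db.values())
            # The scan already saw this session's messages, so remove them
            # to avoid counting them twice in the totals.
            scanned.subtract(self.session)
            self._persistent = scanned
        return self._persistent

    def get_totals(self):
        # Totals are the persistent counts plus the session counts.
        totals = Counter(self._persistent_stats())
        totals.update(self.session)
        return totals


db = {"a": "ham", "b": "spam"}              # records from earlier sessions
stats = DeferredStats(db)
stats.record_classification("c", "spam")    # recorded before any scan
totals_after_one = stats.get_totals()       # scan happens here, late
stats.record_classification("d", "ham")
totals_after_two = stats.get_totals()
```

Without the `subtract` step, the message classified before the late scan would appear once in the scanned totals and once more in the session counts.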
Kenny> Of course, this only solves part of the problem because we would
Kenny> still take a huge hit when displaying the statistics. It might be
Kenny> worth considering an optimization to store the actual statistics
Kenny> values instead of calculating them at the start of every Outlook
Kenny> session.

That occurred to me after my reply. I suspect it's probably the way to go.

Skip
On 12/20/06, skip@pobox.com <skip@pobox.com> wrote:
Kenny> Of course, this only solves part of the problem because we would
Kenny> still take a huge hit when displaying the statistics. It might be
Kenny> worth considering an optimization to store the actual statistics
Kenny> values instead of calculating them at the start of every Outlook
Kenny> session.
That occurred to me after my reply. I suspect it's probably the way to go.
I checked in an initial update to delay the calculation of the persistent stats until the GetStats() call because that was the easy update. In the case where you never actually view the stats in SpamBayes Manager, this should help. Let me know if you see any oddities in the stats calculation after this.

The complete fix is a little more involved, so I'll have to defer that until I have more time to test it thoroughly.

--
Kenny Pitt
I checked in an initial update to delay the calculation of the persistent stats until the GetStats() call because that was the easy update. In the case where you never actually view the stats in SpamBayes Manager, this should help. Let me know if you see any oddities in the stats calculation after this.
Thanks Kenny! I don't personally check the stats often enough to notice, but it certainly solved my problem :) Cheers, Mark.
On 12/20/06, Kenny Pitt <kenny.pitt@gmail.com> wrote:
The complete fix is a little more involved, so I'll have to defer that until I have more time to test it thoroughly.
I just checked in an update to add permanent caching of the statistics. With an existing message info db that doesn't yet contain the cached statistics, you'll have the old startup delay one time to recalculate the missing statistics. After that, the statistics should be reloaded directly from the cache record in the message info db and startup will be much faster.

There is a minimal performance hit on each message classification because I have to update the statistics in the db every time to keep them in sync. I think this will be pretty much unnoticeable in the grand scheme of things, but let me know if you find otherwise.

--
Kenny Pitt
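For readers following along, the caching behaviour described above can be sketched as below. This is an illustration under assumptions, not the checked-in change: the key name `STATS_KEY` and both function names are hypothetical, and a plain dict stands in for the message info db (a real on-disk db would also need the record serialized).

```python
STATS_KEY = "__cached_stats__"   # hypothetical key name for the cache record


def load_stats(message_db):
    """Return cached totals, rebuilding them once if the record is missing."""
    cached = message_db.get(STATS_KEY)
    if cached is None:
        # One-time slow path: scan every message record to rebuild totals.
        cached = {"ham": 0, "spam": 0, "unsure": 0}
        for key, label in message_db.items():
            if key != STATS_KEY:
                cached[label] += 1
        message_db[STATS_KEY] = cached
    return cached


def record_classification(message_db, msg_id, label):
    """Store the message and keep the cached totals in sync."""
    stats = load_stats(message_db)
    message_db[msg_id] = label
    stats[label] += 1
    message_db[STATS_KEY] = stats   # write back, as an on-disk db would need


db = {"m1": "ham", "m2": "spam"}          # existing db without a cache record
initial = dict(load_stats(db))            # first load pays for the full scan
record_classification(db, "m3", "spam")   # small per-message update
updated = dict(load_stats(db))            # later loads read the cache record
```

The trade-off is the one described in the message: one extra small write per classified message, in exchange for not rescanning the whole db at startup.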
participants (3): Kenny Pitt, Mark Hammond, skip@pobox.com