[spambayes-dev] one bigram nit

T. Alexander Popiel popiel at wolfskeep.com
Tue Dec 16 12:41:48 EST 2003


In message:  <16350.42801.651892.388851 at montanaro.dyndns.org>
             Skip Montanaro <skip at pobox.com> writes:
>
>I see one compatibility problem with the bigram stuff.  We currently have a
>key in the database called 'saved state' which stores a tuple: (db version,
>spamcount, hamcount).  If that is ever generated as a bigram the database
>will get hosed.  If backwards compatibility is an issue you might want to
>choose a different bigram connector than ' '.  If backwards compatibility
>isn't a big deal, I'd bump the PICKLE_VERSION value and choose another value
>for the state key, probably a non-string object.

I'd actually take a different approach: we should prefix all "natural"
tokens (defined elsewhere as those tokens generated by the whitespace
split over the message body) with "body:", so that text in the body
cannot conflict with our synthetic tokens of any flavor.  As it stands,
I think that the literal words url:python and url:org appearing in a
message body would be indistinguishable from the synthetic tokens we
generate for http://python.org, just because we don't have any
protection against naturals aliasing synthetics...
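
A rough sketch of the idea in Python.  This is not the actual spambayes
tokenizer: the function names, the URL-cracking rule, and the way bigrams
are joined are all assumptions made up for illustration.  Only the
'saved state' key, the ' ' connector, and the "body:" prefix come from
this thread.

```python
import re

STATE_KEY = "saved state"  # reserved database key mentioned in this thread


def url_tokens(text):
    """Hypothetical synthetic tokens: 'url:<piece>' per dotted host piece."""
    for host in re.findall(r"https?://([^/\s]+)", text):
        for piece in host.split("."):
            yield "url:" + piece


def natural_tokens(text, prefix=""):
    """Whitespace-split body words, optionally tagged with a prefix."""
    for word in text.split():
        yield prefix + word


def bigrams(tokens, connector=" "):
    """Adjacent token pairs joined by the connector (assumed to be ' ')."""
    tokens = list(tokens)
    for first, second in zip(tokens, tokens[1:]):
        yield first + connector + second


def tokenize(text, prefix=""):
    """Union of naturals, their bigrams, and the synthetic url: tokens."""
    naturals = list(natural_tokens(text, prefix))
    return set(naturals) | set(bigrams(naturals)) | set(url_tokens(text))


# Without a prefix, body text can alias both the state key and url: tokens.
unsafe = tokenize("saved state url:python")

# With the prefix, every natural token (and every bigram built from
# naturals) starts with "body:", so it cannot equal either of them.
safe = tokenize("saved state url:python", prefix="body:")
```

Here unsafe contains both 'saved state' and 'url:python', while safe
contains neither; a real URL in the body still produces the synthetic
url: tokens as before.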

Backwards compatibility is overrated; retraining is easy.

- Alex


