[spambayes-dev] one bigram nit
T. Alexander Popiel
popiel at wolfskeep.com
Tue Dec 16 12:41:48 EST 2003
In message: <16350.42801.651892.388851 at montanaro.dyndns.org>
Skip Montanaro <skip at pobox.com> writes:
>
>I see one compatibility problem with the bigram stuff. We currently have a
>key in the database called 'saved state' which stores a tuple: (db version,
>spamcount, hamcount). If that is ever generated as a bigram the database
>will get hosed. If backwards compatibility is an issue you might want to
>choose a different bigram connector than ' '. If backwards compatibility
>isn't a big deal, I'd bump the PICKLE_VERSION value and choose another value
>for the state key, probably a non-string object.
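To make the collision concrete, here is a minimal sketch. It assumes bigrams are built by joining adjacent words with a single space. The dict stands in for the real database, make_bigrams is a hypothetical helper rather than the actual tokenizer, and the counts in the state tuple are made up.

```python
STATE_KEY = 'saved state'  # reserved key named in the quoted message


def make_bigrams(words, connector=' '):
    """Join each adjacent pair of words with the connector (hypothetical helper)."""
    return [connector.join(pair) for pair in zip(words, words[1:])]


# Toy stand-in for the token database.
# The tuple is (db version, spamcount, hamcount); the values are invented.
db = {STATE_KEY: (1, 10, 20)}

bigrams = make_bigrams("we saved state yesterday".split())

# Any bigram that equals the reserved key would overwrite the state record
# if it were stored as an ordinary token.
colliding = [b for b in bigrams if b in db]
```

Here colliding ends up holding the bigram 'saved state', which is exactly the key that holds the state tuple.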
I'd actually take a different approach: we should prefix all "natural"
tokens (defined elsewhere as those tokens generated by the whitespace
split over the message body) with "body:", so that text in the body
cannot conflict with our synthetic tokens of any flavor. As it stands,
I think the literal words url:python and url:org in a message body would
be indistinguishable from the synthetic tokens we generate for
http://python.org, because nothing protects against naturals aliasing
synthetics...
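A rough sketch of the prefixing idea, under stated assumptions: tokenize_body and tokenize_url are hypothetical, simplified stand-ins (not the real tokenizer), and the url: tokens are assumed to come from splitting the host name on dots.

```python
def tokenize_body(text, prefix="body:"):
    """Whitespace-split the body and tag every natural token with a prefix."""
    return [prefix + word for word in text.split()]


def tokenize_url(url):
    """Generate synthetic url: tokens from the host part of a URL (simplified)."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return ["url:" + part for part in host.split(".")]


synthetic = set(tokenize_url("http://python.org"))
body_text = "see url:python and url:org for details"

# Without a prefix, words typed into the body alias the synthetic tokens.
unprefixed_overlap = synthetic & set(body_text.split())

# With the prefix, natural and synthetic tokens live in separate namespaces.
prefixed_overlap = synthetic & set(tokenize_body(body_text))
```

With no prefix the overlap contains both url:python and url:org. With the prefix it is empty, since every natural token starts with body: and no synthetic token does.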
Backwards compatibility is overrated; retraining is easy.
- Alex