[Spambayes] idea for tokenizer.crack_filename change
18 Sep 2002 16:34:31 -0700
In going over some of my spam, I was surprised to see that the ".pif" in
the following headers was never scored:
Content-Type: application/octet-stream; name="Video.pif"
Content-Disposition: attachment; filename="Video.pif"
I can guarantee you that I've never been emailed a single .pif file from
an actual human being :) But tokenizer.crack_filename only splits up
filenames by path elements, so ".pif" never got scored.
I suggest changing fname_sep_re to include ".", like so:
fname_sep_re = re.compile(r'[./\\:]')
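
To illustrate the effect, here is a minimal sketch comparing the two separator sets. It is not the actual crack_filename code: split_filename is a hypothetical helper, and old_sep_re is only my assumption of what a path-element-only pattern looks like.

```python
import re

# Assumed "path elements only" pattern (an assumption, not copied from tokenizer.py).
old_sep_re = re.compile(r'[/\\:]')
# Proposed pattern: the same separators plus ".".
new_sep_re = re.compile(r'[./\\:]')

def split_filename(fname, sep_re):
    """Hypothetical helper: split fname on sep_re and drop empty pieces."""
    return [piece for piece in sep_re.split(fname) if piece]

old_pieces = split_filename("Video.pif", old_sep_re)
new_pieces = split_filename("Video.pif", new_sep_re)

print(old_pieces)  # ['Video.pif']
print(new_pieces)  # ['Video', 'pif']
```

With the extra "." the extension becomes a piece of its own, so "pif" can pick up its own score instead of being buried inside "Video.pif".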
Unfortunately, I can't back up my suspicion that this is a good idea, as
it results in an across-the-board tie on my corpora. Maybe someone with
larger corpora could try it out. (Tim?)