[Spambayes] idea for tokenizer.crack_filename change
Tim Peters
tim.one@comcast.net
Wed, 18 Sep 2002 20:38:32 -0400
[Neale Pickett]
> In going over some of my spam, I was surprised to see that the following
> wasn't penalized:
>
> ------=_NextPart_000_0039_0173A692.99A692D0
> Content-Type: application/octet-stream; name="Video.pif"
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment; filename="Video.pif"
The most likely reason for this is that you simply don't have many pif
files in your spam training set. Check it out.
> I can guarantee you that I've never been emailed a single .pif file
> from an actual human being :) But tokenizer.crack_filename only
> splits up filenames by path elements, so ".pif" never got scored.
Not so:
    def crack_filename(fname):
        yield "fname:" + fname
        components = fname_sep_re.split(fname)
        morethan1 = len(components) > 1
        for component in components:
            if morethan1:
                yield "fname comp:" + component
            pieces = urlsep_re.split(component)
            if len(pieces) > 1:
                for piece in pieces:
                    yield "fname piece:" + piece
fname_sep_re splits only on *path* separators: forward slash, backward
slash, and colon. Each component in turn is then split on urlsep_re, which
includes a wide variety of de jure and de facto URL metacharacters. '.' is
among them, and the pif here should be yielding a
'fname piece:pif'
token. If it isn't, there's some sort of bug. The filename as a whole
should have been extracted via the
    fname = msg.get_filename()
    if fname is not None:
        for x in crack_filename(fname):
            yield 'filename:' + x
portion of crack_content_xyz(). A
'content-disposition:attachment'
token should have been produced by the code just before that. Similarly for
Content-Type. As the comments say, though, Content-Transfer-Encoding is
ignored because test results showed that including it changed results in
minor ways, for both better and worse, across distinct test runs.
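One way to check whether those tokens are actually being produced is to run the two snippets by hand on the MIME part in question. What follows is only a sketch, not the real tokenizer: the urlsep_re character class is an assumed stand-in (all that matters here is that '.' is in it), filename_tokens is a hypothetical wrapper around the crack_content_xyz() fragment, and the base64 body is a dummy placeholder. Note that, as that fragment is written, every filename token also picks up a 'filename:' prefix.

```python
import email
import re

# Path separators only, as described above: '/', '\' and ':'.
fname_sep_re = re.compile(r'[/\\:]')

# ASSUMPTION: stand-in character class. The post only says urlsep_re
# covers URL metacharacters and that '.' is among them.
urlsep_re = re.compile(r'[;?:@&=+,$.]')

def crack_filename(fname):
    yield "fname:" + fname
    components = fname_sep_re.split(fname)
    morethan1 = len(components) > 1
    for component in components:
        if morethan1:
            yield "fname comp:" + component
        pieces = urlsep_re.split(component)
        if len(pieces) > 1:
            for piece in pieces:
                yield "fname piece:" + piece

def filename_tokens(part):
    # Hypothetical helper mirroring the fragment quoted from
    # crack_content_xyz(); the real function does more than this.
    fname = part.get_filename()
    if fname is not None:
        for x in crack_filename(fname):
            yield 'filename:' + x

# The headers from the spam in question, with a dummy body.
raw_part = (
    'Content-Type: application/octet-stream; name="Video.pif"\n'
    'Content-Transfer-Encoding: base64\n'
    'Content-Disposition: attachment; filename="Video.pif"\n'
    '\n'
    'AAAA\n'
)
part = email.message_from_string(raw_part)
tokens = list(filename_tokens(part))
for tok in tokens:
    print(tok)
# filename:fname:Video.pif
# filename:fname piece:Video
# filename:fname piece:pif
```

If a run over the real message doesn't show a pif token along these lines, that would point at a bug rather than at the regexp.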
>
> I suggest changing fname_sep_re to include ".", like so:
>
> fname_sep_re = re.compile(r'[./\\:]')
Nope. That's not what this regexp is for. If you're not seeing the tokens
mentioned above, there *is* a bug here, and I'd like to know about that.
But the mere fact that a pif token didn't make it into the list of best
discriminators for this message doesn't mean anything.
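To make the difference concrete, here is a small side-by-side sketch of the current and the proposed fname_sep_re. It is an illustration under assumptions: urlsep_re is again a stand-in character class containing '.', and crack_filename takes the separator regexp as an extra parameter purely so the two can be compared (the real function takes only fname).

```python
import re

# ASSUMPTION: stand-in for the tokenizer's urlsep_re; only the presence
# of '.' in the character class matters for this comparison.
urlsep_re = re.compile(r'[;?:@&=+,$.]')

current_sep_re = re.compile(r'[/\\:]')    # path separators only
proposed_sep_re = re.compile(r'[./\\:]')  # the suggested change adds '.'

def crack_filename(fname, fname_sep_re):
    # Same logic as the function quoted above, with the separator
    # regexp passed in (a hypothetical signature for comparison only).
    yield "fname:" + fname
    components = fname_sep_re.split(fname)
    morethan1 = len(components) > 1
    for component in components:
        if morethan1:
            yield "fname comp:" + component
        pieces = urlsep_re.split(component)
        if len(pieces) > 1:
            for piece in pieces:
                yield "fname piece:" + piece

with_current = list(crack_filename("Video.pif", current_sep_re))
with_proposed = list(crack_filename("Video.pif", proposed_sep_re))

print(with_current)
# ['fname:Video.pif', 'fname piece:Video', 'fname piece:pif']
print(with_proposed)
# ['fname:Video.pif', 'fname comp:Video', 'fname comp:pif']
```

Under these assumptions the extension gets its own token either way. The proposed regexp mainly relabels it from 'fname piece:pif' to 'fname comp:pif' and stops distinguishing path components from pieces within a component, which would be consistent with a change like this producing a tie.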
> Unfortunately, I can't back up my suspicion that this is a good idea, as
> it results in an across-the-board tie on my corpora. Maybe someone with
> larger corpora could try it out. (Tim?)
I did find value in what crack_filename did, else I wouldn't have added the
code <wink>. I don't know how much value I got specifically from finding
pif tokens, but I noticed once that finding .exe extensions seemed valuable
on a test run. As always, though, the idea is to tokenize everything and not
think too much <wink -- this is ironic given how much sweat has gone into
tokenizing in effective ways!>.