[spambayes-dev] RE: [Spambayes] How low can you go?

Tue Dec 23 20:24:49 EST 2003

In message:  <MHEGIFHMACFNNIMMBACAGEKJGPAA.nobody at spamcop.net>
             "Seth Goodman" <nobody at spamcop.net> writes:

>Thanks to all for replying.

Eh, I'm just satisfying one of my own vices: babbling at people while
scrambling to do the groundwork to back up my babble.

>Alex suggests that bidirectional maps are overkill and not to bother.

Hrm.  I think I'd rephrase to say that the maps are overkill for most
all of the individual tests/regimes that you might be interested in.
Furthermore, while we're just trying things out, it seems to make more
sense to do the tests individually as we come up with them, instead of
trying to make some over-arching generalization that could be used to
implement any of them.

>Alex also has some scripts that do much of what I am trying to do, but
>it sounds like they will only work in a procmail environment and not
>with Outlook, which is where I am stuck.

My scripts don't really work in a mail environment at all; they work
in an environment where data content (which happens to be RFC 822
formatted mail messages) is stored in files in a specific directory
structure with a special naming convention.  This structure is:

  Data/
    Ham/
      reservoir/
      Set1/
      Set2/
      ...
      SetN/
    Spam/
      reservoir/
      Set1/
      Set2/
      ...
      SetN/

Inside each of the bottom-level directories is a set of files named
with a 4-digit number, a dash, and a 6-digit number, such as 0267-045075.
The 4-digit number is a day-of-arrival indicator (for grouping vs.
periodic processes like the fixed retraining in the 'corrected' regime),
and the 6-digit number is a unique sequence number (for ordering all the
messages for behaviour-over-time analysis).
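
As a rough illustration of the naming convention, here is a minimal Python
sketch that splits such a name into its two parts and uses them for ordering
and grouping. The helper names (parse_message_name, order_by_arrival,
group_by_day) are hypothetical and are not part of the testtools.

```python
import re
from typing import Dict, Iterable, List, NamedTuple

# Hypothetical helpers for illustration; not part of the spambayes testtools.
# A name is a 4-digit day indicator, a dash, and a 6-digit sequence number.
NAME_RE = re.compile(r"(\d{4})-(\d{6})")


class MessageName(NamedTuple):
    day: int        # day-of-arrival indicator
    sequence: int   # unique sequence number


def parse_message_name(name: str) -> MessageName:
    """Split a name such as '0267-045075' into (day, sequence)."""
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError("not a DDDD-SSSSSS message name: %r" % (name,))
    return MessageName(int(m.group(1)), int(m.group(2)))


def order_by_arrival(names: Iterable[str]) -> List[str]:
    """Sort names by sequence number, i.e. in arrival order."""
    return sorted(names, key=lambda n: parse_message_name(n).sequence)


def group_by_day(names: Iterable[str]) -> Dict[int, List[str]]:
    """Group names by day indicator, keeping arrival order within a day."""
    groups: Dict[int, List[str]] = {}
    for name in order_by_arrival(names):
        groups.setdefault(parse_message_name(name).day, []).append(name)
    return groups
```

Grouping by the day number is what something like a once-a-day retraining
step would want; sorting by the sequence number gives the overall order.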

Note that the above structure can be used for Tim's cv tests, too;
his framework uses the directory hierarchy but doesn't care about the
file names.
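
For a quick look at what is in such a tree, something like the following
sketch could walk it. It assumes only the Data/Ham and Data/Spam layout shown
above; list_sets is a hypothetical name, not one of the existing tools.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def list_sets(data_root: str) -> Dict[Tuple[str, str], List[str]]:
    """Map (category, directory name) to the sorted file names it holds.

    Hypothetical helper: it relies only on the directory hierarchy
    (Data/Ham/..., Data/Spam/...) and ignores how the files are named.
    """
    result: Dict[Tuple[str, str], List[str]] = {}
    for category in ("Ham", "Spam"):
        category_dir = Path(data_root) / category
        if not category_dir.is_dir():
            continue
        for set_dir in sorted(category_dir.iterdir()):
            if set_dir.is_dir():
                result[(category, set_dir.name)] = sorted(
                    p.name for p in set_dir.iterdir() if p.is_file()
                )
    return result
```

Calling list_sets("Data") returns one entry per bottom-level directory,
including the reservoir directories.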

More information on how I generate and manipulate this structure is
in the incremental.HOWTO.txt in the testtools directory of the project.
Also, the README-DEVEL.txt in the root of the project explains a lot
more about this structure and the other tools for manipulating it.

>I run an Outlook client in IMO mode and fetch mail with POP3.

To get at your raw mail messages, I'd stick a POP3 proxy in there which
saved each message into a separate file... but I'm a protocol weenie,
and there might be easier ways to get at the data.
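
One of those easier ways might be a plain POP3 client rather than a proxy.
The sketch below uses Python's standard poplib module to fetch each message
and write it to its own file. It is not a proxy and not something from the
project; the function names, the day argument, and the use of the
per-session message number as the sequence number are all assumptions made
for the example.

```python
import os
import poplib
from typing import List


def save_message(directory: str, day: int, sequence: int,
                 lines: List[bytes]) -> str:
    """Write one raw message (a list of byte lines) to its own file.

    The file is named DDDD-SSSSSS, following the convention above.
    """
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "%04d-%06d" % (day, sequence))
    with open(path, "wb") as f:
        f.write(b"\n".join(lines) + b"\n")
    return path


def fetch_all(host: str, user: str, password: str, directory: str,
              day: int = 0, port: int = 110) -> List[str]:
    """Fetch every message in a POP3 mailbox and save each one to a file.

    Assumption: the POP3 message number within this session stands in for
    the sequence number. Messages are not deleted from the server.
    """
    conn = poplib.POP3(host, port)
    try:
        conn.user(user)
        conn.pass_(password)
        count, _size = conn.stat()
        paths = []
        for i in range(1, count + 1):
            _resp, lines, _octets = conn.retr(i)
            paths.append(save_message(directory, day, i, lines))
        return paths
    finally:
        conn.quit()
```

The files would still need to be sorted into Ham and Spam by hand (or by
whatever classification you already trust) before the test tools can use
them.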

>I understand that there are also a bunch of testing frameworks/harnesses
>checked in

Yes.  The testtools directory is your friend.

>and standard data sets to test against

This we do not have (in any significant quantity), for multiple reasons:

1) If we have a standard data set, then we'll end up with a tool that's
   good at classifying that data set, not random people's mail.

2) While sharing spam is fairly innocuous, sharing ham opens up all sorts
   of privacy concerns... and if we filter out private info from the stuff
   we share, then we're systematically neglecting a portion of the data
   we're trying to represent.

3) We seem to enjoy nagging each other into running tests on private
   datasets.  There seems to be some thought that if we nag enough people,
   someone will actually read the code that's being tested and point out
   where we're being stupid. <.5 wink>

>though it sounds like they don't work with Outlook, which is a real pity.

They don't really work with any mail handler, as mentioned above; instead,
they work on organized data, so you can rerun tests time and time again
after various fidgets and fixes.

The reason why Outlook is a particular problem is that Outlook mutilates
mail, irretrievably destroying the RFC 822 structure that it may have
once been delivered in.  A similar structure can theoretically be
recreated, but as with many recreations, some information (like the
separators used in MIME encapsulation, etc.) does not come out the same.

>So I'm again asking for direction in the initial, most important decisions.
>For testing message and hapax expiration with various training regimens
>under the Outlook environment (if that is even possible or reasonable):
>
>1) Do you recommend that I use the Outlook code base or ditch the Outlook
>plug-in and install the sbproxy version from source?  I hate to lose the
>integration and I don't even know if the proxy produces mbox-style mail
>folders that the myriad scripts already written can work with.

I'm strongly in favor of ditching Outlook entirely.

>2) Do you recommend I start with the existing database and modify it, or as
>Skip suggested, change over to a database that doesn't have the multi-thread
>corruption problem?

I'm not even sure if the test harnesses use a database backend at all;
I think they may be keeping everything in memory.  Dunno.  I haven't
looked at that in ages.

What I would suggest is starting with the existing test harnesses and
building from there.

>3) And finally, Skip previously suggested that I check out the CVS trunk.
>Is that still your recommendation?

Definitely.  Last I heard, there's a bunch of stuff (including all the
test info) that's in CVS but not in the binary distributions.

>Thanks for all your help.  I just want to avoid taking initial mis-steps
>that would make anything I put together useless to anybody else.  I also
>don't want to duplicate efforts that others who are experienced have already
>taken.

Reproducing what's gone before is useful.  Duplicating it is not so
useful.  Where the line is drawn between the two is something I'll
leave to someone else. ;-)

- Alex