[Spambayes] to From_ or not to From_?

Tim Peters tim.one@comcast.net
Sun, 29 Sep 2002 01:14:40 -0400

[Guido, on Unix "From " lines, and that Tim doesn't have any]
> Weird.  I used splitndirs.py to create my normalized test data setup
> and it wrote Unix From lines.  In fact, looking at the code, it uses
> str(msg), which forces unixfrom=1, which always writes a Unix From
> line.
> But it's possible that you created your data setup using a different
> version of splitndirs.py.

No, it's not this deep a mystery:  I wrote a little script whose only
purpose was to remove these lines, and to normalize line endings.  At the
start of this, I had no idea how the email package worked, and didn't know
whether having one corpus with these lines and one without was going to
screw up my results.  So I forced uniformity.  At the time, I also figured I
was the only person who would "suffer" from mixed-source corpora, so I didn't
bother polishing that script and checking it in.
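
For illustration only, here is a minimal sketch of what such a cleanup
step could look like.  It is not the script described above (which was
never checked in); the function name normalize_message is hypothetical,
and it assumes each message is already a single string holding one
message.

```python
def normalize_message(raw: str) -> str:
    """Return raw with LF line endings and no leading Unix "From " line.

    Hypothetical helper, written for illustration only.
    """
    # Normalize CRLF and bare CR line endings to LF.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # A Unix From_ line can only be the very first line of a message;
    # drop it (and its newline) when present.
    if text.startswith("From "):
        _, _, text = text.partition("\n")
    return text
```

Only the first line is examined, so a body line that happens to begin
with "From " is left untouched.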

> Anyway, the email package always recognizes a Unix From line (it's
> hard to mistake for an rfc822 header line) and stores it in a special
> attribute of the Message object.  Unless you wrote code in your
> tokenizer to look at that, I'm pretty sure you're ignoring it. :-)

Yup!  I realize that now.
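
A small sketch of the behavior described in the quoted text, using the
standard-library email package as it exists in current Python 3 (an
assumption; the 2002 version may differ in details).  The sample
message and addresses are made up.  The From_ line is kept apart from
the headers and is available through get_unixfrom(); unixfrom is passed
explicitly to as_string() because what str(msg) emits may vary between
versions of the package.

```python
import email

# Made-up sample message with a leading Unix From_ line.
raw = (
    "From sender@example.com Sun Sep 29 01:14:40 2002\n"
    "Subject: test\n"
    "\n"
    "body text\n"
)

msg = email.message_from_string(raw)

# The From_ line is stored separately from the rfc822 headers.
unixfrom = msg.get_unixfrom()
header_names = msg.keys()

# Serialize with and without the From_ line, asking for it explicitly.
with_from = msg.as_string(unixfrom=True)
without_from = msg.as_string(unixfrom=False)
```

Code that only walks the headers and the payload never sees the From_
line, which is why a tokenizer has to ask for it specifically.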

> So Skip can stop worrying: presence or absence of Unix From lines
> doesn't matter.