[Spambayes] to From_ or not to From_?
Sun, 29 Sep 2002 01:14:40 -0400
[Guido, on Unix "From " lines, and that Tim doesn't have any]
> Weird. I used splitndirs.py to create my normalized test data setup
> and it wrote Unix From lines. In fact, looking at the code, it uses
> str(msg), which forces unixfrom=1, which always writes a Unix From line.
> But it's possible that you created your data setup using a different
> version of splitndirs.py.
No, it's not this deep a mystery: I wrote a little script whose only
purpose was to remove these lines, and to normalize line endings. At the
start of this, I had no idea how the email pkg worked, and didn't know
whether having one corpus with these lines and one without was going to
screw up my results. So I forced uniformity. At the time, I figured I was
the only person who would "suffer" from mixed-source corpora too, so didn't
bother polishing that script and checking it in.
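
For anyone with mixed-source corpora, a cleanup pass of that kind might look
roughly like the sketch below. This is not the script described above (that one
was never checked in). The function name strip_unix_from is made up for
illustration, and the sketch assumes each message is held in its own string
with the Unix From line, if present, as the very first line.

```python
def strip_unix_from(text):
    """Normalize line endings to "\n" and drop a leading Unix "From " line.

    Hypothetical helper for illustration only; assumes one message per string.
    """
    # Normalize CRLF and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A Unix From line starts with "From " (note the space, and no colon),
    # which is what distinguishes it from an rfc822 "From:" header.
    if text.startswith("From "):
        # Keep only what follows the first newline.
        _, _, text = text.partition("\n")
    return text


# Made-up sample messages, one with a Unix From line and one without.
with_from = "From a@example.com Sun Sep 29 01:14:40 2002\r\nSubject: x\r\n\r\nhi\r\n"
without_from = "Subject: x\n\nhi\n"

cleaned_a = strip_unix_from(with_from)
cleaned_b = strip_unix_from(without_from)
```

After this pass both sample messages come out identical, which is the kind of
uniformity being forced here.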
> Anyway, the email package always recognizes a Unix From line (it's
> hard to mistake for an rfc822 header line) and stores it in a special
> attribute of the Message object. Unless you wrote code in your
> tokenizer to look at that, I'm pretty sure you're ignoring it. :-)
Yup! I realize that now.
> So Skip can stop worrying: presence or absence of Unix From lines
> doesn't matter.
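
The behavior described above can be checked with a short sketch. The message
text and addresses are invented for illustration. One assumption to flag: the
sketch passes unixfrom=True to as_string() explicitly rather than relying on
str(msg), because in current Python 3 str(msg) uses the default unixfrom=False,
unlike the version of the email package discussed in this thread.

```python
import email

# Made-up message that begins with a Unix From line.
RAW = (
    "From sender@example.com Sun Sep 29 01:14:40 2002\n"
    "From: sender@example.com\n"
    "Subject: test\n"
    "\n"
    "Body text.\n"
)

msg = email.message_from_string(RAW)

# The parser keeps the Unix From line in a separate attribute...
unix_from = msg.get_unixfrom()

# ...so it never shows up among the rfc822 headers or in the body.
header_names = msg.keys()
body = msg.get_payload()

# Whether the line is written back out depends on the unixfrom flag.
with_from = msg.as_string(unixfrom=True)
without_from = msg.as_string(unixfrom=False)
```

A tokenizer that only walks the headers and payload sees the same thing
whether or not the line was in the file, unless it goes out of its way to call
get_unixfrom().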