[Spambayes] Watch out for this
tim.one at comcast.net
Wed Sep 10 18:44:54 EDT 2003
[Balazs Attila Mihaly]
> as you know in html pages characters can be written as &#<number>;
> where <number> represents the ASCII (or maybe UNICODE - I'm not
> sure) code of the character. Now, if you don't convert these characters
> back to their corresponding values ...
spambayes already decodes numeric character entities. That's what
    # Replace numeric character entities (like &#97; for the letter
    # 'a').
    text = numeric_entity_re.sub(numeric_entity_replacer, text)
in Tokenizer.tokenize_body() does.
It's a relatively recent addition. I didn't see false negatives due to this
trick before adding the decoding, but did get a number of irritating Unsures
that were stopped cold by doing this decoding.
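
For readers who want to see the idea in isolation, here is a minimal sketch of that kind of decoding. It reuses the names numeric_entity_re and numeric_entity_replacer from the snippet above, but the definitions are assumptions for illustration and may not match the spambayes source; decode_numeric_entities is a hypothetical wrapper. It handles decimal entities only, not the hexadecimal &#x...; form.

```python
import re

# Assumed definition: matches decimal numeric entities such as &#97;
numeric_entity_re = re.compile(r"&#(\d+);")

def numeric_entity_replacer(m):
    """Return the character for a decimal entity, or '?' if the code
    point is out of range."""
    try:
        return chr(int(m.group(1)))
    except (ValueError, OverflowError):
        return "?"

def decode_numeric_entities(text):
    # Hypothetical helper: substitute every numeric entity in the text.
    return numeric_entity_re.sub(numeric_entity_replacer, text)

print(decode_numeric_entities("V&#105;agra"))  # prints "Viagra"
```

Decoding before tokenizing means an obfuscated word yields the same token as the plain spelling, rather than fragments the classifier has little evidence about.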