Re: [spambayes-dev] effective tokenizer for wiki text
>> So far, I think most of us have bent our input to look like email.
>> I think that would be a lot easier than writing and debugging a new
>> tokenizer.
[Matt]
Yes, I think it would be fine to start testing the filter that way, but I figured since the custom tokenizer had been suggested it was worth looking into what would be required and what the advantages might be.
[Skip]
Maybe subclass tokenizer.Tokenizer and override the tokenize method?
That's all that's needed. Just changing:

    def tokenize(self, obj):
        msg = self.get_message(obj)
        for tok in self.tokenize_headers(msg):
            yield tok
        for tok in self.tokenize_body(msg):
            yield tok

to

    def tokenize(self, obj):
        text = obj
        # The rest of this is from tokenize_body.

        # Replace numeric character entities (like &#97; for the letter
        # 'a').
        text = numeric_entity_re.sub(numeric_entity_replacer, text)

        # Normalize case.
        text = text.lower()

        if options["Tokenizer", "replace_nonascii_chars"]:
            # Replace high-bit chars and control chars with '?'.
            text = text.translate(non_ascii_translate_tab)

        for t in find_html_virus_clues(text):
            yield "virus:%s" % t

        # Get rid of uuencoded sections, embedded URLs, <style gimmicks,
        # and HTML comments.
        for cracker in (crack_uuencode,
                        crack_urls,
                        crack_html_style,
                        crack_html_comment,
                        crack_noframes):
            text, tokens = cracker(text)
            for t in tokens:
                yield t

        # Remove HTML/XML tags.  Also &nbsp;.  <br> and <p> tags should
        # create a space too.
        text = breaking_entity_re.sub(' ', text)

        # It's important to eliminate HTML tags rather than, e.g.,
        # replace them with a blank (as this code used to do), else
        # simple tricks like
        #    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
        # can be used to disguise words.  <br> and <p> were special-
        # cased just above (because browsers break text on those,
        # they can't be used to hide words effectively).
        text = html_re.sub('', text)

        # Tokenize everything in the body.
        for w in text.split():
            n = len(w)
            # Make sure this range matches in tokenize_word().
            if 3 <= n <= maxword:
                yield w
            elif n >= 3:
                for t in tokenize_word(w):
                    yield t

should be enough to skip header tokenization (and not have to worry about putting headers or a blank line in front of the content) and skip the decoding parts of the tokenization (I assume the wiki content will be plain text and not application/octet-stream, base64, qp, etc.). The code that deals with HTML should probably be replaced with code that deals with Trac's wiki formatting.
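[A standalone sketch of the override Skip describes. The `Tokenizer` stub here is a hypothetical stand-in reproducing only the `tokenize()` shape quoted above (the real base class lives in spambayes.tokenizer), and `WikiTokenizer` and its length cap are illustrative names, not part of SpamBayes.]

```python
# Hypothetical stand-in for spambayes.tokenizer.Tokenizer, reproducing
# only the tokenize() shape quoted above so the override is visible.
class Tokenizer:
    def get_message(self, obj):
        raise NotImplementedError

    def tokenize_headers(self, msg):
        raise NotImplementedError

    def tokenize_body(self, msg):
        raise NotImplementedError

    def tokenize(self, obj):
        # Email path: decode the message, then walk headers and body.
        msg = self.get_message(obj)
        for tok in self.tokenize_headers(msg):
            yield tok
        for tok in self.tokenize_body(msg):
            yield tok


class WikiTokenizer(Tokenizer):
    """Treat the input as raw wiki text: no headers, no MIME decoding."""

    MAXWORD = 12  # illustrative stand-in for the real maxword limit

    def tokenize(self, obj):
        text = obj.lower()
        for w in text.split():
            if 3 <= len(w) <= self.MAXWORD:
                yield w
            # The real code falls back to tokenize_word() for longer
            # words; omitted in this sketch.
```

Calling `list(WikiTokenizer().tokenize("Some Wiki page TEXT"))` would yield the lowercased words directly, with no header tokens mixed in.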
For email, SpamBayes gets rid of all tags, so Trac could similarly dump formatting characters ('', ''', and the like), or keep them (you'd have to test to see whether they were useful or not). Probably the code above that deals with uuencode, HTML styles, HTML comments, and breaking entities could be dropped as well.

=Tony.Meyer
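[Dumping Trac's formatting characters could look something like the sketch below, analogous to the `html_re` substitution above. The pattern is a rough, hypothetical approximation of a few Trac markers ('' italic, ''' bold, {{{ }}} blocks, = headings), not Trac's real grammar, and the 25-character cap is an assumed stand-in for maxword.]

```python
import re

MAXWORD = 25  # assumed stand-in for the real maxword limit

# Rough, hypothetical approximation of a few Trac formatting markers;
# not the real Trac grammar.
wiki_markup_re = re.compile(r"'{2,3}|=+|\{\{\{|\}\}\}")

def tokenize_wiki(text):
    """Yield tokens from plain wiki text: lowercase, strip markup
    markers, split on whitespace, keep words of a sensible length."""
    text = text.lower()
    text = wiki_markup_re.sub(' ', text)
    for w in text.split():
        if 3 <= len(w) <= MAXWORD:
            yield w
```

Whether stripping the markers (as here) beats keeping them as token material is exactly the thing you'd have to test, as suggested above.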
On Tue, 2006-10-31 at 13:51 +1300, Tony Meyer wrote:
[Matt]
Yes, I think it would be fine to start testing the filter that way, but I figured since the custom tokenizer had been suggested it was worth looking into what would be required and what the advantages might be.
[Skip]
Maybe subclass tokenizer.Tokenizer and override the tokenize method?
That's all that's needed. Just changing:
...snip...
should be enough to skip header tokenization (and not have to worry about putting headers or a blank line in front of the content) and skip the decoding parts of the tokenization (I assume the wiki content will be plain text and not application/octet, base64, qp, etc).
The code that deals with HTML should probably be replaced with code that deals with Trac's wiki formatting. For email, SpamBayes gets rid of all tags, so Trac could similarly dump formatting characters ('', ''', and the like), or keep them (you'd have to test to see whether they were useful or not). Probably the code above that deals with uuencode, HTML styles, HTML comments, and breaking entities could be dropped as well.
Thanks, that should give me a good starting point. I'll check back if I have any more questions.

--
Matt Good
participants (3)
- Matt Good
- skip@pobox.com
- Tony Meyer