[Spambayes] Windows compatibility - OCR [was: Unwanted stock solicitations]
skip at pobox.com
Sun Nov 5 20:29:39 CET 2006
Vibe> 4. Change this
Vibe>     for line in open(orf):
Vibe>         if line.startswith("lines"):
Vibe>             nlines = int(line.split()[1])
Vibe>             if nlines:
Vibe>                 ctokens.add("image-text-lines:%d" %
Vibe>                             int(log2(nlines)))
Vibe> into this
Vibe>     nlines = ctext.count('\n')
Vibe>     if nlines:
Vibe>         ctokens.add("image-text-lines:%d" %
Vibe>                     nlines )
Not the same:
% ocrad -x out.txt -o ocr.txt logo.pgm
% wc -l ocr.txt
2 ocr.txt
% cat out.txt
# Ocr Results File. Created by GNU Ocrad version 0.15
source file logo.pgm
total text blocks 1
text block 1 0 0 199 50
lines 1
line 1 chars 15 height 11
26 29 5 2; 1, '-'0
31 23 7 12; 1, ' '0
38 22 10 13; 2, 'U'1, 'u'0
51 25 8 10; 1, 'n'0
62 22 2 12; 2, 'l'1, '|'0
67 11 23 24; 0
89 22 7 13; 1, 'h'0
96 23 7 12; 1, ' '0
103 21 11 14; 1, 'A'0
118 25 5 10; 1, 'r'0
125 22 8 13; 1, 'h'0
136 25 8 10; 1, 'u'0
146 25 9 10; 1, '5'0
155 23 7 12; 1, ' '0
162 29 5 2; 1, '-'0
Note that out.txt reports only one line of text, while ocr.txt actually
contains two. It appears to be a simple off-by-one issue (maybe ocrad always
adds a blank line to the end of its output text), though I've only looked at
the above case and one other.
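
If the results file has to go, something along these lines might keep the two
counts in agreement. This is only a rough sketch, not what the tokenizer does
today: lines_from_orf, lines_from_text and line_token are made-up names,
summing the "lines N" records across text blocks is my assumption, and the
sample text is a stand-in rather than real ocrad output. It counts non-blank
lines so a trailing blank line is ignored, and it keeps the log2 bucketing
that the proposed replacement drops.

```python
from math import log2


def lines_from_orf(orf_text):
    """Count text lines reported in an ocrad results file (the -x output).

    Assumption: "lines N" records are summed across all text blocks.
    """
    total = 0
    for line in orf_text.splitlines():
        fields = line.split()
        # Match "lines N" only, not the per-line "line 1 chars 15 ..." records.
        if len(fields) == 2 and fields[0] == "lines":
            total += int(fields[1])
    return total


def lines_from_text(ocr_text):
    """Count non-blank lines in the recognized text (the -o output)."""
    return sum(1 for line in ocr_text.splitlines() if line.strip())


def line_token(nlines):
    """Build the token with the same log2 bucketing as the original code."""
    if nlines:
        return "image-text-lines:%d" % int(log2(nlines))
    return None


# Stand-in data shaped like the example above: one recognized line
# followed by a blank line, so the text has two newlines.
orf_sample = "total text blocks 1\ntext block 1 0 0 199 50\nlines 1\n"
text_sample = "recognized text\n\n"

print(lines_from_orf(orf_sample), lines_from_text(text_sample),
      text_sample.count("\n"))
print(line_token(lines_from_text(text_sample)))
```

With that stand-in, the results-file count and the non-blank count agree,
while count('\n') is one higher. Whether the extra line is always a single
trailing blank is still an open question given how few cases I've checked.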
Skip