[Spambayes] Windows compatibility - OCR [was: Unwanted stock solicitations]

skip at pobox.com skip at pobox.com
Sun Nov 5 20:29:39 CET 2006


    Vibe> 4. Change this

    Vibe>    for line in open(orf):
    Vibe>        if line.startswith("lines"):
    Vibe>            nlines = int(line.split()[1])
    Vibe>            if nlines:
    Vibe>                ctokens.add("image-text-lines:%d" %
    Vibe>                            int(log2(nlines)))


    Vibe> into this

    Vibe>    nlines = ctext.count('\n')
    Vibe>    if nlines:
    Vibe>        ctokens.add("image-text-lines:%d" %
    Vibe>                    nlines )

Not the same:

    % ocrad -x out.txt -o ocr.txt logo.pgm
    % wc -l ocr.txt
           2 ocr.txt
    % cat out.txt
    # Ocr Results File. Created by GNU Ocrad version 0.15
    source file logo.pgm
    total text blocks 1
    text block 1 0 0 199 50
    lines 1
    line 1 chars 15 height 11
     26  29  5  2; 1, '-'0
     31  23  7 12; 1, ' '0
     38  22 10 13; 2, 'U'1, 'u'0
     51  25  8 10; 1, 'n'0
     62  22  2 12; 2, 'l'1, '|'0
     67  11 23 24; 0
     89  22  7 13; 1, 'h'0
     96  23  7 12; 1, ' '0
    103  21 11 14; 1, 'A'0
    118  25  5 10; 1, 'r'0
    125  22  8 13; 1, 'h'0
    136  25  8 10; 1, 'u'0
    146  25  9 10; 1, '5'0
    155  23  7 12; 1, ' '0
    162  29  5  2; 1, '-'0

Note that out.txt reports only one line of text ("lines 1") while wc -l
counts two lines in ocr.txt.  It appears that's simply an off-by-one issue
(maybe ocrad always adds a blank line to the end of its output text), though
I've only looked at the above case and one other.
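
To make the mismatch concrete, here is a minimal sketch that computes the line count both ways. The function names are made up for illustration, the export rows are copied from the out.txt above, and the OCR text is a hypothetical stand-in (one text line plus a trailing blank line), since the real contents of ocr.txt aren't shown here.

```python
from math import log2


def lines_from_export(export_text):
    """Return the count on the first 'lines N' row of an ocrad -x export, or 0."""
    for line in export_text.splitlines():
        if line.startswith("lines"):
            return int(line.split()[1])
    return 0


def lines_from_text(ctext):
    """Count newlines in the OCR text, as the proposed replacement does."""
    return ctext.count("\n")


def bucket_token(nlines):
    """Build the log2-bucketed token the way the original code does."""
    return "image-text-lines:%d" % int(log2(nlines))


# Header rows copied from the out.txt shown above.
EXPORT_SAMPLE = """\
# Ocr Results File. Created by GNU Ocrad version 0.15
source file logo.pgm
total text blocks 1
text block 1 0 0 199 50
lines 1
line 1 chars 15 height 11
"""

# Hypothetical stand-in for ocr.txt: one text line plus a trailing blank line.
OCR_TEXT_SAMPLE = "placeholder text\n\n"

from_export = lines_from_export(EXPORT_SAMPLE)
from_text = lines_from_text(OCR_TEXT_SAMPLE)
print(from_export, from_text)
print(bucket_token(from_export), bucket_token(from_text))
```

If the trailing blank line guess holds, the two counts differ by one, and for small counts that is enough to land in a different log2 bucket.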

Skip


