Windows compatibility - OCR [was: Unwanted stock solicitations]
Hi friends,

The OCR code has now been tweaked and tested to work on both WinXP and Win9x. It should work on Unix as well.

Here is a summary:

1. Put ocrad 0.16 in the path.

2. Change the following in ImageStripper.py

       ocr = os.popen("ocrad -s %s -c %s -x %s < %s 2>ocrerr.txt" %
                      (scale, charset, orf, pnmfile))

   into this

       ocr_cmd = ur'ocrad -s %s -c %s "%s"' % (scale, charset, pnmfile)
       # os.popen3() returns [stdin, stdout, stderr]
       ocr = os.popen3(ocr_cmd)[1]

3. Change this

       if os.path.exists(program) and is_executable(program):

   into this

       if os.path.exists(program + ".exe") or (
               os.path.exists(program) and is_executable(program)):

   Because of the way the expression is evaluated, it does not produce a fatal error even if the file is not found.

4. Change this

       for line in open(orf):
           if line.startswith("lines"):
               nlines = int(line.split()[1])
               if nlines:
                   ctokens.add("image-text-lines:%d" %
                               int(log2(nlines)))

   into this

       nlines = ctext.count('\n')
       if nlines:
           ctokens.add("image-text-lines:%d" % nlines)

5. Finally, I suggest you change the default scale from 1 to 2, as in this line

       scale = options["Tokenizer", "ocrad_scale"] or 2

Compile and enjoy.

Happy coding :)
Vibe
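For readers on current Python, where os.popen3() no longer exists, steps 2 and 3 can be sketched with the standard subprocess and shutil modules. This is only an illustration under stated assumptions, not the code in ImageStripper.py: the helper names find_program, build_ocr_command and run_ocr are made up here, and decoding the output as Latin-1 is a guess about what the chosen charset produces. Only the -s and -c flags from the message above are used.

```python
import shutil
import subprocess


def find_program(name):
    """Return the full path of `name` on PATH, or None if it is not found.

    shutil.which() also tries the PATHEXT suffixes (such as .exe) on
    Windows, so no separate ".exe" check is needed.
    """
    return shutil.which(name)


def build_ocr_command(program, pnmfile, scale, charset):
    """Build the argument list for the OCR program.

    Passing a list (rather than one shell string) means a file name
    containing spaces needs no manual quoting.
    """
    return [program, "-s", str(scale), "-c", charset, pnmfile]


def run_ocr(pnmfile, scale, charset, program="ocrad"):
    """Return the recognised text, or "" if the program cannot be run."""
    path = find_program(program)
    if path is None:
        return ""
    cmd = build_ocr_command(path, pnmfile, scale, charset)
    try:
        # Capture stdout only; stderr is discarded instead of being
        # redirected to a file by the shell.
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL)
    except OSError:
        return ""
    # Assumption: Latin-1 never fails to decode, so it is used here
    # regardless of the charset requested from the OCR program.
    return result.stdout.decode("latin-1", errors="replace")
```

A call such as run_ocr("image.pnm", 2, "ascii") returns an empty string when no OCR program is installed, so the caller simply gets no OCR tokens rather than an exception.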
On Sat, 2006-11-04 at 00:17 +0100, Vibe Grevsen wrote:
5. Finally I sugest you change the default scale from 1 to 2 like in this line
scale = options["Tokenizer", "ocrad_scale"] or 2
Hello, changing this surely doesn't hurt, but ocrad_scale is already set to 2 in Options.py, so the fallback in this line should probably be removed (or set to 2, as you suggest).
-- Luigi Pugnetti Symbolic S.p.A. V.le Mentana, 29 I-43100 Parma Italy Tel: +39 0521 708811 Fax: +39 0521 776190
On Sat, 2006-11-04 at 00:17 +0100, Vibe Grevsen wrote:
Hi friends,
OCR code's now been tweaked and tested to work in both WinXP and Win9x. This should work in unix as well.
Here is a summary:
1. Put ocrad 0.16 in the path
As a note, on Windows you need a copy of ocrad with Skip's patch that opens pnm files in binary mode; otherwise ocrad will fail on a lot of files.

Have you tried other OCR programs? I tried gocr and I think that its results are somewhat better, but version 0.41 + pgm patch almost hangs with some images (read: it takes a _very_ long time to complete and uses all the available CPU), and version 0.40 crashes on some other kinds of images. On Linux the latter outcome is somewhat better than the former (you get no tokens from the image but no other harm), but on Windows you get the Dr. Watson report windows, which block the process (of course I could disable them, but that is a system/user configuration).

-- Luigi Pugnetti Symbolic S.p.A. V.le Mentana, 29 I-43100 Parma Italy Tel: +39 0521 708811 Fax: +39 0521 776190
Vibe> 4. Change this
Vibe>     for line in open(orf):
Vibe>         if line.startswith("lines"):
Vibe>             nlines = int(line.split()[1])
Vibe>             if nlines:
Vibe>                 ctokens.add("image-text-lines:%d" %
Vibe>                             int(log2(nlines)))
Vibe> into this
Vibe>     nlines = ctext.count('\n')
Vibe>     if nlines:
Vibe>         ctokens.add("image-text-lines:%d" %
Vibe>                     nlines )

Not the same:

    % ocrad -x out.txt -o ocr.txt logo.pgm
    % wc -l ocr.txt
    2 ocr.txt
    % cat out.txt
    # Ocr Results File. Created by GNU Ocrad version 0.15
    source file logo.pgm
    total text blocks 1
    text block 1 0 0 199 50
    lines 1
    line 1 chars 15 height 11
    26 29 5 2; 1, '-'0
    31 23 7 12; 1, ' '0
    38 22 10 13; 2, 'U'1, 'u'0
    51 25 8 10; 1, 'n'0
    62 22 2 12; 2, 'l'1, '|'0
    67 11 23 24; 0
    89 22 7 13; 1, 'h'0
    96 23 7 12; 1, ' '0
    103 21 11 14; 1, 'A'0
    118 25 5 10; 1, 'r'0
    125 22 8 13; 1, 'h'0
    136 25 8 10; 1, 'u'0
    146 25 9 10; 1, '5'0
    155 23 7 12; 1, ' '0
    162 29 5 2; 1, '-'0

Note that the out.txt file suggests there is only one line in the file while the actual file contains two. It appears that's simply an off-by-one issue (maybe ocrad always adds a blank line to the end of its output text), though I've only looked at the above case and one other.

Skip
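To make the comparison concrete, here is a small sketch of the two ways of counting discussed above: reading the "lines N" record from the results file versus counting newlines in the recognised text. The function names are invented for illustration, the sample text standing in for ocr.txt is made up (the real contents were not posted), and summing over several text blocks is an assumption of this sketch rather than what the original loop does.

```python
from math import log2

# Header of the results file quoted above (per-character records omitted).
ORF_SAMPLE = """\
# Ocr Results File. Created by GNU Ocrad version 0.15
source file logo.pgm
total text blocks 1
text block 1 0 0 199 50
lines 1
line 1 chars 15 height 11
"""


def orf_line_count(orf_text):
    """Sum the counts of all 'lines N' records in a results file."""
    total = 0
    for line in orf_text.splitlines():
        parts = line.split()
        # 'line 1 chars 15 ...' starts with 'line', not 'lines', so it is skipped.
        if len(parts) == 2 and parts[0] == "lines" and parts[1].isdigit():
            total += int(parts[1])
    return total


def newline_count(text):
    """Count newline characters, which is what `wc -l` reports."""
    return text.count("\n")


def line_token(nlines):
    """Build the log2-bucketed token, or None when there are no lines."""
    if not nlines:
        return None
    return "image-text-lines:%d" % int(log2(nlines))


# Hypothetical OCR text: one line of text followed by one blank line.
sample_text = "some recognised text\n\n"
print(orf_line_count(ORF_SAMPLE), newline_count(sample_text))
```

Note also that the original code buckets the count with log2 while the replacement uses the raw count, so the two produce different tokens even when the counts agree.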
Hi there :)
Vibe> 4. Change this
Vibe>     for line in open(orf):
Vibe>         if line.startswith("lines"):
Vibe>             nlines = int(line.split()[1])
Vibe>             if nlines:
Vibe>                 ctokens.add("image-text-lines:%d" %
Vibe>                             int(log2(nlines)))
Vibe> into this
Vibe>     nlines = ctext.count('\n')
Vibe>     if nlines:
Vibe>         ctokens.add("image-text-lines:%d" %
Vibe>                     nlines )
Not the same: ... Note that the out.txt file suggests there is only one line in the file while the actual file contains two. It appears that's simply an off-by-one issue (maybe ocrad always adds a blank line to the end of its output text), though I've only looked at the above case and one other.
You're right. Simply off-by-one. Tested on five images.

    nlines = ctext.count('\n') - 1

I also noted that the line count was often different from the perceived line count (i.e. what you get if you look at the image and try to estimate the number of lines). If Python supports regexps, we could strip empty lines from the output before the count... It may be a good idea, but I suspect it is not significant.

Happy coding :)
Vibe
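Python does support regular expressions through its standard re module, although plain string methods are enough to skip empty lines. The following is a minimal sketch of both approaches; the function names are invented for illustration and this is not code from ImageStripper.py.

```python
import re


def count_text_lines(text):
    """Count lines that contain at least one non-whitespace character."""
    return sum(1 for line in text.splitlines() if line.strip())


def count_text_lines_re(text):
    """Same count using a regular expression.

    With re.MULTILINE, ^ and $ match at each line boundary; the \\S in the
    middle requires a non-whitespace character somewhere on the line.
    """
    return len(re.findall(r"^.*\S.*$", text, flags=re.MULTILINE))


# Made-up sample: two lines of text, one empty line, one whitespace-only line.
sample = "first line\n\nsecond line\n   \n"
print(count_text_lines(sample), count_text_lines_re(sample))
```

Counting this way does not depend on whether the OCR program appends a trailing blank line, so no "- 1" correction is needed.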
participants (3)
- Luigi Pugnetti
- skip@pobox.com
- Vibe Grevsen