pdf to text

Nils Oliver Kröger NO_Kroeger at gmx.de
Thu Jan 25 22:40:57 CET 2007


Have a look at PDFlib (www.pdflib.com). Their Text Extraction
Toolkit might be what you are looking for, though I'm not sure whether
you can use it independently of PDFlib itself.



tubby wrote:
> I know this question comes up a lot, so here goes again. I want to read 
> text from a PDF file, run re searches on the text, etc. I do not care 
> about layout, fonts, borders, etc. I just want the text. I've been 
> reading Adobe's PDF Reference Guide and I'm beginning to develop a 
> better understanding of PDF in general, but I need a bit of help... this 
> seems like it should be easier than it is. Here's some code:
> import zlib
> fp = open('test.pdf', 'rb')
> bytes = []
> while 1:
>      byte = fp.read(1)
>      #print byte
>      bytes.append(byte)
>      if not byte:
>          break
> for byte in bytes:
>      op = open('pdf.txt', 'a')
>      dco = zlib.decompressobj()
>      try:
>          s = dco.decompress(byte)
>          #print >> op, s
>          print s
>      except Exception, e:
>          print e
>      op.close()
> fp.close()
> I know the text is compressed... that it would have stream and endstream 
> markers and BT (Begin Text) and ET (End Text) and that the uncompressed 
> text is enclosed in parentheses (this is my text). Has anyone here done 
> this in a simple fashion? I've played with the pyPdf library some, but 
> it seems overly complex for my needs (merge PDFs, write PDFs, etc). I 
> just want a simple PDF text extractor.
> Thanks
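
The approach described in the question (find the stream/endstream
markers, inflate what lies between them, then pull the parenthesised
strings out of each BT ... ET block) can be sketched roughly as below.
This is only a minimal illustration, not a robust extractor: the
function name extract_text and the regular expressions are my own
assumptions, it assumes zlib (Flate) compressed content streams, and it
ignores hex strings, nested parentheses, font encodings and any other
stream filter.

```python
import re
import zlib

# Assumption: stream data sits between literal "stream" and "endstream"
# keywords. A real parser would use the /Length entry of the stream dict.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text objects in a content stream are wrapped in BT ... ET operators.
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Literal strings look like (text); backslash escapes are allowed inside.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# Only the escapes \( \) and \\ are undone here; others are left as-is.
ESCAPE_RE = re.compile(rb"\\([()\\])")


def extract_text(pdf_bytes):
    """Return the literal strings found in BT/ET blocks of inflated streams."""
    found = []
    for stream in STREAM_RE.finditer(pdf_bytes):
        try:
            # Decompress the whole stream at once, not byte by byte.
            # decompressobj() tolerates the end-of-line before "endstream".
            data = zlib.decompressobj().decompress(stream.group(1))
        except zlib.error:
            # Not zlib data (an image, another filter, ...): skip it.
            continue
        for block in TEXT_BLOCK_RE.finditer(data):
            for literal in STRING_RE.finditer(block.group(1)):
                raw = ESCAPE_RE.sub(rb"\1", literal.group(1))
                # Assumption: a single-byte encoding; real PDFs map
                # character codes through the font's encoding.
                found.append(raw.decode("latin-1"))
    return found
```

Something like extract_text(open('test.pdf', 'rb').read()) would then
give a list of strings to run re searches on. For many real-world files
the result will be incomplete or garbled, because the bytes inside the
parentheses are character codes for a particular font rather than plain
text, which is where a dedicated library earns its keep.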


