Script to extract text from PDF files

byte8bits at gmail.com
Tue Sep 25 15:18:51 EDT 2007


On Sep 25, 3:02 pm, Paul Hankin <paul.han... at gmail.com> wrote:
> Googling for 'pdf to text python' and following the first link gives http://pybrary.net/pyPdf/

It doesn't work that well. I've tried it, and you should too. The
author even admits this:

extractText()

    Locate all text drawing commands, in the order they are provided
    in the content stream, and extract the text. This works well for some
    PDF files, but poorly for others, depending on the generator used.
    This will be refined in the future. Do not rely on the order of text
    coming out of this function, as it will change if this function is
    made more sophisticated.

Source: http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html
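
To see why "the order they are provided in the content stream" matters,
here is a rough sketch of that approach. It is NOT pyPdf's actual code:
the function name extract_text_naive and both sample streams are made up
for illustration, and it only handles the simple "(string) Tj" form of
the PDF text-showing operator, ignoring TJ arrays, fonts and encodings.

```python
import re

# A literal string operand followed by the Tj (show text) operator,
# e.g. "(Hello) Tj". Backslash-escaped characters inside the string
# are tolerated.
TJ_PATTERN = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj")


def extract_text_naive(content_stream):
    """Concatenate Tj operands in the order they appear in the stream.

    Hypothetical helper for illustration only; it ignores positioning
    operators such as Td, so the result follows stream order rather
    than the order a reader sees on the page.
    """
    pieces = []
    for match in TJ_PATTERN.finditer(content_stream):
        raw = match.group("text")
        # Undo the backslash escapes for parentheses and the backslash.
        pieces.append(re.sub(r"\\([()\\])", r"\1", raw))
    return "".join(pieces)


# Two made-up streams that would both render "Hello World" on one line.
# The second one draws the right-hand word first, then moves back left.
in_order = "BT /F1 12 Tf 72 720 Td (Hello ) Tj (World) Tj ET"
out_of_order = "BT /F1 12 Tf 110 720 Td (World) Tj -38 0 Td (Hello ) Tj ET"

print(repr(extract_text_naive(in_order)))
print(repr(extract_text_naive(out_of_order)))
```

The first stream comes out as 'Hello World' and the second as
'WorldHello ', even though both would look the same on the page. Which
order you get depends on whatever program generated the PDF.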



