Script to extract text from PDF files

Svenn Are Bjerkem svenn.bjerkem at googlemail.com
Wed Sep 26 16:49:10 EDT 2007


On Sep 25, 9:18 pm, byte8b... at gmail.com wrote:
> On Sep 25, 3:02 pm, Paul Hankin <paul.han... at gmail.com> wrote:
>
> > Googling for 'pdf to text python' and following the first link gives
> > http://pybrary.net/pyPdf/
>
> Doesn't work that well, I've tried it, you should too... the author
> even admits this:
>
> extractText() [#]
>
>     Locate all text drawing commands, in the order they are provided
> in the content stream, and extract the text. This works well for some
> PDF files, but poorly for others, depending on the generator used.
> This will be refined in the future. Do not rely on the order of text
> coming out of this function, as it will change if this function is
> made more sophisticated.
>
> Source: http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html

I downloaded and installed this package and found its text extraction
more or less useless. Comparing the code with the PDF spec shows a
very early implementation of text extraction. Luckily, it is possible
to override the base class's text-extraction method without having to
fiddle with the original code. I tried to contact the developer to
offer some help with implementing text extraction, but he didn't
answer my emails.
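
To illustrate the idea of replacing the extraction method without
editing the original code, here is a minimal sketch. It does not use
pyPdf itself: BasePage is a hypothetical stand-in for a library page
class, and its naive extractText() is my own simplification, not
pyPdf's actual code. The subclass handles three text operators from
the PDF spec (Tj, TJ and T*) and ignores everything else, including
escaped parentheses, hex strings and font encodings.

```python
import re


class BasePage:
    """Hypothetical stand-in for a library page class (not pyPdf's real code)."""

    def __init__(self, content):
        self.content = content  # decoded page content stream as a str

    def extractText(self):
        # Naive baseline: only the "(string) Tj" operator, in stream order.
        return "".join(re.findall(r"\(([^()]*)\)\s*Tj", self.content))


class Page(BasePage):
    """Overrides extractText() without touching the base class."""

    # One match per token: "(string) Tj", a "[...] TJ" array, or a "T*" line break.
    _TOKEN = re.compile(r"\(([^()]*)\)\s*Tj|\[(.*?)\]\s*TJ|(T\*)", re.S)

    def extractText(self):
        parts = []
        for tj, tj_array, newline in self._TOKEN.findall(self.content):
            if newline:
                parts.append("\n")
            elif tj_array:
                # A TJ array mixes strings with kerning numbers; keep the strings.
                parts.extend(re.findall(r"\(([^()]*)\)", tj_array))
            else:
                parts.append(tj)
        return "".join(parts)


# Made-up content stream for demonstration only.
sample = "BT /F1 12 Tf (Hello) Tj T* [(Wor) -20 (ld)] TJ ET"
print(repr(BasePage(sample).extractText()))  # 'Hello'
print(repr(Page(sample).extractText()))      # 'Hello\nWorld'
```

With a real library the same pattern would apply to whatever page
class it exposes; the class and attribute names above are assumptions
made for the example.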
--
Svenn



