Script to extract text from PDF files

Paul Hankin paul.hankin at gmail.com
Tue Sep 25 21:02:48 CEST 2007


On Sep 25, 6:41 pm, brad <byte8b... at gmail.com> wrote:
> I have a very crude Python script that extracts text from some (and I
> emphasize some) PDF documents. On many PDF docs, I cannot extract text,
> but this is because I'm doing something wrong. The PDF spec is large and
> complex and there are various ways in which to store and encode text. I
> wanted to post here and ask if anyone is interested in helping make the
> script better which means it should accurately extract text from most
> any pdf file... not just some.
>
> I know the topic of reading/extracting the text from a PDF document
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an option
> for me. If someone knows of a free native Python app that does this now,
> let me know and I'll use that instead!

Googling for 'pdf to text python' and following the first link gives
pyPdf, a pure-Python PDF library:
http://pybrary.net/pyPdf/
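
A rough sketch of how text extraction with that library might look,
assuming its PdfFileReader class and the getNumPages(), getPage() and
extractText() methods. The helper names extract_text and
extract_text_from_file are made up for illustration, and none of this
has been tested against your documents, so PDFs with unusual encodings
may still come out empty or garbled:

```python
def extract_text(reader):
    """Join the text of every page of a pyPdf-style reader object.

    `reader` is assumed to offer getNumPages() and getPage(i), and each
    page is assumed to offer extractText().
    """
    pages = []
    for i in range(reader.getNumPages()):
        pages.append(reader.getPage(i).extractText())
    return "\n".join(pages)


def extract_text_from_file(path):
    """Hypothetical convenience wrapper: open `path` and extract its text."""
    # Imported here so the module loads even where pyPdf is not installed.
    from pyPdf import PdfFileReader  # assumption: pyPdf is installed

    # The file has to stay open while pages are read from it.
    with open(path, "rb") as f:
        return extract_text(PdfFileReader(f))
```

Calling extract_text_from_file("some.pdf") would then return one string
with the pages separated by newlines, with no external tools involved.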

--
Paul Hankin



