Script to extract text from PDF files
paul.hankin at gmail.com
Tue Sep 25 21:02:48 CEST 2007
On Sep 25, 6:41 pm, brad <byte8b... at gmail.com> wrote:
> I have a very crude Python script that extracts text from some (and I
> emphasize some) PDF documents. On many PDF docs, I cannot extract text,
> but this is because I'm doing something wrong. The PDF spec is large and
> complex and there are various ways in which to store and encode text. I
> wanted to post here and ask if anyone is interested in helping make the
> script better, meaning it should accurately extract text from almost
> any PDF file... not just some.
> I know the topic of reading/extracting the text from a PDF document
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an option
> for me. If someone knows of a free native Python app that does this now,
> let me know and I'll use that instead!
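
To make the "various ways to store and encode text" point concrete, here
is a minimal sketch of what a crude pure-Python extractor might look like.
It is not brad's script (which is not shown in this post), and the name
extract_text is hypothetical. It assumes the text lives in content
streams that are either uncompressed or Flate (zlib) compressed, and that
it is drawn with the Tj/TJ operators using literal "(...)" strings. It
ignores font encodings, hex strings, nested parentheses, object streams
and encryption, so it will find nothing in many real PDFs.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of an object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text-showing operators: "(...) Tj" and "[(...) -20 (...)] TJ".
SHOW_RE = re.compile(
    rb"\((?:\\.|[^\\()])*\)\s*Tj|\[(?:\\.|[^\\\]])*\]\s*TJ", re.DOTALL
)
# Literal strings, allowing backslash escapes but not nested parentheses.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)")
# Only the three simplest escapes are undone: \( \) and \\
UNESCAPE_RE = re.compile(rb"\\([()\\])")


def extract_text(pdf_bytes):
    """Return the literal strings shown by Tj/TJ, one per line (crude)."""
    chunks = []
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # decompressobj tolerates the newline that precedes "endstream".
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # not Flate data (or another filter): scan the raw bytes
        for op in SHOW_RE.finditer(data):
            pieces = LITERAL_RE.findall(op.group(0))
            text = b"".join(UNESCAPE_RE.sub(rb"\1", p) for p in pieces)
            # Assumption: bytes map 1:1 to characters, which is often false.
            chunks.append(text.decode("latin-1"))
    return "\n".join(chunks)
```

Calling extract_text(open(path, "rb").read()) on a simple PDF would return
whatever literal strings it can see. Custom font encodings, hex strings
and other stream filters are where an approach like this breaks down.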
Googling for 'pdf to text python' and following the first link gives