Help in reading the pdf file
claird at lairds.us
Sat Mar 28 18:05:27 CET 2009
In article <mailman.2823.1238221222.11746.python-list at python.org>,
Gabriel Genellina <gagsl-py2 at yahoo.com.ar> wrote:
>On Thu, 26 Mar 2009 18:31:31 -0300, M Kumar <tomanishkb at gmail.com> wrote:
>> I need to read pdf files and extract data from it, is there any way to
>> do it through python.
>If you are interested in the text, I'd use ghostscript pdf2text (you may
>invoke it from inside python).
>Actually extracting text from a PDF is rather difficult. It's a
>"presentation" format (or "display" format); every word in the document
>might be absolutely positioned, there is no paragraph structure you can
I reinforce Gabriel's good advice with a few points of my own:
A. I used to maintain a page that tried to index
PDF text extractors. While I haven't maintained
it in years, it would take only a little
motivation for me to freshen it considerably.
B. My current favorite is pdftotext.
C. There are multiple "pdf2txt"-s, that is,
different products which share a name. Notice
Gabriel's qualification that he is thinking
of the *GS* one.
D. Many times the best way to automate a business
process involving PDF demands a trek farther
"upstream", that is, identification of the
source of a text *before* it was rendered as
PDF. Do you have access to such sources?
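
For what it's worth, here is a rough sketch of calling pdftotext from inside Python, along the lines Gabriel suggests. It assumes the pdftotext command-line tool (the one shipped with Xpdf or Poppler) is installed and on your PATH; the function names are my own invention for illustration, not part of any library.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Return the argument list for a pdftotext run that writes to stdout.

    Hypothetical helper, named here only for illustration.
    """
    command = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to preserve the physical page layout
        # as well as it can, which sometimes helps with columns and tables.
        command.append("-layout")
    # Giving "-" as the text-file argument sends the text to stdout.
    command.extend([pdf_path, "-"])
    return command


def extract_text(pdf_path, layout=False):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    # Assumption: the executable is called "pdftotext" and is on PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Decode leniently, since odd bytes can turn up in extracted text.
    return result.stdout.decode("utf-8", errors="replace")
```

Something like extract_text("report.pdf") (the file name is made up) would then hand back the text in one string. As Gabriel warns, expect the result to need cleanup, since the PDF carries no paragraph structure.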