Fw: PDF library for reading PDF files
claird at lairds.com
Mon Jan 19 14:04:34 CET 2004
In article <oxEOb.96911$Vs3.36407 at twister.socal.rr.com>,
Robert Kern <rkern at ucsd.edu> wrote:
>Cameron Laird wrote:
>> In article <Xns9474CBDE9B2D7cpl19ghumspamgourmet at 126.96.36.199>,
>> Harald Massa <cpl.19.ghum at spamgourmet.com> wrote:
>>>>I am looking for a library in Python that would read PDF files and I
>>>>could extract information from the PDF with it. I have searched with
>>>>google, but only found libraries that can be used to write PDF files.
>>>reportlab has a lib called pagecatcher; it is fully supported with python,
>>>it is not free.
>> ReportLab's libraries are great things--but they do not "extract
>> information from the PDF" in the sense I believe the original
>> questioner intended.
>No, but ReportLab (the company) has a product separate from reportlab
>(the package) called PageCatcher that does exactly what the OP asked
>for. It is not open source, however, and costs a chunk of change.
Let's take this one step further. Two posts now have
quite clearly recommended ReportLab's PageCatcher <URL:
http://reportlab.com/docs/pagecatcher-ds.pdf >. I
completely understand and agree that ReportLab supports
a mix of open-source, no-fee, and for-fee products, and
that PageCatcher carries a significant license fee. I
entirely agree that PageCatcher "read[s] PDF files ...
and ... extract[s] information from the PDF with it."
HOWEVER, I suspect that what the original questioner
meant by his words was some sort of PDF-to-text
"extraction" (true?) and, unless PageCatcher has changed a lot
since I got my last copy, PDF-to-text is NOT one of its
Cameron Laird <claird at phaseit.net>