Reading Adobe PDF File

Mon Jan 30 08:22:13 EST 2012

On Sat, 2012-01-28 at 21:59 -0800, Chris Rebert wrote:
> On Sat, Jan 28, 2012 at 9:52 PM, Shrewd Investor <cltung at gmail.com> wrote:
> > I have a very large Adobe PDF file.  I was hoping to use a script to
> > extract the information from it.  Is there a way to loop through a PDF
> > file using Python?
> Haven't used it myself, but:
> http://www.unixuser.org/~euske/python/pdfminer/

pdfminer is very prone to hanging and/or crashing.  I haven't yet found a
really reliable way to read text from a PDF.

PyPDF provides a PdfFileReader class with an extractText method.  The
output is indeed the text although it can be a bit thorny to look at.
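
Looping over the pages looks roughly like the sketch below.  It assumes
the legacy pyPdf package (or a PyPDF2 release before 3.0, which kept the
same class and method names).  The helper names iter_page_text and
extract_pdf_text are my own and not part of the library.

```python
def iter_page_text(reader):
    """Yield (page_number, text) for each page of a PdfFileReader-like object."""
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        # extractText() returns whatever text the library can recover;
        # whitespace and line breaks are often lost along the way.
        yield index + 1, page.extractText()


def extract_pdf_text(path):
    """Return a list with one text string per page of the PDF at `path`."""
    # Imported here so iter_page_text stays usable without the library.
    # Assumption: pyPdf is installed; for old PyPDF2 releases, import
    # PdfFileReader from PyPDF2 instead.
    from pyPdf import PdfFileReader

    with open(path, "rb") as handle:
        reader = PdfFileReader(handle)
        return [text for _number, text in iter_page_text(reader)]
```

Going page by page means a very large file can be processed and written
out incrementally, and a page that extracts badly is easy to pin down by
its number.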

> > Or do I need to find a way to convert a PDF file into a text file?  If
> > so how?
> The pdf2txt.py script from the same package happens to do exactly this.
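
Given the hanging I mentioned, if you do try pdf2txt.py from a script it
may be worth running it with a time limit.  A minimal sketch, assuming
pdf2txt.py is on your PATH and using its -o option to name the output
file; the function names and the 120 second default are just
illustrative.

```python
import subprocess


def pdf2txt_command(pdf_path, txt_path, script="pdf2txt.py"):
    """Build the argument list for converting pdf_path to txt_path."""
    # Assumption: `script` is found on PATH; pass a full path otherwise.
    return [script, "-o", txt_path, pdf_path]


def convert_with_timeout(pdf_path, txt_path, timeout_seconds=120):
    """Return True if the conversion exited cleanly within the time limit."""
    try:
        result = subprocess.run(
            pdf2txt_command(pdf_path, txt_path),
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False
    return result.returncode == 0
```

A False result covers both a timeout and a non-zero exit status.  If the
script is not installed at all, subprocess.run raises FileNotFoundError
instead.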

-- 
System & Network Administrator [ LPI & NCLA ]
<http://www.whitemiceconsulting.com>
OpenGroupware Developer <http://www.opengroupware.us>
Adam Tauno Williams