[Tutor] New newbie question. [PDFs and Python]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue, 9 Jul 2002 14:32:33 -0700 (PDT)


On Tue, 9 Jul 2002, SA wrote:

> On 7/9/02 2:27 PM, "Danny Yoo" <dyoo@hkn.eecs.berkeley.edu> wrote:
>
> >
> > On Tue, 9 Jul 2002, SA wrote:
> >
> >> Can you read a pdf with Python?
> >>
> >> I know you can read a text file with:
> >>
> >> Inp = open("textfile", "r")
> >>
> >> Will the same thing work on pdf files:
> >>
> >> Inp = open("pdffile", "rb")
> >
> > Yes, we can read from pdf's in binary format.
> >
>
> The only problem is when I try to read a pdf file using "rb", python
> then displays a lot of pdf gibberish instead of the text that is in the
> pdf file on the next web page.

Yes, this is because '.pdf' files aren't much use without software
that knows how to interpret them.

Just as Python is an interpreter for Python programs, it might be accurate
to say that Adobe Acrobat is an interpreter for PDF "programs".  One main
difference, though, is that PDF documents don't usually come in a
human-readable form.  Most of the page content is stored as binary data,
often compressed, so the raw bytes that open("pdffile", "rb") gives us
look like gibberish until a PDF-aware program decodes them.
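
As a small illustration of what binary mode does give us: PDF files
normally begin with the marker bytes "%PDF-" followed by a version number,
so we can at least check whether a file looks like a PDF before handing it
to another tool.  This is only a sketch, and the function name
looks_like_pdf is made up for this example:

```python
def looks_like_pdf(path):
    """Return True if the file starts with the usual PDF marker bytes.

    Hypothetical helper: it only inspects the first five bytes and does
    not check that the rest of the file is a valid PDF.
    """
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

Everything after that header still needs a real PDF interpreter to turn
into readable text.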



> Is there something else I need to do to read the text lines with this
> method, or do I need to just skip this and try to use pdftotext program
> instead?

I'd recommend using pdftotext for the moment: the program handles PDFs
pretty well, and other projects use it extensively to extract PDF text.


To make pdftotext work nicely as a Python function, we can do something
like this:

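###
import os

def extractPDFText(pdf_filename):
    """Given a PDF file name, returns a new file object of the
    text of that PDF.  Uses the 'pdftotext' utility."""
    # The trailing "-" tells pdftotext to write to standard output.
    # The double quotes keep file names with spaces in one piece.
    return os.popen('pdftotext "%s" -' % pdf_filename)
###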


Here's a demonstration:

###
>>> f = extractPDFText('ortuno02.pdf')
>>> text = f.read()
>>> print(text[:200])
EUROPHYSICS LETTERS
1 March 2002

Europhys. Lett., 57 (5), pp. 759-764 (2002)




Keyword detection in natural languages and DNA


###
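

If building a shell command string feels fragile, the same idea can be
sketched with the subprocess module from a current Python 3.  Passing the
arguments as a list means the file name never goes through the shell.
The function name extract_pdf_text and the "command" parameter are made up
for this sketch, and decoding the output as UTF-8 is an assumption about
how pdftotext is configured on your system:

```python
import subprocess


def extract_pdf_text(pdf_filename, command=("pdftotext",)):
    """Run pdftotext on pdf_filename and return the extracted text.

    'command' is a hypothetical parameter holding the program to run,
    so that a different path to pdftotext can be supplied if needed.
    Raises FileNotFoundError if the program is not installed, and
    subprocess.CalledProcessError if it exits with an error.
    """
    result = subprocess.run(
        # The trailing "-" asks pdftotext to write to standard output.
        [*command, pdf_filename, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_pdf_text('ortuno02.pdf')[:200] should then give the same
kind of result as the session above, provided pdftotext is on the PATH.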


It's not perfect, but it's a beginning.  *grin*  Hope this helps!