Errors with PyPdf

Sun Sep 26 19:35:20 EDT 2010

On 27/09/2010 00:10, flebber wrote:
> I was trying to use pyPdf following a recipe from the ActiveState
> cookbooks. However, I cannot get it to work. I am unsure if it is me or
> if it is because sets are deprecated.
>
The 'sets' module pre-dates the built-in 'set' class. The warning is
just to inform you that the module will be removed in due course (it's
still in Python 2.7, but not Python 3), so you can still use it in
those versions.

> I have placed a PDF on my C:\ drive. It is called
> "Components-of-Dot-NET.pdf". You could use anything; I was just testing
> with it.
>
> I was using the last script on that page, the most recently updated
> one. I am using Python 2.6.
>
> http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/
>
> import pyPdf
>
> def getPDFContent(path):
>      content = "C:\Components-of-Dot-NET.pdf"
>      # Load PDF into pyPDF
>      pdf = pyPdf.PdfFileReader(file(path, "rb"))
>      # Iterate pages
>      for i in range(0, pdf.getNumPages()):
>          # Extract text from page and add to content
>          content += pdf.getPage(i).extractText() + "\n"
>      # Collapse whitespace
>      content = " ".join(content.replace(u"\xa0", " ").strip().split())
>      return content
>
> print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
>
> This is my error.
>
> >>>
>
> Warning (from warnings module):
>    File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
>      from sets import ImmutableSet
> DeprecationWarning: the sets module is deprecated
>
> Traceback (most recent call last):
>    File "C:/Python26/Pdfread", line 15, in <module>
>      print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
>    File "C:/Python26/Pdfread", line 6, in getPDFContent
>      pdf = pyPdf.PdfFileReader(file(path, "rb"))
> IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
> >>>

You put the file in C:\, but you didn't tell Python where it is. You
gave just the filename "Components-of-Dot-NET.pdf", and it's looking in
the current directory, which probably isn't C:\.
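
If you want to see where Python is actually looking, a quick check along
these lines should show it. This is only a sketch using the standard
'os' module; the variable names are mine, and the parenthesised print
calls are written so that they behave the same on Python 2.6 and later
versions:

```python
import os

filename = "Components-of-Dot-NET.pdf"

# A bare filename is resolved relative to the current working directory.
print(os.getcwd())

# This is the full path Python would try to open for that bare filename.
resolved = os.path.abspath(filename)
print(resolved)

# False unless the PDF happens to sit in the current directory.
print(os.path.isfile(resolved))
```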

Try providing the full pathname:

     print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
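
The r prefix makes it a raw string, so backslashes in the Windows path
are taken literally. It happens not to matter for "\C", which is not a
recognised escape, but it would for a path containing something like
"\n" or "\t". A small sketch with a hypothetical filename (new.pdf is
made up purely to show the effect):

```python
# Hypothetical file name, chosen only to show the escape problem.
plain = "C:\new.pdf"      # "\n" is read as a single newline character
raw = r"C:\new.pdf"       # raw string: the backslash is kept as-is
escaped = "C:\\new.pdf"   # doubling the backslash also works

print(len(plain))         # one character shorter than the raw form
print(len(raw))
print(raw == escaped)     # the raw and doubled forms are the same string
```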