Errors with PyPdf

Sun Sep 26 22:56:40 EDT 2010

On Sep 27, 12:49 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> On 27/09/2010 01:39, flebber wrote:
>
>
>
> > On Sep 27, 9:38 am, "w.g.sned... at gmail.com"<w.g.sned... at gmail.com>
> > wrote:
> >> On Sep 26, 7:10 pm, flebber<flebber.c... at gmail.com>  wrote:
>
> >>> I was trying to use pyPdf following a recipe from the ActiveState
> >>> cookbook. However, I cannot get it to work. I am unsure whether it is me
> >>> or whether it is because sets are deprecated.
>
> >>> I have placed a PDF on my C:\ drive. It is called
> >>> "Components-of-Dot-NET.pdf". You could use anything; I was just testing with it.
>
> >>> I was using the last script on that page, the most recently
> >>> updated one. I am using Python 2.6.
>
> >>>http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-co...
>
> >>> import pyPdf
>
> >>> def getPDFContent(path):
> >>>      content = "C:\Components-of-Dot-NET.pdf"
> >>>      # Load PDF into pyPDF
> >>>      pdf = pyPdf.PdfFileReader(file(path, "rb"))
> >>>      # Iterate pages
> >>>      for i in range(0, pdf.getNumPages()):
> >>>          # Extract text from page and add to content
> >>>          content += pdf.getPage(i).extractText() + "\n"
> >>>      # Collapse whitespace
> >>>      content = " ".join(content.replace(u"\xa0", " ").strip().split())
> >>>      return content
>
> >>> print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
> >>> "ignore")
>
> >>> This is my error.
>
> >>> Warning (from warnings module):
> >>>    File "C:\Documents and Settings\Family\Application Data\Python
> >>> \Python26\site-packages\pyPdf\pdf.py", line 52
> >>>      from sets import ImmutableSet
> >>> DeprecationWarning: the sets module is deprecated
>
> >>> Traceback (most recent call last):
> >>>    File "C:/Python26/Pdfread", line 15, in <module>
> >>>      print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
> >>> "ignore")
> >>>    File "C:/Python26/Pdfread", line 6, in getPDFContent
> >>>      pdf = pyPdf.PdfFileReader(file(path, "rb"))
>
> >> --->  IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
>
> >> Looks like an issue with finding the file.
> >> How do you pass the path?
>
> > Okay, thanks. I thought that when I set content here
>
> > def getPDFContent(path):
> >      content = "C:\Components-of-Dot-NET.pdf"
>
> > that I was defining where the file is.
>
> > But yes, I updated the script to the version below and it works; that is,
> > the contents are displayed in the interpreter. How do I output to a .txt file?
>
> > import pyPdf
>
> > def getPDFContent(path):
> >      content = "C:\Components-of-Dot-NET.pdf"
>
> That simply binds to a local name; 'content' is a local variable in the
> function 'getPDFContent'.
>
> >      # Load PDF into pyPDF
> >      pdf = pyPdf.PdfFileReader(file(path, "rb"))
>
> You're opening a file whose path is in 'path'.
>
> >      # Iterate pages
> >      for i in range(0, pdf.getNumPages()):
> >          # Extract text from page and add to content
> >          content += pdf.getPage(i).extractText() + "\n"
>
> That appends to 'content'.
>
> >      # Collapse whitespace
>
> 'content' now contains the text of the PDF, starting with
> r"C:\Components-of-Dot-NET.pdf".
>
> >      content = " ".join(content.replace(u"\xa0", " ").strip().split())
> >      return content
>
> > print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii",
> > "ignore")
>
> Outputting to a .txt file is simple: open the file for writing using
> 'open', write the string to it, and then close it.
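
The three steps described here (open, write, close) can be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the helper name save_text and the output file name are hypothetical, and a short literal string stands in for the text that getPDFContent would return.

```python
import os
import tempfile


def save_text(text, out_path):
    """Write a string to a text file and close the file afterwards."""
    # The 'with' block closes the file even if write() raises.
    with open(out_path, "w") as out_file:
        out_file.write(text)


# Stand-in for the string that getPDFContent() would return.
sample_text = "Text extracted from the PDF"

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "pdf_output.txt")
save_text(sample_text, out_path)
```

Using a 'with' block is a design choice rather than a requirement; calling close() explicitly after write() follows the same steps.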

That's what I was trying to do with

open('x.txt', 'w').write(content)

The rest of the script works, but it won't output the text.
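
One possible explanation, offered as an assumption because the full updated script is not shown here: 'content' exists only inside getPDFContent, so a write call placed outside the function, or after its 'return' line, never sees the text. A minimal sketch of capturing the return value and writing that instead is below. The function fake_extract is a hypothetical stand-in for the pyPdf page loop so that the example runs without a PDF, and x.txt is written to a temporary directory.

```python
import os
import tempfile


def fake_extract(path):
    # Hypothetical stand-in for the pyPdf loop in getPDFContent().
    content = ""  # local name: it disappears when the function returns
    for page_text in ["first  page", "second\npage"]:
        content += page_text + "\n"
    # Collapse whitespace, as the original function does.
    return " ".join(content.strip().split())


# Capture the return value; 'content' itself is not visible out here.
text = fake_extract("Components-of-Dot-NET.pdf")

# Write the captured string, then let the 'with' block close the file.
out_path = os.path.join(tempfile.gettempdir(), "x.txt")
with open(out_path, "w") as out_file:
    out_file.write(text)
```

If the real script raises a NameError on 'content' or silently writes nothing, that would be consistent with this guess, but the traceback would be needed to confirm it.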