Errors with PyPdf

Sun Sep 26 22:49:21 EDT 2010

On 27/09/2010 01:39, flebber wrote:
> On Sep 27, 9:38 am, "w.g.sned... at gmail.com"<w.g.sned... at gmail.com>
> wrote:
>> On Sep 26, 7:10 pm, flebber<flebber.c... at gmail.com>  wrote:
>>
>>> I was trying to use Pypdf following a recipe from the Activestate
>>> cookbooks. However I cannot get it to work. Unsure if it is me or it
>>> is because sets are deprecated.
>>
>>> I have placed a pdf in my C:\ drive. It is called
>>> "Components-of-Dot-NET.pdf". You could use anything; I was just
>>> testing with it.
>>
>>> I was using the last script on that page that was most recently
>>> updated. I am using python 2.6.
>>
>>> http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-co...
>>
>>> import pyPdf
>>
>>> def getPDFContent(path):
>>>      content = "C:\Components-of-Dot-NET.pdf"
>>>      # Load PDF into pyPDF
>>>      pdf = pyPdf.PdfFileReader(file(path, "rb"))
>>>      # Iterate pages
>>>      for i in range(0, pdf.getNumPages()):
>>>          # Extract text from page and add to content
>>>          content += pdf.getPage(i).extractText() + "\n"
>>>      # Collapse whitespace
>>>      content = " ".join(content.replace(u"\xa0", " ").strip().split())
>>>      return content
>>
>>> print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
>>> "ignore")
>>
>>> This is my error.
>>
>>> Warning (from warnings module):
>>>    File "C:\Documents and Settings\Family\Application Data\Python
>>> \Python26\site-packages\pyPdf\pdf.py", line 52
>>>      from sets import ImmutableSet
>>> DeprecationWarning: the sets module is deprecated
>>
>>> Traceback (most recent call last):
>>>    File "C:/Python26/Pdfread", line 15, in <module>
>>>      print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
>>> "ignore")
>>>    File "C:/Python26/Pdfread", line 6, in getPDFContent
>>>      pdf = pyPdf.PdfFileReader(file(path, "rb"))
>>
>> --->  IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
>>
>> Looks like an issue with finding the file.
>> How do you pass the path?
>
> okay thanks I thought that when I set content here
>
> def getPDFContent(path):
>      content = "C:\Components-of-Dot-NET.pdf"
>
> that I was defining where it is.
>
> But yeah, I updated the script to the version below and it works; that
> is, the contents are displayed in the interpreter. How do I output to a
> .txt file?
>
> import pyPdf
>
> def getPDFContent(path):
>      content = "C:\Components-of-Dot-NET.pdf"

That simply binds to a local name; 'content' is a local variable in the
function 'getPDFContent'.

>      # Load PDF into pyPDF
>      pdf = pyPdf.PdfFileReader(file(path, "rb"))

You're opening a file whose path is in 'path'.

>      # Iterate pages
>      for i in range(0, pdf.getNumPages()):
>          # Extract text from page and add to content
>          content += pdf.getPage(i).extractText() + "\n"

That appends to 'content'.

>      # Collapse whitespace

'content' now contains the text of the PDF, starting with 
r"C:\Components-of-Dot-NET.pdf".
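To keep the path out of the result, 'content' can start as an empty
string instead. Here is a minimal sketch of just the accumulation step.
It does not use pyPdf: the 'pages' list is made-up sample data standing
in for what extractText() would return for each page, and 'join_pages'
is a hypothetical name.

```python
# Made-up stand-in for the per-page strings extractText() would return.
pages = [u"Components of\xa0.NET", u"  Page two   text "]

def join_pages(page_texts):
    # Start with an empty string, not with the path of the PDF.
    content = u""
    for text in page_texts:
        content += text + u"\n"
    # Collapse whitespace, as in the quoted recipe.
    return u" ".join(content.replace(u"\xa0", u" ").strip().split())

result = join_pages(pages)
print(result)
```

With that change the returned text begins with the first page's text
rather than with the file name.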

>      content = " ".join(content.replace(u"\xa0", " ").strip().split())
>      return content
>
> print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii",
> "ignore")
>
Outputting to a .txt file is simple: open the file for writing using
'open', write the string to it, and then close it.
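
For example, something along these lines should do it. This is only a
sketch: 'save_text' and the output location are names I made up, and
'text' stands in for whatever getPDFContent returns. The 'with'
statement closes the file for you, even if the write fails.

```python
import io
import os
import tempfile

def save_text(text, out_path):
    # io.open takes an encoding on Python 2.6+ as well as 3.x;
    # the file is closed automatically when the 'with' block ends.
    with io.open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)

# Placeholder for the unicode string returned by getPDFContent(...).
text = u"Components of .NET"

# Hypothetical output location; use any writable path you like.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
save_text(text, out_path)
```

Writing the unicode string with an explicit encoding also means you no
longer need the .encode("ascii", "ignore") call that was only there for
printing.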