[Tutor] PDF to TXT

Robert Berman bermanrl at cfl.rr.com
Sun Jan 23 17:56:14 CET 2011


Hi,

I am trying to convert .pdf files to .txt files. The script below is
mostly taken from a recipe I found through Google; it appears to be the
approach most consistently recommended
(http://code.activestate.com/recipes/577095-convert-pdf-to-plain-text/).

I am using Win 7, Python 2.7.1.
My code:

#pdf2txt.py
import sys
import pyPdf
import os

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + " \n"
    # Collapse whitespace
    # content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

def main():
    pdf = sys.argv[1]
    filedir,filename = os.path.split(pdf)
    nameonly = os.path.splitext(filename)
    newname = nameonly[0] + ".txt"
    outtxt = os.path.join(filedir,newname)
    f = open(outtxt,'w')
    f.write(getPDFContent(pdf))
    f.close()

main()
exit()

==============================================================================================================

The program runs for a while and then dies inside one of the pyPdf
functions with a decimal.InvalidOperation error.  The trace is below.
Any insight into how to resolve this will be most appreciated.

Thank you,

Robert

=======================================================================================================================
The trace I get is:
decimal.InvalidOperation: Invalid literal for Decimal: '.'
  File "C:\Users\bermanrl\Projects\ScriptSearch\testdir\pdf2txt.py", line 28, in <module>
    main()
  File "C:\Users\bermanrl\Projects\ScriptSearch\testdir\pdf2txt.py", line 25, in main
    f.write(getPDFContent(pdf))
  File "C:\Users\bermanrl\Projects\ScriptSearch\testdir\pdf2txt.py", line 13, in getPDFContent
    content += pdf.getPage(i).extractText() + " \n"
  File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\pdf.py", line 1381, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\pdf.py", line 1464, in __init__
    self.__parseContentStream(stream)
  File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\pdf.py", line 1503, in __parseContentStream
    operands.append(readObject(stream, None))
  File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\generic.py", line 87, in readObject
    return NumberObject.readFromStream(stream)
  File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\generic.py", line 234, in readFromStream
    return FloatObject(name)
  File "C:\Python27\Lib\site-packages\pyPdf-1.13-py2.7-win32.egg\pyPdf\generic.py", line 207, in __new__
    return decimal.Decimal.__new__(cls, str(value), context)
  File "C:\Python27\Lib\decimal.py", line 548, in __new__
    "Invalid literal for Decimal: %r" % value)
  File "C:\Python27\Lib\decimal.py", line 3844, in _raise_error
    raise error(explanation)
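
Reading the trace, the last pyPdf frame shows FloatObject passing the
string '.' to decimal.Decimal, which rejects it. The sketch below is not
a fix for pyPdf itself. It only reproduces that rejection with the
standard library and shows one possible workaround: skip any page whose
parsing raises the error and note its index. The names is_valid_decimal,
extract_all and page_extractors are hypothetical and are not part of
pyPdf.

```python
import decimal


def is_valid_decimal(token):
    """Return True if decimal.Decimal accepts the token, False otherwise."""
    try:
        decimal.Decimal(token)
        return True
    except decimal.InvalidOperation:
        return False


def extract_all(page_extractors):
    """Run each zero-argument extractor and skip pages whose parsing fails.

    page_extractors is a hypothetical list of callables, one per page.
    Returns the collected text and the indexes of the skipped pages.
    """
    content = ""
    skipped = []
    for index, extract in enumerate(page_extractors):
        try:
            content += extract() + " \n"
        except decimal.InvalidOperation:
            # Same exception type as in the trace; remember the page and move on.
            skipped.append(index)
    return content, skipped
```

With the script above, the list could presumably be built as
[pdf.getPage(i).extractText for i in range(pdf.getNumPages())], so that
one bad page no longer aborts the whole conversion. Whether dropping
those pages is acceptable depends on the documents involved.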


