<br><br><div class="gmail_quote">On Wed, Jan 5, 2011 at 4:45 PM, Emile van Sebille <span dir="ltr">&lt;<a href="mailto:emile@fenx.com">emile@fenx.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

On 1/5/2011 3:12 PM <a href="mailto:kanthony@woh.rr.com" target="_blank">kanthony@woh.rr.com</a> said...<div class="im"><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I want to use Python to find all "\n" terminated<br>

strings in a PDF file, ideally returning string<br>

starting addresses.   Anyone willing to help?<br>

</blockquote>

<br></div>

pdflines = open(r'c:\shared\python_book_01.pdf').readlines()<br>

sps = [0]<br>

for ii in pdflines: sps.append(sps[-1]+len(ii))<br><font color="#888888">

<br>

Emile</font><div><div></div><div class="h5"><br>

<br>


</div></div></blockquote></div>Bear in mind that PDF files often contain compressed objects. If that is the case, I would recommend opening the PDF in binary mode and working out how to decompress (inflate) the relevant stream objects before doing any searching. PyPDF is a package that might help with this, though it could use some updating.
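<br><br>Here is a rough sketch of that idea using only the standard library. It assumes the streams of interest use plain Flate (zlib) compression and that the file is not encrypted; chained filters, object streams and other encodings are ignored. The names <code>line_offsets</code> and <code>inflated_streams</code> are made up for illustration and do not come from any PDF package.<br>

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords.
# Assumption: the payload itself never contains the word "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def line_offsets(data):
    """Return (start_offset, line) for every newline-terminated line in data."""
    results = []
    start = 0
    while True:
        end = data.find(b"\n", start)
        if end == -1:
            break  # trailing bytes without a newline are not reported
        results.append((start, data[start:end]))
        start = end + 1
    return results


def inflated_streams(data):
    """Yield (file_offset, decompressed_bytes) for streams zlib can decode."""
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream"
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or uses some other filter
        yield match.start(1), inflated
```

To use it, read the whole file with <code>open(path, 'rb').read()</code>, call <code>line_offsets</code> on the raw bytes for the uncompressed parts, and call it again on each result from <code>inflated_streams</code>. Note that offsets found inside a decompressed stream are relative to the decompressed data, not to the file.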