Hi Chris,<div><br></div><div>Thanks for the fast reply and all the recommendations; they help me a lot!</div><div><div>As you recommended, I used the PDFMiner module to extract the text from the PDF files, and then with file.xreadlines() I located the lines where my keyword ("factors in this case") appears.</div>


<div>So far I extract just the lines, but I'm wondering whether it is possible to extract only the whole sentences in which my keywords ("factors in this case") occur.</div><div><br></div><div>I used the following script:</div>


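<div><br></div>
<div>import os, subprocess</div>
<div><br></div>
<div>path = "C:\\PDF"  # insert the path to the directory of interest here</div>
<div>for fname in os.listdir(path):</div>
<div>    if not fname.lower().endswith(".pdf"):</div>
<div>        continue  # skip anything that is not a PDF</div>
<div>    # splitext removes only the extension; rstrip(".pdf") strips a set of characters, not a suffix</div>
<div>    output = os.path.join(path, os.path.splitext(fname)[0] + ".txt")</div>
<div>    subprocess.call(["C:\\Python26\\python.exe", "pdf2txt.py", "-o", output, os.path.join(path, fname)])</div>
<div>    print(fname)</div>
<div>    infile = open(output)</div>
<div>    for line in infile:  # iterating over the file replaces the deprecated xreadlines()</div>
<div>        if "driving" in line:</div>
<div>            print(line)</div>
<div>    infile.close()</div>
<br>-------------------------------------------------------<br>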
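<div>For the whole-sentence question, here is a rough sketch of one possible approach, not a tested solution: join the extracted lines back into one string, split it at sentence-ending punctuation, and keep only the sentences that contain the keyword. The names sentences_with_keyword and SENTENCE_END and the sample text are made up for illustration, and the regex is a naive assumption about where sentences end (it will mis-split abbreviations such as "e.g.").</div>

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# This is deliberately naive and will mis-split abbreviations like "e.g.".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_with_keyword(text, keyword):
    """Return the sentences of `text` that contain `keyword` (case-insensitive)."""
    # Extracted PDF text is hard-wrapped, so collapse all whitespace
    # (including newlines) before looking for sentence boundaries.
    flat = " ".join(text.split())
    needle = keyword.lower()
    return [s for s in SENTENCE_END.split(flat) if needle in s.lower()]


# Made-up sample standing in for the contents of one extracted .txt file.
sample = ("Elevation is the most important\n"
          "determinant. Slope matters less. The main driving\n"
          "factors are listed below.")

for sentence in sentences_with_keyword(sample, "driving"):
    print(sentence)
```

<div>In the script above, this would presumably mean reading the whole output file with read() instead of looping over its lines, since a sentence can span several lines.</div>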


Robert Pazur<br>Mobile : +421 948 001 705<br>Skype  : ruegdeg<br>

<br><br><div class="gmail_quote">2011/5/6 Chris Rebert <span dir="ltr"><<a href="mailto:clp2@rebertia.com" target="_blank">clp2@rebertia.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div>On Thu, May 5, 2011 at 2:26 PM, Robert Pazur <<a href="mailto:pazurrobert@gmail.com" target="_blank">pazurrobert@gmail.com</a>> wrote:<br>

> Dear all,<br>

> I would like to access some text and count word occurrences, as follows.<br>

> I have a lot of PDFs of scientific articles, and I want to see<br>

> which words are usually associated with, for example, "determinants".<br>

> As an example, an article contains the sentence "....elevation is the most<br>

> important determinant....".<br>

> How can I acquire the "elevation" string?<br>

> Of course, I don't know where in the article the sentence is located or which<br>

> particular word it could be.<br>

> Any suggestions?<br>

<br>

</div>Extract the text using PDFMiner[1], pyPdf[2], or PageCatcher[3]. Then<br>

use something similar to n-grams on the extracted text, filtering out<br>

those that don't contain "determinant(s)". Then just keep a word<br>

frequency table for the remaining n-grams.<br>

<br>

Not-quite-pseudo-code:<br>

from collections import defaultdict, deque<br>

N = 7 # length of n-grams to consider; tune as needed<br>

# placeholder input; substitute the words extracted from the PDF text<br>

words_from_pdf = "elevation is the most important determinant".split()<br>

buf = deque(maxlen=N)<br>

targets = frozenset(("determinant", "determinants"))<br>

steps_until_gone = 0<br>

word2freq = defaultdict(int)<br>

for word in words_from_pdf:<br>

    if word in targets:<br>

        steps_until_gone = N<br>

    buf.append(word)<br>

    if steps_until_gone:<br>

        for related_word in buf:<br>

            if related_word not in targets:<br>

                word2freq[related_word] += 1<br>

        steps_until_gone -= 1<br>

# least frequent first; items() works on both Python 2 and 3<br>

for count, word in sorted((v,k) for k,v in word2freq.items()):<br>

    print("%s : %d" % (word, count))<br>

<br>

Making this more efficient and less naive is left as an exercise to the reader.<br>

There may very well already be something similar but more<br>

sophisticated in NLTK[4]; I've never used it, so I dunno.<br>
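<br>

One possible less naive variant, as a rough sketch only: look at a fixed<br>

window of words on each side of every target occurrence, so that each<br>

neighbouring word is counted once per occurrence instead of once per step.<br>

The function name context_counts, the window size, and the sample text are<br>

assumptions made for illustration, not part of any library.<br>

```python
import re
from collections import defaultdict


def context_counts(text, targets, window=3):
    """Count words that occur within `window` positions of a target word."""
    # Crude tokenizer (an assumption): lowercase runs of ASCII letters only.
    words = re.findall(r"[a-z]+", text.lower())
    counts = defaultdict(int)
    for i, word in enumerate(words):
        if word not in targets:
            continue
        start = max(0, i - window)
        # Words just before and just after this occurrence of the target.
        neighbours = words[start:i] + words[i + 1:i + 1 + window]
        for neighbour in neighbours:
            if neighbour not in targets:
                counts[neighbour] += 1
    return counts


# Made-up sample text standing in for the text extracted from a PDF.
text = ("Elevation is the most important determinant. "
        "Slope is a weaker determinant.")
counts = context_counts(text, frozenset(("determinant", "determinants")))

# Most frequent neighbours first, ties broken alphabetically.
for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print("%s : %d" % (word, count))
```

A larger window catches words further from the target (such as "elevation"<br>

in the sample) at the cost of more noise from unrelated words.<br>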

<br>

[1]: <a href="http://www.unixuser.org/~euske/python/pdfminer/index.html" target="_blank">http://www.unixuser.org/~euske/python/pdfminer/index.html</a><br>

[2]: <a href="http://pybrary.net/pyPdf/" target="_blank">http://pybrary.net/pyPdf/</a><br>

[3]: <a href="http://www.reportlab.com/software/#pagecatcher" target="_blank">http://www.reportlab.com/software/#pagecatcher</a><br>

[4]: <a href="http://www.nltk.org/" target="_blank">http://www.nltk.org/</a><br>

<br>

Cheers,<br>

Chris<br>

<font color="#888888">--<br>

<a href="http://rebertia.com" target="_blank">http://rebertia.com</a><br>

</font></blockquote></div><br></div>