access to some text string in PDFs

Chris Rebert clp2 at rebertia.com
Thu May 5 18:29:09 EDT 2011


On Thu, May 5, 2011 at 2:26 PM, Robert Pazur <pazurrobert at gmail.com> wrote:
> Dear all,
> I would like to access some text and count word occurrences, as follows.
> I have a lot of PDFs of scientific articles, and I want to see which
> words are usually related to, for example, "determinants".
> As an example, one article contains the sentence
> "....elevation is the most important determinant....".
> How can I acquire the "elevation" string?
> Of course, I don't know where the sentence is located in the article or
> which particular word could be there.
> Any suggestions?

Extract the text using PDFMiner[1], pyPdf[2], or PageCatcher[3]. Then
use something similar to n-grams on the extracted text, filtering out
those that don't contain "determinant(s)". Then just keep a word
frequency table for the remaining n-grams.
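
Whichever extractor you pick, you end up with one big string that has to
be split into words first. A minimal sketch of that step, assuming the
text is already in a plain string (the names `tokenize` and
`extracted_text` are made up for illustration and are not part of any of
the libraries above):

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic words, in order."""
    # After lowercasing, [a-z]+ keeps runs of ASCII letters only, so
    # digits and punctuation are dropped and hyphenated words are split.
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical stand-in for the text extracted from one PDF.
extracted_text = "Elevation is the most important determinant."
words_from_pdf = tokenize(extracted_text)
print(words_from_pdf)
```

Lowercasing means "Determinant" at the start of a sentence still matches
the lowercase target words; non-ASCII letters would need a wider pattern.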

Not-quite-pseudo-code:
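from collections import defaultdict, deque

# words_from_pdf: any iterable of words taken from the extracted text
N = 7  # length of n-grams to consider; tune as needed
buf = deque(maxlen=N)  # sliding window over the last N words
targets = frozenset(("determinant", "determinants"))
steps_until_gone = 0  # steps left until the last target leaves the window
word2freq = defaultdict(int)
for word in words_from_pdf:
    if word in targets:
        steps_until_gone = N
    buf.append(word)
    if steps_until_gone:
        # Naive: the whole window is re-counted on every step, so words
        # that stay in the window alongside a target are counted repeatedly.
        for related_word in buf:
            if related_word not in targets:
                word2freq[related_word] += 1
        steps_until_gone -= 1
# Print the table, least frequent first (Python 3: items(), not iteritems()).
for count, word in sorted((v, k) for k, v in word2freq.items()):
    print(word, ':', count)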

Making this more efficient and less naive is left as an exercise to the reader.
There may very well already be something similar but more
sophisticated in NLTK[4]; I've never used it, so I dunno.
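
One possible take on the "less naive" part, as a sketch only: count each
neighbouring word once per occurrence of a target instead of re-counting
the whole window on every step. It assumes the words fit in a list in
memory; `related_words` and `radius` are names invented here, not
anything from NLTK or the PDF libraries.

```python
from collections import Counter

def related_words(words, targets, radius=3):
    """Count non-target words within `radius` positions of a target word."""
    counts = Counter()
    for i, word in enumerate(words):
        if word not in targets:
            continue
        # Clamp the window to the bounds of the list.
        lo = max(0, i - radius)
        hi = min(len(words), i + radius + 1)
        for neighbor in words[lo:hi]:
            if neighbor not in targets:
                counts[neighbor] += 1
    return counts

words = "elevation is the most important determinant of species richness".split()
targets = frozenset(("determinant", "determinants"))
for word, count in related_words(words, targets, radius=5).most_common():
    print(word, ':', count)
```

A larger radius catches words like "elevation" that sit far from the
target, at the cost of more noise from common words such as "is" and
"the", so some stop-word filtering is probably worth adding.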

[1]: http://www.unixuser.org/~euske/python/pdfminer/index.html
[2]: http://pybrary.net/pyPdf/
[3]: http://www.reportlab.com/software/#pagecatcher
[4]: http://www.nltk.org/

Cheers,
Chris
--
http://rebertia.com


