Getting word frequencies from files which are in folder.

Alex Martelli aleax at mac.com
Thu Apr 5 06:37:55 CEST 2007


<krisbee1983 at gmail.com> wrote:

> > This sounds suspiciously like a homework assignment.
> > I don't think you'll get much help for this one, unless
> > you show some code you wrote yourself already with a specific
> > question about problems you're having....
> 
> Well, you are partly right. I will make it more specific.
> I have got something like this:
> 
> import os, os.path
> 
> def wyswietlanie_drzewa(dir_path):
>     # Reads folders and subfolders recursively, printing each path.
>     for name in os.listdir(dir_path):
>         full_path = os.path.join(dir_path, name)
>         print full_path
>         if os.path.isdir(full_path):
>             wyswietlanie_drzewa(full_path)
> 
> My question is how to get word frequencies from these files?
> I will be glad to get any help.

You may want to consider os.walk as an alternative way to get all files;
it's easy to wrap it into a generator yielding all files in the subtree.
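
As a rough sketch of that wrapping (the name all_files is made up for
illustration, and this is only one way to do it), a generator over os.walk
could look like this:

```python
import os

def all_files(root):
    """Yield the full path of every file in the subtree rooted at root."""
    # os.walk visits root and each subfolder in turn, giving the folder
    # path, its subfolder names, and its file names.
    for dir_path, _dir_names, file_names in os.walk(root):
        for name in file_names:
            yield os.path.join(dir_path, name)
```

Unlike the recursive version quoted above, this yields files only and
leaves it to the caller to decide what to do with each path.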

This, I would think, is the proper factoring in Python: have a generator
yielding each file, and a function taking a file and returning the word
frequencies for that one file.  This neatly separates the two halves of
the task -- and you can easily factor things down further...

Given a text file, you can iterate on it: the items are the lines.  Given
a line, you can extract all words in it and iterate on those: look at
the re module, and the \w feature of regular-expression pattern strings.
So, a generator that turns a file into a stream of words is also an easy
sub-task to accomplish.
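
One possible shape for that sub-task, as a sketch only: the names
words_in_lines and WORD_RE are invented here, and lowercasing each word is
an assumption about what "the same word" should mean, not something the
task requires.

```python
import re

# \w+ matches a run of "word" characters (letters, digits, underscore).
WORD_RE = re.compile(r"\w+")

def words_in_lines(lines):
    """Yield each word found in an iterable of lines, such as an open file."""
    for line in lines:
        for word in WORD_RE.findall(line):
            # Assumption: fold case so "The" and "the" count as one word.
            yield word.lower()
```

Because it accepts any iterable of lines, it can be tried on a plain list
of strings before pointing it at real files.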

Given a stream of words, and a set of "interesting words", it's easy to
count the occurrences of interesting words.  There, I'll supply that
part, to entice you to write the others, and thereby perhaps learn some
Python...:

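def count_interesting_words(all_words, interesting_words):
    # Start every interesting word at zero, so absent words still appear.
    d = dict.fromkeys(interesting_words, 0)
    for word in all_words:
        # Words outside the interesting set are skipped.
        if word in d:
            d[word] += 1
    return d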
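
If every word is interesting rather than a fixed set, a different
technique may fit better: on Python 2.7 and later (so not on the versions
current when this was written), collections.Counter does the counting
directly.  A minimal sketch, with count_all_words as a made-up name:

```python
from collections import Counter

def count_all_words(all_words):
    """Return a Counter mapping each word in the stream to its frequency."""
    return Counter(all_words)

freq = count_all_words(["spam", "eggs", "spam"])
# most_common(n) lists the n most frequent words with their counts.
print(freq.most_common(1))
```

A Counter reports 0 for a word it never saw, so no pre-filled dictionary
is needed.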


Alex


