[Tutor] Most common words in a text file

Mark Lawrence breamoreboy at yahoo.co.uk
Sat Sep 30 17:02:31 EDT 2017


On 30/09/2017 18:12, Sri G. wrote:
> I'm learning programming with Python.
> 
> I’ve written the code below for finding the most common words in a text
> file that has about 1.1 million words. It's working fine, but I believe
> there is always room for improvement.
> 
> When run, the function in the script gets a text file from the command-line
> argument sys.argv[1], opens the file in read mode, converts the text to
> lowercase, makes a list of words from the text after removing any
> whitespaces or empty strings, and stores the list elements as dictionary
> keys and values in a collections.Counter object. Finally, it returns a
> dictionary of the most common words and their counts. The
> words.most_common() method gets its argument from the optional top
> parameter.
> 
> import sys
> import collections
> def find_most_common_words(textfile, top=10):
>      ''' Returns the most common words in the textfile.'''
> 
>      textfile = open(textfile)
>      text = textfile.read().lower()
>      textfile.close()

The modern Pythonic way is:-

with open(textfile) as textfile:
     text = textfile.read().lower()

The file is closed automatically for you, even if an exception is
raised while reading.  For those who don't know it, this construct uses
the "with" keyword together with a context manager.  Here's an article
about them:
https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/
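
To see that the "with" block really does close the file, here is a minimal
sketch.  The temporary file, the sample text and the names sample_path,
infile and file_was_closed are made up for illustration so that the snippet
runs on its own; they are not part of the original script.

```python
import os
import tempfile

# Hypothetical sample file, created only so this sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Spam spam EGGS")
    sample_path = tmp.name

# Same pattern as above: read the whole file and lowercase it.
with open(sample_path) as infile:
    text = infile.read().lower()

# Once the with block is left, the file object has been closed.
file_was_closed = infile.closed

# Clean up the temporary sample file.
os.remove(sample_path)
```

Using a different name for the file object (infile here) also avoids
rebinding the textfile parameter, which can make the function a little
easier to follow.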

>      words = collections.Counter(text.split()) # how often each word appears
> 
>      return dict(words.most_common(top))
> 
> filename = sys.argv[1]

How about some error handling if the user forgets the filename?  The 
Pythonic way is to use a try/except looking for an IndexError, but 
there's nothing wrong with checking the length of sys.argv.
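
Both approaches might look roughly like this.  It is only a sketch: the
helper names filename_from_args and filename_from_args_checked and the
wording of the usage message are my own assumptions, not something from
the original script.

```python
import sys


def filename_from_args(argv):
    """Return the first command-line argument (try/except style)."""
    try:
        return argv[1]
    except IndexError:
        # sys.exit() with a string prints it to stderr and exits with status 1.
        sys.exit(f"Usage: {argv[0]} <textfile>")


def filename_from_args_checked(argv):
    """Return the first command-line argument (length-check style)."""
    if len(argv) < 2:
        sys.exit(f"Usage: {argv[0]} <textfile>")
    return argv[1]
```

In the script you would then write filename = filename_from_args(sys.argv)
in place of the bare sys.argv[1] lookup.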

> top_five_words = find_most_common_words(filename, 5)
> 
> I need your comments please.
> 
> Sri

Pretty good all in all :)

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
