[Tutor] Python

Thu Dec 20 17:34:06 EST 2018

On Thu, Dec 20, 2018 at 10:49:25AM -0500, Mary Sauerland wrote:

> I want to get rid of words that are less than three characters

> f1_name = "/Users/marysauerland/Documents/file1.txt"
> #the opinions
> f2_name = "/Users/marysauerland/Documents/file2.txt"
> #the constitution

Better than comments are meaningful variable names:

opinions_filename = "/Users/marysauerland/Documents/file1.txt"
constitution_filename = "/Users/marysauerland/Documents/file2.txt"

> def read_words(words_file):
>     return [word.upper() for line in open(words_file, 'r') for word in line.split()]

Don't try to do too much in a single line of code. While technically 
that should work (I haven't tried it to see that it actually does) it 
would be better written as:

def read_words(words_file):
    with open(words_file, 'r') as f:
        return [word.upper() for line in f for word in line.split()]

This also has the advantage of ensuring that the file is closed as soon 
as the words have been read. In your earlier version the file is never 
explicitly closed, so it may stay open until the interpreter gets around 
to cleaning it up.
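
If you want to convince yourself that the with block really does close 
the file, here is a minimal sketch. The temporary file and its contents 
are made up purely for the demonstration; they are not your real data.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("We the People\nof the United States\n")
    sample_path = tmp.name

with open(sample_path, "r") as f:
    words = [word.upper() for line in f for word in line.split()]

# Once the with block has finished, the file has been closed for us.
print(f.closed)   # True
print(words)

# Tidy up the temporary file.
os.remove(sample_path)
```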

Note that in this case Python's definition of "word" may not agree with 
the human reader's definition of a word. For example, Python, being 
rather simple-minded, will include punctuation in words so that 

"HELLO"
"HELLO."

count as different words. For now, let's just go with the simple-minded 
definition of a word, and worry about adjusting it to something more 
specialised later.
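
To see the effect, and one possible adjustment, here is a small sketch. 
The sample line is invented, and stripping punctuation with 
string.punctuation is only one option I am assuming for illustration; it 
only trims the ends of each word, so "DON'T" keeps its apostrophe.

```python
import string

line = "Hello. Hello, hello"

# Splitting on whitespace leaves the punctuation attached to the words.
raw_words = [word.upper() for word in line.split()]
print(len(set(raw_words)))    # 3 different "words"

# One possible adjustment (an assumption, not part of the original code):
# strip leading and trailing punctuation from each word.
clean_words = [word.strip(string.punctuation) for word in raw_words]
print(len(set(clean_words)))  # 1 word
```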

> read_words(f1_name)
> #performs the function on the file

The above line of code (and its comment) is pointless. The function is 
called, the file is read, the list of words is built, and then it is 
immediately thrown away. To use the words, you need to assign them to a 
variable, as you do below:

> set1 = set(read_words(f1_name))
> #makes each word into a set and removes duplicate words

A meaningful name is better. Also the comment is inaccurate: it is not 
that *each individual* word is turned into a set, but that the *list* of 
all the words is turned into a set. So better would be:

opinions_words = set(read_words(opinions_filename))
constitition_words = set(read_words(constitution_filename))
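
As a quick illustration of what set() does to a list of words, here is 
a tiny sketch using a made-up list rather than your files:

```python
# Made-up word list; note that "THE" appears twice.
words = ["WE", "THE", "PEOPLE", "OF", "THE", "UNITED", "STATES"]

# The whole list becomes one set, and the duplicate "THE" disappears.
unique_words = set(words)

print(len(words))             # 7
print(len(unique_words))      # 6
print("THE" in unique_words)  # True
```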

This gives us the perfect opportunity to skip short words:

opinions_words = set(
    word for word in read_words(opinions_filename) if len(word) >= 3)
constitition_words = set(
    word for word in read_words(constitution_filename) if len(word) >= 3)

Now you have two sets of unique words, each word guaranteed to be at 
least 3 characters long.
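
Here is the same filtering idea on a made-up list, so you can try it 
without touching your files. The sample words are my own invention.

```python
# Made-up word list containing some words shorter than three characters.
words = ["WE", "THE", "PEOPLE", "OF", "THE", "UNITED", "STATES", "A"]

# Keep only the words with at least three characters.
long_words = set(word for word in words if len(word) >= 3)

# Sets have no fixed order, so sort before printing.
print(sorted(long_words))  # ['PEOPLE', 'STATES', 'THE', 'UNITED']
```

You could also spell this as a set comprehension, 
{word for word in words if len(word) >= 3}, which gives the same result.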

The next thing you try to do is count how many words appear in both 
sets. You do it with a loop and a membership test:

> count_same_words = 0
> for word in set1:
>     if word in set2:
>         count_same_words += 1

but the brilliant thing about sets is that they already know how to do 
this themselves! Let's see the sorts of operations sets understand:

py> set1 = set("abcdefgh")
py> set2 = set("defghijk")
py> set1 & set2  # the intersection (overlap) of both sets
{'h', 'd', 'f', 'g', 'e'}
py> set1 | set2  # the union (combination) of both sets
{'f', 'd', 'c', 'b', 'h', 'i', 'k', 'j', 'a', 'g', 'e'}
py> set1 ^ set2  # items in one or the other but not both sets
{'i', 'k', 'c', 'b', 'j', 'a'}
py> set1 - set2  # items in set1 but not set2
{'c', 'b', 'a'}

(In the above, "py>" is the Python prompt. On your computer, your prompt 
is probably set to ">>>".)
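
Each of these operators also has a named method, which some people find 
easier to read. A short sketch using the same two example sets; the 
order in which a set displays its items is arbitrary, so what you see 
may differ from the output above, which is why this sorts before 
printing.

```python
set1 = set("abcdefgh")
set2 = set("defghijk")

# Each operator has an equivalent method.
assert set1.intersection(set2) == set1 & set2
assert set1.union(set2) == set1 | set2
assert set1.symmetric_difference(set2) == set1 ^ set2
assert set1.difference(set2) == set1 - set2

# Sorting gives a stable order for display.
print(sorted(set1 & set2))  # ['d', 'e', 'f', 'g', 'h']
print(sorted(set1 - set2))  # ['a', 'b', 'c']
```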

Can you see which set operation, one of & | ^ or - , you would use to 
get the set of words which appear in both sets? Hint: it isn't the - 
operation. If you wanted to know how many words appear in the 
constitution but NOT in the opinions, you could write:

word_count = len(constitition_words - opinions_words)
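
To see that difference count at work on something small, here is a 
sketch with two invented sets standing in for the real word sets:

```python
# Invented stand-ins for the real sets of words.
constitution_sample = {"CONGRESS", "SENATE", "PRESIDENT", "COURT"}
opinions_sample = {"COURT", "JUSTICE", "CONGRESS"}

# Words in the constitution sample but NOT in the opinions sample.
only_in_constitution = constitution_sample - opinions_sample

print(sorted(only_in_constitution))  # ['PRESIDENT', 'SENATE']
print(len(only_in_constitution))     # 2
```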

Does that give you a hint how to approach this?

Steve