[Tutor] Extracting words.. [sort | uniq]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Sun, 24 Mar 2002 13:48:23 -0800 (PST)


> | My first question:
> |
> | What do I have to do so that each word appears only once in the list,
> | i.e. is found only once?
>
> Strings are hashable objects, and a dict can have each key occur only
> once.  In addition, key lookups are fast.  This will filter out
> duplicates:
>
> l = ['JavaScript', 'MacWeek', 'MacWeek', 'CompuServe', 'CompuServe' ]
> d = {}
> for item in l :
>     d[item] = None
> l = d.keys()
> print l
>
> The value associated with the key is irrelevant in this usage.  You
> could also do it with lists, but the execution time grows
> quadratically with the data size, since each membership test scans
> the whole list:
>
> l = ['JavaScript', 'MacWeek', 'MacWeek', 'CompuServe', 'CompuServe' ]
> m = []
> for item in l :
>     if item not in m :
>         m.append( item )
> l = m
> print l
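As a small aside on the dict approach above: it loses the original order of
the words, since dictionary keys come back in no particular order.  If that
matters, a slight variant keeps the first-seen order while still using the
dict for fast membership tests.  (The function name uniq_keep_order is just
my own choice here.)

```python
def uniq_keep_order(mylist):
    """Return the unique elements of mylist, preserving first-seen
    order, without modifying mylist."""
    seen = {}
    results = []
    for item in mylist:
        if item not in seen:       # fast dict lookup, not a list scan
            seen[item] = None      # the value is irrelevant, as above
            results.append(item)
    return results
```

For example, uniq_keep_order(['JavaScript', 'MacWeek', 'MacWeek',
'CompuServe', 'CompuServe']) gives ['JavaScript', 'MacWeek', 'CompuServe'].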


To offer a counterpoint: the list structure doesn't stop us from finding
unique elements effectively.  We can also do this uniqueness filter by
first sorting the words.  Sorting brings all the duplicates right next to
each other.  Once we have the sorted list, picking out the unique members
is straightforward:

###
def uniq(mylist):
    """Returns a unique list of elements in mylist.

    Warning: this function also has a side effect of sorting mylist."""
    mylist.sort()
    if not mylist: return mylist
    results = [mylist[0]]
    for i in range(1, len(mylist)):
        if mylist[i-1] != mylist[i]:
            results.append(mylist[i])
    return results
###


Here's a test drive:

###
>>> words = """hello this is a test of the emergency
... broadcast system this is only a test""".split()
>>> uniq(words)
['a', 'broadcast', 'emergency', 'hello', 'is',
 'of', 'only', 'system', 'test', 'the', 'this']
###


People who run Unix will recognize this as the "sort | uniq" approach.


For people who are interested: here's a similar problem: "Given a text
file and an integer K, you are to print the K most common words in the
file (and the number of their occurrences) in decreasing frequency."

It's a fun problem to work on, and useful when one is writing an essay and
wants to avoid overusing words.  *grin*
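In case anyone wants a nudge on that exercise, here is one rough sketch
that counts words with a plain dictionary and then sorts the (count, word)
pairs.  The name top_k_words is just my own choice, and ties end up in
reverse alphabetical order; there are certainly other ways to slice it.

```python
def top_k_words(text, k):
    """Return the k most common words in text as (count, word)
    pairs, in decreasing order of frequency."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    # Flip each (word, count) into (count, word) so that sorting
    # orders by frequency first.
    pairs = [(count, word) for (word, count) in counts.items()]
    pairs.sort()
    pairs.reverse()        # biggest counts first
    return pairs[:k]
```

For example, running it over the emergency-broadcast sentence above with
k=2 picks out 'this' and 'test', each of which appears twice.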


Hope this helps!