[Tutor] Extracting words..
dman
dsh8290@rit.edu
Sun, 24 Mar 2002 15:04:43 -0600
On Sun, Mar 24, 2002 at 09:39:18PM +0100, Nicole Seitz wrote:
I'll answer the first part since that is easy :-).
| -->returns a list containing such words, e.g.
|
| ['JavaScript', 'MacWeek', 'MacWeek', 'CompuServe', 'CompuServe',
| 'CompuServe', 'CompuServe', 'CompuServe', 'SysOps', 'SysOps', 'CompuServe',
| 'CompuServe', 'CompuServe', 'InterBus', 'NeuroVisionen', 'NeuroVisionen',
| 'InterBus']
|
| My first question:
|
| What do I have to do that each word appears only once in the list,i.e. is
| found only once??
Strings are hashable, and a dict can hold each key only once. In
addition, key lookups are fast. This will filter out duplicates:
    l = ['JavaScript', 'MacWeek', 'MacWeek', 'CompuServe', 'CompuServe']
    d = {}
    for item in l:
        d[item] = None      # only the key matters
    l = list(d.keys())      # unique words; key order may differ from the input
    print(l)
The value associated with the key is irrelevant in this usage. You
could also do it with lists, but the execution time grows
quadratically with the data size:
    l = ['JavaScript', 'MacWeek', 'MacWeek', 'CompuServe', 'CompuServe']
    m = []
    for item in l:
        if item not in m:
            m.append(item)
    l = m
    print(l)
The two problems with this approach are:
    o Each "item not in m" test is a linear search through m to see
      whether the item is a duplicate.
    o Lists are implemented as a C array. When the array runs out of
      space, a new chunk of memory must be allocated and all the old
      data copied over. You don't need to worry about this in your
      code (the interpreter takes care of it), but it does affect the
      performance of certain algorithms.
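To see how the cost of the linear search grows, here is a rough sketch
that counts the element comparisons the list approach makes. The
function name count_list_comparisons is my own invention for
illustration, and it spells out the "not in" test as an explicit loop
so the comparisons can be counted. For n distinct items the count
works out to n*(n-1)/2.

```python
def count_list_comparisons(items):
    """Count the element comparisons made by the list-based filter."""
    seen = []
    comparisons = 0
    for item in items:
        # Spell out `item not in seen` so each comparison can be counted.
        found = False
        for other in seen:
            comparisons += 1
            if other == item:
                found = True
                break
        if not found:
            seen.append(item)
    return comparisons

# Worst case: every item is distinct, so each search scans the whole list.
for n in (10, 100, 1000):
    print(n, count_list_comparisons(range(n)))
```

Ten times the data costs roughly a hundred times the comparisons,
whereas the dict version does one key assignment per item.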
To briefly touch on your second question, I recommend modifying your
method to build a dict and return it instead of a list. You can store
whatever data you like (line numbers, etc.) as the value in the dict.
This will reduce the overall overhead since you won't be first
building a list with duplicates, then building a dict to eliminate
them, then converting it back to a list.
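As a minimal sketch of that idea: I haven't seen your method, so the
function name find_mixed_case_words, the regular expression, and the
sample text below are all assumptions of mine, not your code. It
builds the dict directly, with each word as a key and the list of line
numbers where it was found as the value.

```python
import re

# Assumed pattern: a capital, some lowercase letters, then another
# capital, e.g. 'JavaScript'. Adjust it to whatever your method matches.
MIXED_CASE = re.compile(r"\b[A-Z][a-z]+[A-Z][A-Za-z]*\b")

def find_mixed_case_words(lines):
    """Map each mixed-case word to the line numbers where it occurs."""
    found = {}
    for lineno, line in enumerate(lines, 1):
        for word in MIXED_CASE.findall(line):
            # setdefault creates the list the first time a word is seen
            found.setdefault(word, []).append(lineno)
    return found

# Made-up sample text, for illustration only.
text = [
    "CompuServe and MacWeek reported on JavaScript.",
    "The SysOps at CompuServe disagreed.",
]
words = find_mixed_case_words(text)
print(sorted(words))          # each word appears once, as a key
print(words["CompuServe"])    # line numbers where it was found
```

If you still need a plain list of the unique words afterwards, the
dict's keys give you that.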
HTH,
-D
--
The truly righteous man attains life,
but he who pursues evil goes to his death.
Proverbs 11:19