[Tutor] Question about nested FOR looping
Michael Janssen
Janssen@rz.uni-frankfurt.de
Wed Apr 30 15:25:02 2003
On Wed, 30 Apr 2003 stuart_clemons@us.ibm.com wrote:
> So, as a follow-up out of curiosity, which is the most efficient:
>
> 1) Read the two files into lists and then process the lists, or
with (vary) large files this is forbidden because of memory-consumption.
With reasonable small files it can be handy espessially if you need random
access (not looping line-by-line).
> 2) Process directly from the file reads, using seek in the nested FOR, or
*can* be good for memory. NB: "for line in fo.readlines():" build a
(possibly) huge list of lines within memory. fo.xreadlines() is better or
recent-version-of-python "for line in fo:"
> 3) Read one file into a list and process during the read of the second
> file ?
depends on wich is the big file.
The problem with all these approaches is, that it iter over len(list1) x
len(list2). With large lists you will have many comparisions (comparing
e.g. 10,000 x 10,000 entries *this way* can easily last for some minutes).
Since dictinary lookup is fast it is much better to store the lines of
first file in a dictionary and look for each line of the second file if
the dictionary "has_key(line)" and report the duplicates.
> As some added info, Scott Widney from tutor@python.org, provided a
> dictionary construct for this same problem, ie, compare two lists,
> identifying the duplicates. I would guess that the dictionary construct
> would be the most efficient method, based on random vs sequential
> ordering. Is this right ?
Yes, because dictionary lookup is fast because of random ordering.
say len(list1) --> 10
len(list2) --> 20
for line in list1: # iters 10 times
for line2 in list2: # iters 10 x 20 times
if line == line2: # 200 comparisions
pass
dict1 = dict([(x, 1) for x in list1]) # builds a dictionary from list1
for line in list2: # iters 20 times
if dict1.has_key(line): # 20 lookups
pass
NB:
for line in list1: # iters 10 times
if line in list1: # 10 lookups
pass
is possibly slower since looups in lists are slow. Could well be that this
is as slow as the dubble-for-loop solution in case the "if line in list1"
lookup works internally as a iteration through each element.
Michael