[Tutor] sorting a 2 gb file

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Wed Jan 26 02:27:37 CET 2005



On Tue, 25 Jan 2005, Max Noel wrote:

> >> My data set the below is taken from is over 2.4 gb so speed and
> >> memory considerations come into play.
> >>
> >> Are sets more effective than lists for this?
> >
> > Sets or dictionaries make the act of "lookup" of a key fairly cheap.
> > In the two-pass approach, the first pass can use a dictionary to
> > accumulate the number of times a certain record's key has occurred.
> >
> > Note that, because your file is so large, the dictionary probably
> > shouldn't accumulate the whole mass of information that we've seen
> > so far: instead, it's sufficient to record the information we need to
> > recognize a duplicate.
>
> 	However, the first pass will consume a lot of memory. Considering
> the worst-case scenario where each record only appears once, you'll find
> yourself with the whole 2GB file loaded into memory.
> 	(or do you have a "smarter" way to do this?)


Hi Max,

My assumptions are that each record consists of some identifying string
"key" that's associated with some "value".  How are we deciding that two
records are talking about the same thing?


I'm hoping that the set of unique keys isn't itself very large.  Under
this assumption, we can do something like this:

###
def firstPass(f):
    """Returns the set of keys that occur more than once in f."""
    seenKeys = set()
    duplicateKeys = set()
    for record in f:
        # extractKey() stands in for whatever pulls the identifying
        # "key" portion out of a record.
        key = extractKey(record)
        if key in seenKeys:
            duplicateKeys.add(key)
        else:
            seenKeys.add(key)
    return duplicateKeys
###

where we store only the 'key' portion of each record in memory, rather
than the whole record.

If the number of unique keys is small enough, this should be sufficient
to recognize duplicate records.  On the second pass through the file, we
can then display the duplicate records on the fly, as in the sketch
below.  If this assumption doesn't hold, then we need to do something
else.  *grin*
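
For instance, the second pass might look something like this, where
extractKey() is again the hypothetical placeholder that pulls the key
out of a record:

###
def secondPass(f, duplicateKeys):
    """Prints every record in f whose key occurred more than once."""
    for record in f:
        if extractKey(record) in duplicateKeys:
            print(record.rstrip())
###

Putting the two passes together, assuming one record per line in a
made-up file 'huge_data_file':

###
f = open('huge_data_file')
duplicates = firstPass(f)
f.seek(0)                      # rewind for the second pass
secondPass(f, duplicates)
f.close()
###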

One possibility might be to implement an external sorting mechanism:

    http://www.nist.gov/dads/HTML/externalsort.html
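
Just to sketch the idea: break the file into chunks that do fit in
memory, sort each chunk, write it out to a temporary file, and then
merge the sorted chunks.  Something like this might work, assuming a
fairly modern Python with heapq.merge() and one record per line:

###
import heapq
import itertools
import tempfile

def externalSort(inputFile, outputFile, linesPerChunk=100000):
    """Sorts the lines of inputFile into outputFile, keeping at most
    linesPerChunk lines in memory at any one time."""
    chunkFiles = []
    while True:
        chunk = list(itertools.islice(inputFile, linesPerChunk))
        if not chunk:
            break
        chunk.sort()
        chunkFile = tempfile.TemporaryFile(mode='w+')
        chunkFile.writelines(chunk)
        chunkFile.seek(0)
        chunkFiles.append(chunkFile)
    # heapq.merge() merges the already-sorted chunks lazily, so the
    # final merge also stays within the memory budget.
    outputFile.writelines(heapq.merge(*chunkFiles))
    for chunkFile in chunkFiles:
        chunkFile.close()
###

Once the file is sorted, records with the same key sit on adjacent
lines, so picking out the duplicates takes just one more streaming pass.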


But if we're willing to do an external sort, then we're already doing
enough work that we should seriously consider using a DBMS.  The more
complicated the data management becomes, the more attractive it is to
hand the job to a real database: finding duplicate records is a problem
that database management systems have already solved.
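
For example, modern Pythons bundle the sqlite3 module, and SQLite keeps
its table on disk, so a 2 GB data set isn't a problem.  A minimal
sketch, where the database filename and extractKey() are again made up:

###
import sqlite3

def findDuplicates(recordFile):
    """Loads the records into an on-disk SQLite table and lets the
    database pick out the duplicated keys."""
    conn = sqlite3.connect('records.db')    # on-disk scratch database
    conn.execute("CREATE TABLE records (key TEXT, record TEXT)")
    conn.executemany("INSERT INTO records VALUES (?, ?)",
                     ((extractKey(line), line) for line in recordFile))
    conn.commit()
    query = """SELECT record FROM records
               WHERE key IN (SELECT key FROM records
                             GROUP BY key HAVING COUNT(*) > 1)"""
    for (record,) in conn.execute(query):
        print(record.rstrip())
    conn.close()
###

An index on the key column would speed up the duplicate query, but even
without one, the database takes care of the memory management that we
were doing by hand.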


Talk to you later!


