[Tutor] need to get unique elements out of a 2.5Gb file

Alan Gauld alan.gauld at freenet.co.uk
Thu Feb 2 10:32:02 CET 2006


Hi,

>  I have a file which is 2.5 Gb.
> 
> There are many duplicate lines.  I wanted to get rid
> of the duplicates.

First, can you use uniq which is a standard Unix/Linux OS command?
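
Note that uniq only removes adjacent duplicates, so you would normally
sort first. A rough sketch at the shell, assuming the same file names
as in your script:

sort mfile | uniq > res
sort -u mfile > res     # or let sort drop the duplicates itself

sort typically spills to temporary files on disk, so it copes with
files bigger than RAM, but the original line order is lost.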

> I chose to parse it to get the unique elements.
> 
> f1 = open('mfile','r')
> da = f1.read().split('\n')

This reads 2.5G of data into memory. Do you have 2.5G of 
available memory?

It then splits it into lines, so why not read the file line by line 
instead?

for da in open('mfile'):
    # process each line (da) here

> dat = da[:-1]

This creates a second list holding nearly all the lines - yet more 
memory on top of what you already have! If you used da = da[:-1] 
you would only keep one version around.

However, if you read the file one line at a time you can put each 
line directly into the Set, which means you never hold the full 
2.5GB in memory - only the unique lines.
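
A rough sketch of that (using the builtin set on Python 2.4+; on 
older Pythons use Set from the sets module, as in your code):

unique = set()                       # Set() on older Pythons
for line in open('mfile'):
    unique.add(line.rstrip('\n'))    # strip the newline so the last line matches too
f2 = open('res', 'w')
for line in unique:
    f2.write(line + '\n')
f2.close()

You still hold every *unique* line in memory, but never the 
duplicates and never the big intermediate list.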

> f2 = open('res','w')
> dset = Set(dat)
> for i in dset:
>    f2.write(i)
>    f2.write('\n')

f2.write(i+'\n')

should be slightly faster, and with a data set this size that 
probably makes a visible difference!

> Problem: Python says it cannot handle such a large
> file. 

That's probably not a Python issue but an available-RAM issue.
Your code doesn't need the entire file in RAM though, so just read 
one line at a time and avoid building the list.
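
A variation that also writes the result as it goes, so the lines keep 
their original order and only the distinct lines are ever held in 
memory (a sketch, assuming the same 'mfile' and 'res' names):

seen = set()                 # one copy of every line seen so far
out = open('res', 'w')
for line in open('mfile'):
    if line not in seen:
        seen.add(line)
        out.write(line)      # written the first time the line appears
out.close()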

If it's still too big you can try batching the operations: process, 
say, half the lines in the file at a time, then merge the resulting 
reduced files. The key point is that, without resorting to much more 
sophisticated algorithms, you must at some point hold the final data 
set in RAM; if that is too big the program will fail.
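
One way to picture that batching - every file name here is invented 
for the example, and the original file is assumed to have been split 
in two beforehand:

def dedupe(sources, dst):
    # stream several input files into one output, keeping each
    # distinct line once (same idea as the loop above)
    seen = set()
    out = open(dst, 'w')
    for src in sources:
        for line in open(src):
            if line not in seen:
                seen.add(line)
                out.write(line)
    out.close()

dedupe(['mfile_part1'], 'reduced1')   # first half of the data
dedupe(['mfile_part2'], 'reduced2')   # second half

# the reduced files are smaller, but their combined unique lines
# must still fit in RAM for this final merge
dedupe(['reduced1', 'reduced2'], 'res')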

A final strategy is to sort the file (which can be 
done - slowly! - in batches) and remove duplicate lines 
afterwards, or even as part of the sort... But if you need 
to go that far, come back for more details.
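
For what it's worth, a hedged sketch of that sort-in-batches idea. 
The chunk size and temporary file names are made up, and heapq.merge 
only exists in newer Pythons; older ones would need a hand-rolled 
merge:

import heapq

def write_chunk(lines, name):
    f = open(name, 'w')
    f.writelines(sorted(lines))          # sort this chunk in memory
    f.close()

# pass 1: split the big file into sorted chunks that fit in RAM
chunk_size = 1000000                     # lines per chunk - tune to your memory
chunk_names = []
lines = []
for line in open('mfile'):
    lines.append(line)
    if len(lines) >= chunk_size:
        name = 'chunk%d.tmp' % len(chunk_names)
        write_chunk(lines, name)
        chunk_names.append(name)
        lines = []
if lines:
    name = 'chunk%d.tmp' % len(chunk_names)
    write_chunk(lines, name)
    chunk_names.append(name)

# pass 2: merge the sorted chunks; duplicates are now adjacent,
# so remembering only the previous line is enough to drop them
out = open('res', 'w')
previous = None
for line in heapq.merge(*[open(n) for n in chunk_names]):
    if line != previous:
        out.write(line)
    previous = line
out.close()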

HTH,

Alan G.

