[Tutor] need to get unique elements out of a 2.5Gb file

Rinzwind w.damen at gmail.com
Thu Feb 2 09:59:57 CET 2006

I'd use a database if I were you.
Install, for instance, MySQL or MudBase or something like that and
(using Python if need be) insert the lines into the database. Storing
only unique lines would be fairly easy.
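
A rough sketch of the database idea, in case it helps. It uses the
sqlite3 module from Python's standard library as a stand-in for MySQL,
so nothing has to be installed. The function name, the table name and
the paths are made up for illustration, and it assumes the file is
UTF-8 text:

```python
import sqlite3

def unique_lines_via_db(src_path, db_path, out_path):
    """Copy the distinct lines of src_path to out_path using a database."""
    conn = sqlite3.connect(db_path)
    # The PRIMARY KEY constraint makes the database reject duplicate lines.
    conn.execute("CREATE TABLE IF NOT EXISTS lines (line TEXT PRIMARY KEY)")
    with open(src_path, "r", encoding="utf-8") as src:
        # The generator streams the file, so it is never fully in memory.
        conn.executemany(
            "INSERT OR IGNORE INTO lines (line) VALUES (?)",
            ((line.rstrip("\n"),) for line in src),
        )
    conn.commit()
    # Write the surviving (unique) lines back out, one per line.
    with open(out_path, "w", encoding="utf-8") as out:
        for (line,) in conn.execute("SELECT line FROM lines"):
            out.write(line + "\n")
    conn.close()
```

The output order is whatever the database returns, so don't rely on it
matching the order of the original file.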

Other solution (using Python):
If you must use Python I'd suggest making new, smaller files.
How about making files that are named after the 1st character of each
line you find, splitting your file up into as many parts as your lines
start with unique characters? Identical lines always land in the same
file, so each small file can be checked on its own.

You end up with lots of smaller files that just need to be merged
together (could be done with 'cat' I think?) or you could read all
those files and toss them back into 1 big file.

Something like this:

Read the 2.5 GB master file
Read lines 1 by 1
Make a new file named "1st char of line found".txt if it doesn't
exist and add the new line
otherwise scan this file to see whether the line is already there and,
if it isn't, add it.

when done: merge all files.
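
The steps above could look roughly like this in Python. It is only a
sketch and differs from the description in two places: instead of
rescanning a bucket file for every new line, it writes every line to
its bucket first and then removes duplicates per bucket with a set,
and it names the files after the character's code point (e.g. 97.txt
for "a") so that characters like "/" don't break the file name. The
function name and paths are made up. It assumes UTF-8 text, a modest
number of distinct first characters (one open file each) and that the
largest bucket fits in memory:

```python
import os

def unique_lines_by_bucket(src_path, work_dir, out_path):
    """Write the distinct lines of src_path to out_path via bucket files."""
    os.makedirs(work_dir, exist_ok=True)
    buckets = {}  # first character -> open bucket file
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            for line in src:
                line = line.rstrip("\n")
                key = line[:1]  # "" for an empty line
                if key not in buckets:
                    # Name the bucket by code point so any character is safe.
                    name = "%d.txt" % ord(key) if key else "empty.txt"
                    path = os.path.join(work_dir, name)
                    buckets[key] = open(path, "w+", encoding="utf-8")
                buckets[key].write(line + "\n")
        # Duplicates always share a first character, so each bucket can be
        # de-duplicated on its own and then appended to the merged result.
        with open(out_path, "w", encoding="utf-8") as out:
            for key in sorted(buckets):
                bucket = buckets[key]
                bucket.seek(0)  # rewind to read back what was written
                seen = set()
                for line in bucket:
                    if line not in seen:
                        seen.add(line)
                        out.write(line)
    finally:
        for bucket in buckets.values():
            bucket.close()
```

The bucket files are left behind in work_dir afterwards, so delete
them once you are happy with the merged file.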

Can't be too hard to pull off ;)
