Sort Big File Help

mk mrkafk at gmail.com
Wed Mar 3 13:20:00 EST 2010


John Filben wrote:
> I am new to Python but have used many other (mostly dead) languages in 
> the past.  I want to be able to process *.txt and *.csv files.  I can 
> now read that and then change them as needed – mostly just take a column 
> and do some if-then to create a new variable.  My problem is sorting 
> these files:
> 
> 1.)    How do I sort file1.txt by position and write out 
> file1_sorted.txt; for example, if all the records are 100 bytes long and 
> there is a three digit id in the position 0-2; here would be some sample 
> data:
> 
> a.       001JohnFilben……
> 
> b.      002Joe  Smith…..

Use a dictionary:

f = open('file1.txt')

linedict = {}
for line in f:
	key = line[:3]
	# or store 'line' itself if you want to keep the key in the line anyway
	linedict[key] = line[3:]

sortedlines = []
# sorted() returns the keys in order; list.sort() sorts in place and returns None
for key in sorted(linedict):
	sortedlines.append(linedict[key])

(untested)

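This is the simplest approach, though probably not the most efficient. It should work as long as the ids are unique: records that share a key overwrite each other in the dictionary.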
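
If several records can share the same id, sorting the lines directly avoids losing any of them. A minimal sketch, assuming the key sits in the first three characters as in the sample data and that the file fits in memory; the names sort_fixed_width and sort_file are made up for illustration:

```python
def sort_fixed_width(lines, start=0, end=3):
    """Return the lines sorted by the characters at positions start..end-1."""
    # sorted() is stable, so records with equal keys keep their input order.
    return sorted(lines, key=lambda line: line[start:end])


def sort_file(in_path, out_path, start=0, end=3):
    """Read in_path, sort its lines by the key slice, write them to out_path."""
    with open(in_path) as src:
        # Make sure every line ends with a newline, so a final unterminated
        # line does not get glued to its neighbour after sorting.
        lines = [line if line.endswith("\n") else line + "\n" for line in src]
    with open(out_path, "w") as dst:
        dst.writelines(sort_fixed_width(lines, start, end))
```

Usage would be something like sort_file("file1.txt", "file1_sorted.txt"). The key is compared as a string, which gives the expected order here only because the ids are zero-padded to the same width.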

> 
> 2.)    How do I sort file1.csv by column name; for example, if all the 
> records have three column headings, “id”, “first_name”, “last_name”; 
>  here would be some sample data:
> 
> a.       Id, first_name,last_name
> 
> b.      001,John,Filben
> 
> c.       002,Joe, Smith

This is more complicated: I would make a list of lines, where each line 
is a list split according to columns (like ['001', 'John', 'Filben']), 
and then I would sort this list using operator.itemgetter, like this:

import operator
lines.sort(key=operator.itemgetter(num))  # num is the column index, starting at 0

Read up on operator.*, it's very useful.
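
To sort by column name rather than by number, the header line can be used to look up the position first. A sketch using the standard csv module, assuming the first line holds the headings; sort_csv_rows is a made-up helper name and the sample text just mirrors the data in the question:

```python
import csv
import io
from operator import itemgetter


def sort_csv_rows(text, column):
    """Sort CSV text by the named column; returns (header, sorted_rows)."""
    # skipinitialspace drops the blank after a comma, as in "002,Joe, Smith".
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    num = header.index(column)             # translate column name to position
    rows = [row for row in reader if row]  # skip blank lines
    rows.sort(key=itemgetter(num))         # compares strings, not numbers
    return header, rows


sample = "id, first_name,last_name\n002,Joe, Smith\n001,John,Filben\n"
header, rows = sort_csv_rows(sample, "last_name")
print(header)
print(rows)
```

For a real file, passing the open file object to csv.reader instead of io.StringIO should work the same way, and csv.writer can write the header and sorted rows back out.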


> 
> 3.)    What about if I have millions of records and I am processing on a 
> laptop with a large external drive – basically, are there space 
> considerations? What are the work arounds.

The simplest is to use something like SQLite: define a table, fill it up, and 
then do a SELECT with ORDER BY.
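
Roughly, with the sqlite3 module from the standard library, that could look like the sketch below. The table name people, the column layout, and the function name sort_with_sqlite are assumptions made for the example:

```python
import sqlite3


def sort_with_sqlite(records, db_path=":memory:"):
    """Load (id, first_name, last_name) tuples into SQLite, return them sorted by id."""
    conn = sqlite3.connect(db_path)
    try:
        # Start from an empty table so reruns do not pile up duplicates.
        conn.execute("DROP TABLE IF EXISTS people")
        conn.execute(
            "CREATE TABLE people (id TEXT, first_name TEXT, last_name TEXT)"
        )
        # executemany accepts any iterable, so records can be a generator
        # that reads the input file line by line.
        conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
        conn.commit()
        cursor = conn.execute(
            "SELECT id, first_name, last_name FROM people ORDER BY id"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

The default ":memory:" keeps the database in RAM; passing a file path on the external drive lets SQLite keep the data on disk instead. With a really large result it is probably better to iterate over the cursor and write rows out one at a time than to call fetchall().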

But with a million records I wouldn't worry about it; it should fit in 
RAM. Observe:

>>> import sys
>>> a = {}
>>> for i in range(1000000):
...     a[i] = 'spam'*10
...
>>> sys.getsizeof(a)
25165960

So that's what, 25 MB? That figure covers only the dict's own table, 
though, not the keys and values it refers to, and I have to note that 
TEMPORARY RAM usage of the Python process on my machine did go up to 
113 MB.
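
sys.getsizeof(a) alone leaves out the keys and values. To include them in a rough estimate, one can add up sys.getsizeof for each distinct object the dict holds. A sketch only: dict_total_size is a made-up helper, it does not follow nested containers, and the numbers will differ between Python versions and platforms:

```python
import sys


def dict_total_size(d):
    """Rough size in bytes of a dict plus the distinct keys and values it holds."""
    seen = set()
    total = sys.getsizeof(d)
    for key, value in d.items():
        for obj in (key, value):
            if id(obj) not in seen:  # count shared objects only once
                seen.add(id(obj))
                total += sys.getsizeof(obj)
    return total


a = {}
for i in range(1000000):
    a[i] = 'spam' * 10
print(sys.getsizeof(a), dict_total_size(a))
```

The second number should come out larger than the first, which is closer to what the process actually needs, although it still ignores allocator overhead.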

Regards,
mk






