[Tutor] Reading/dealing/matching with truly huge (ascii) files

Wed Feb 22 06:44:57 CET 2012

So, Python question of the day: I have 2 files that I could normally just
read in with asciitable. The first file is a 12 column, 8000 row table that
I have read in via asciitable and manipulated. The second file is
enormous: it has over 50,000 rows and about 20 columns. What I want to do is
find the best match for (file 1 columns 1 and 2) with (file 2 columns 4 and
5), return all rows that match from the huge file, join them together and
save the whole mess as a file with 8000 rows (assuming the smaller table
finds one match per row) and 32=12+20 columns. So my read code so far is
as follows:
-------------------------------------------------
import sys
import asciitable
import matplotlib
import scipy
import numpy as np
from numpy import *
import math
import pylab
import random
from pylab import *
import astropysics
import astropysics.obstools
import astropysics.coords

x=small_file
#cannot read blank values (string!) if blank insert -999.99
dat=asciitable.read(x,Reader=asciitable.CommentedHeader,
fill_values=['','-999.99'])
y=large_file
fopen2=open('cfile2match.list','w')
dat2=asciitable.read(y,Reader=asciitable.CommentedHeader,
fill_values=['','-999.99'])
#here are the 2 values for the small file
Radeg=dat['ra-drad']*180./math.pi
Decdeg=dat['dec-drad']*180./math.pi

#here are the 2 values for the large file
Radeg2=dat2['ra-drad']*180./math.pi
Decdeg2=dat2['dec-drad']*180./math.pi

tol = 0.000001  # match range in degrees; change this to tune the matching
for i in xrange(len(Radeg)):
    for j in xrange(len(Radeg2)):
        # select the row if both coordinates are very, very, very close
        # (i and j index different tables, so no i != j test is needed)
        if (abs(Radeg[i] - Radeg2[j]) <= tol and
                abs(Decdeg[i] - Decdeg2[j]) <= tol):
            # join the small-file row with the matching large-file row
            fopen2.write("     ".join([str(k) for k in list(dat[i])]) + "     " +
                         "     ".join([str(k) for k in list(dat2[j])]) + "\n")
fopen2.close()
-------------------------------------------
Now this is where I had to stop; this is way, way too long and messy. I
did a similar approach with smaller (9000 lines each) files and it worked
but took a while. The problem here is that I am going to have to play with
the match range to return the best result and give only one (1!) match per
row for my smaller file, i.e. row 1 of the small file must match only 1 row
of the large file... then I just need to return them both. However, it isn't
clear to me that this is the best way forward. I have been changing the
xrange to low values to play with the matching, but I would appreciate any
ideas. Thanks
~Elaina
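
P.S. To make the "only one match per row" requirement concrete, here is a
minimal sketch of the kind of thing I am after, using only the standard
library. It sorts the large table by RA once, then uses a binary search to
pull out the few candidates inside the match range and keeps the closest
one. The function name best_matches, the tol default and the toy coordinates
are made up for illustration. It also treats RA/Dec as flat coordinates (no
cos(dec) factor, no wrap-around at 0/360 degrees), which is an assumption
that is probably only reasonable for a very small match range.

```python
import bisect
import math


def best_matches(ra1, dec1, ra2, dec2, tol=1e-6):
    """For each (ra1[i], dec1[i]) return the index j of the closest
    (ra2[j], dec2[j]) lying within +/- tol in both coordinates,
    or None when nothing in the large table is that close."""
    # Sort the large table by RA once, so every lookup is a binary search
    # instead of a scan over all of its rows.
    order = sorted(range(len(ra2)), key=lambda j: ra2[j])
    sorted_ra = [ra2[j] for j in order]

    matches = []
    for r, d in zip(ra1, dec1):
        # Slice of the sorted table whose RA lies inside [r - tol, r + tol].
        lo = bisect.bisect_left(sorted_ra, r - tol)
        hi = bisect.bisect_right(sorted_ra, r + tol)
        best, best_dist = None, None
        for j in order[lo:hi]:
            if abs(dec2[j] - d) <= tol:
                # Flat-sky distance (assumption: fine for tiny separations).
                dist = math.hypot(ra2[j] - r, dec2[j] - d)
                if best is None or dist < best_dist:
                    best, best_dist = j, dist
        matches.append(best)
    return matches


# Toy data: three small-table positions, four large-table positions.
small_ra, small_dec = [10.0, 20.0, 30.0], [-5.0, 0.0, 5.0]
large_ra = [30.0000004, 10.0000002, 10.0000008, 99.0]
large_dec = [5.0, -5.0, -5.0, 0.0]
print(best_matches(small_ra, small_dec, large_ra, large_dec))
# prints [1, None, 0]
```

Each entry of the result is the row number in the large table to join with
that row of the small table, so rows with None could be written out with
fill values or skipped. Note that this does not stop two small-table rows
from picking the same large-table row.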
-- 
PhD Candidate
Department of Physics and Astronomy
Faculty of Science
Macquarie University
North Ryde, NSW 2109, Australia