[Tutor] Reading/dealing/matching with truly huge (ascii) files

Peter Otten __peter__ at web.de
Wed Feb 22 10:50:56 CET 2012


Elaina Ann Hyde wrote:

> So, Python question of the day:  I have 2 files that I could normally just
> read in with asciitable.  The first file is a 12-column, 8000-row table
> that I have read in via asciitable and manipulated.  The second file is
> enormous: over 50,000 rows and about 20 columns.  What I want to do is
> find the best match for (file 1, columns 1 and 2) with (file 2, columns 4
> and 5), return all rows that match from the huge file, join them together,
> and save the whole mess as a file with 8000 rows (assuming the smaller
> table finds one match per row) and 32 = 12 + 20 columns.  So my read code
> so far is as follows:
> -------------------------------------------------
> import sys
> import asciitable
> import matplotlib
> import scipy
> import numpy as np
> from numpy import *
> import math
> import pylab
> import random
> from pylab import *
> import astropysics
> import astropysics.obstools
> import astropysics.coords
> 
> x = small_file   # path to the small table
> # cannot read blank values (string!); if blank, insert -999.99
> dat = asciitable.read(x, Reader=asciitable.CommentedHeader,
>                       fill_values=['', '-999.99'])
> y = large_file   # path to the huge table
> fopen2 = open('cfile2match.list', 'w')
> dat2 = asciitable.read(y, Reader=asciitable.CommentedHeader,
>                        fill_values=['', '-999.99'])
> # the two matching columns from the small file, converted to degrees
> Radeg = dat['ra-drad'] * 180. / math.pi
> Decdeg = dat['dec-drad'] * 180. / math.pi
> 
> # the two matching columns from the large file
> Radeg2 = dat2['ra-drad'] * 180. / math.pi
> Decdeg2 = dat2['dec-drad'] * 180. / math.pi
> 
> for i in xrange(len(Radeg)):
>          for j in xrange(len(Radeg2)):
>                 # select the row if it is very, very, very close
>                 if (Radeg[i] <= Radeg2[j] + 0.000001 and
>                         Radeg[i] >= Radeg2[j] - 0.000001 and
>                         Decdeg[i] <= Decdeg2[j] + 0.000001 and
>                         Decdeg[i] >= Decdeg2[j] - 0.000001):
>                     fopen2.write("     ".join(str(k) for k in list(dat[i])) +
>                                  "     " +
>                                  "     ".join(str(k) for k in list(dat2[j])) + "\n")
> -------------------------------------------
> Now this is where I had to stop; this is way, way too long and messy.  I
> took a similar approach with smaller files (9000 lines each) and it
> worked, but took a while.  The problem here is that I am going to have to
> play with the match range to return the best result and give only one (1!)
> match per row of my smaller file, i.e. row 1 of the small file must match
> only 1 row of the large file..... then I just need to return them both.
> However, it isn't clear to me that this is the best way forward.  I have
> been changing the xrange to low values to play with the matching, but I
> would appreciate any ideas.  Thanks

If you calculate the distance and take the minimum, instead of checking 
whether it falls under a certain threshold, you are guaranteed to get (one 
of the) best matches, and exactly one per small-file row.
Pseudo-code:

from functools import partial

big_rows = read_big_file_into_memory()

def distance(small_row, big_row):
    ...

for small_row in read_small_file():
    best_match = min(big_rows, key=partial(distance, small_row))
    write_to_result_file(small_row, best_match)
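
With 8000 rows in the small file and over 50,000 in the big one, that is 
still some 4*10**8 distance() calls in pure Python, which will be slow. 
Since asciitable already hands you numpy arrays, the inner scan can be 
vectorised; a minimal sketch (best_match_index is just an illustrative 
name, and Radeg2/Decdeg2 are the degree arrays from your post):

import numpy as np

def best_match_index(ra, dec, ra2, dec2):
    # squared offset from (ra, dec) to every big-file row at once
    d2 = (ra2 - ra) ** 2 + (dec2 - dec) ** 2
    return int(np.argmin(d2))   # index of the closest row

That replaces the inner Python loop with a couple of numpy array 
operations per small-file row.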


As to the actual implementation of the distance() function: you are 
matching two columns of the small file (1 and 2) against two columns of 
the big one (4 and 5), so a window test like

a, c = extract_columns_from_small_row(small_row)
b, d = extract_columns_from_big_row(big_row)
if abs(a - b) <= eps and abs(c - d) <= eps:
    # it's good

would typically become

def distance(small_row, big_row):
    a, c = extract_columns_from_small_row(small_row)
    b, d = extract_columns_from_big_row(big_row)
    x = a - b
    y = c - d
    return math.sqrt(x*x + y*y)
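
Putting the pieces together (a sketch, reusing the asciitable calls, the 
'ra-drad'/'dec-drad' column names and the fill_values from your post; 
small_file and large_file are your two paths):

import math
from functools import partial

import asciitable

dat = asciitable.read(small_file, Reader=asciitable.CommentedHeader,
                      fill_values=['', '-999.99'])
dat2 = asciitable.read(large_file, Reader=asciitable.CommentedHeader,
                       fill_values=['', '-999.99'])

def distance(small_row, big_row):
    # offsets in radians; both tables carry 'ra-drad' and 'dec-drad'
    x = small_row['ra-drad'] - big_row['ra-drad']
    y = small_row['dec-drad'] - big_row['dec-drad']
    return math.sqrt(x*x + y*y)

fopen2 = open('cfile2match.list', 'w')
for small_row in dat:
    best = min(dat2, key=partial(distance, small_row))
    # join the matched rows: 12 + 20 = 32 columns
    fopen2.write("     ".join(str(k) for k in list(small_row) + list(best))
                 + "\n")
fopen2.close()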



