[Tutor] Reading/dealing/matching with truly huge (ascii) files

Elaina Ann Hyde elainahyde at gmail.com
Thu Feb 23 02:55:40 CET 2012


On Wed, Feb 22, 2012 at 8:50 PM, Peter Otten <__peter__ at web.de> wrote:

> Elaina Ann Hyde wrote:
>
> > So, Python question of the day:  I have 2 files that I could normally
> just
> > read in with asciitable, The first file is a 12 column 8000 row table
> that
> > I have read in via asciitable and manipulated.  The second file is
> > enormous, has over 50,000 rows and about 20 columns.  What I want to do
> is
> > find the best match for (file 1 column 1 and 2) with (file 2 column 4 and
> > 5), return all rows that match from the huge file, join them together
> and
> > save the whole mess as a file with 8000 rows (assuming the smaller table
> > finds one match per row) and 32=12+20 columns.  So my read code so far is
> > as follows:
> > -------------------------------------------------
> > import sys
> > import asciitable
> > import matplotlib
> > import scipy
> > import numpy as np
> > from numpy import *
> > import math
> > import pylab
> > import random
> > from pylab import *
> > import astropysics
> > import astropysics.obstools
> > import astropysics.coords
> >
> > x=small_file
> > #cannot read blank values (string!) if blank insert -999.99
> > dat=asciitable.read(x,Reader=asciitable.CommentedHeader,
> > fill_values=['','-999.99'])
> > y=large_file
> > fopen2=open('cfile2match.list','w')
> > dat2=asciitable.read(y,Reader=asciitable.CommentedHeader,
> > fill_values=['','-999.99'])
> > #here are the 2 values for the small file
> > Radeg=dat['ra-drad']*180./math.pi
> > Decdeg=dat['dec-drad']*180./math.pi
> >
> > #here are the 2 values for the large file
> > Radeg2=dat2['ra-drad']*180./math.pi
> > Decdeg2=dat2['dec-drad']*180./math.pi
> >
> > for i in xrange(len(Radeg)):
> >     for j in xrange(len(Radeg2)):
> >         #select the value if it is very, very, very close
> >         if (i != j and Radeg[i] <= (Radeg2[j]+0.000001) and
> >                 Radeg[i] >= (Radeg2[j]-0.000001) and
> >                 Decdeg[i] <= (Decdeg2[j]+0.000001) and
> >                 Decdeg[i] >= (Decdeg2[j]-0.000001)):
> >             fopen2.write("     ".join([str(k) for k in list(dat[i])]) +
> >                 "     " + "     ".join([str(k) for k in list(dat2[j])]) + "\n")
> > -------------------------------------------
> > Now this is where I had to stop: this is way, way too long and messy.  I
> > took a similar approach with smaller files (9000 lines each) and it
> > worked, but took a while.  The problem here is that I am going to have to
> > play with the match range to return the best result and give only one
> > (1!) match per row for my smaller file, i.e. row 1 of the small file must
> > match only 1 row of the large file..... then I just need to return them
> > both.  However, it isn't clear to me that this is the best way forward.
> > I have been changing the xrange to low values to play with the matching,
> > but I would appreciate any ideas.  Thanks
>
> If you calculate the distance instead of checking if it's under a certain
> threshold you are guaranteed to get (one of the) best matches.
> Pseudo-code:
>
> from functools import partial
> big_rows = read_big_file_into_memory()
>
> def distance(small_row, big_row):
>    ...
>
> for small_row in read_small_file():
> >    best_match = min(big_rows, key=partial(distance, small_row))
>    write_to_result_file(best_match)
>
>
> As to the actual implementation of the distance() function, I don't
> understand your problem description (two columns in the first, three in the
> second, how does that work), but generally
>
> a, c = extract_columns_from_small_row(small_row)
> b, d = extract_columns_from_big_row(big_row)
> if (a <= b + eps) and (c <= d + eps):
>   # it's good
>
> would typically become
>
> > def distance(small_row, big_row):
>    a, c = extract_columns_from_small_row(small_row)
>    b, d = extract_columns_from_big_row(big_row)
>    x = a-b
>    y = c-d
>    return math.sqrt(x*x+y*y)
>
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>



Thanks for all the helpful hints; I really like the idea of using distances
instead of a limit.  Walter was right that the 'i != j' condition was
causing problems.  Alan and Steven's use of the index separately was great,
as it makes this much easier to test (and yes, 'astropysics' is a valid
package; it's in there for later, when I convert astrophysical coordinates
and whatnot, pretty great but a little buggy, FYI).  So I thought, hey, why
not try to do a little of all these ideas.  If you'll forgive the change in
syntax, I think the problem is that the file might really just be too big
to handle, and I'm not sure I have the right idea with the best_match:
-----------------------------------
#!/usr/bin/python

import sys
import asciitable
import matplotlib
import scipy
import numpy as np
import math
import pylab
import random
from pylab import *
import astropysics
import astropysics.obstools
import astropysics.coords
from astropysics.coords import ICRSCoordinates,GalacticCoordinates

#small
x=open('allfilematch.list')

#really big 2MASS file called 'sgr_2df_big.list'
y=open('/Volumes/Diemos/sgr_2df_big.list')

dat=asciitable.read(x,Reader=asciitable.CommentedHeader,
fill_values=['','-999.99'])
dat2=asciitable.read(y,Reader=asciitable.NoHeader,
data_start=4,fill_values=['nan','-999.99'])

fopen=open('allfiles_rod2Mass.list','w')

#first convert from decimal radians to degrees
Radeg=dat['ra-drad']*180./math.pi
Decdeg=dat['dec-drad']*180./math.pi

#here are the 2 values for the large file
#converts hexadecimal in multiple columns to regular degrees
Radeg2=15*(dat2['col1']+(dat2['col2']/60.)+(dat2['col3']/(60.*60.)))
Decdeg2=dat2['col4']+(dat2['col5']/60.)+(dat2['col6']/(60.*60.))

#try defining distances instead of a limit...
def distance(ra, dec):
    #separation from one small-file position to every row of the big file
    x = Radeg2 - ra
    y = Decdeg2 - dec
    return np.sqrt(x*x + y*y)

for i in xrange(len(Radeg)):
    #index of the big-file row closest to small-file row i
    j = np.argmin(distance(Radeg[i], Decdeg[i]))
    fopen.write("     ".join(str(k) for k in list(dat[i]) + list(dat2[j])) + "\n")

fopen.close()
---------------
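For what it's worth, since scipy is already imported here, the per-row
minimum-distance search could also be done with a KD-tree, which finds the
nearest big-file row for every small-file row in one query instead of an
8000 x 50,000 double loop.  A rough sketch with toy stand-in arrays (the
real Radeg/Decdeg arrays would go in their place):

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy stand-ins for the real coordinate arrays, in degrees.
Radeg = np.array([10.0, 20.0])
Decdeg = np.array([-5.0, -6.0])
Radeg2 = np.array([10.00001, 19.5, 20.00002, 50.0])
Decdeg2 = np.array([-5.00001, -6.5, -6.00001, 0.0])

# Build the tree on the big catalogue once...
tree = cKDTree(np.column_stack([Radeg2, Decdeg2]))
# ...then query all small-file positions in a single call.
dist, idx = tree.query(np.column_stack([Radeg, Decdeg]))
# idx[i] is the big-file row closest to small-file row i;
# dist[i] is that separation in degrees, handy for rejecting poor matches.
```

Note this treats RA/Dec as flat Cartesian coordinates, which is only a fair
approximation for very small separations away from the poles.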
The errors from running this script are as follows:
---------------------
Python(4085,0xa01d3540) malloc: *** mmap(size=2097152) failed (error
code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "read_2MASS.py", line 38, in <module>

dat2=asciitable.read(y,Reader=asciitable.NoHeader,data_start=4,fill_values=['nan','-9.999'])
  File
"/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/ui.py",
line 131, in read
    dat = _guess(table, new_kwargs)
  File
"/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/ui.py",
line 175, in _guess
    dat = reader.read(table)
  File
"/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/core.py",
line 841, in read
    self.lines = self.inputter.get_lines(table)
  File
"/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/core.py",
line 158, in get_lines
    lines = table.splitlines()
MemoryError
----------------------
So does this mean I don't have enough memory to run through the large file?
Even if I just read it in with asciitable I get this problem.  I looked
again, and the large file is 1.5GB of text lines, so very large.  I was
thinking of trying to tell the read function to skip lines that are too far
away, since the file is much, much bigger than the area I need.  Thanks for
the comments so far.
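That kind of skipping can be done before asciitable ever sees the file:
stream the 1.5GB file line by line, keep only lines inside a bounding box
around the region of interest, and hand the much smaller result to
asciitable.  A sketch, assuming the same column layout as the script above
(RA in columns 1-3 as hours/minutes/seconds, Dec in columns 4-6) and
made-up bounds that would need adjusting to the real region:

```python
# Made-up bounds for the region of interest, in degrees.
RA_MIN, RA_MAX = 200.0, 220.0
DEC_MIN, DEC_MAX = -40.0, -20.0

def in_region(line):
    """Parse one whitespace-separated line; True if inside the bounding box."""
    cols = line.split()
    try:
        # Columns 1-3: RA as hours/minutes/seconds -> degrees.
        ra = 15.0 * (float(cols[0]) + float(cols[1]) / 60.0
                     + float(cols[2]) / 3600.0)
        # Columns 4-6: Dec as degrees/arcmin/arcsec; keep the sign of the
        # degrees field so southern declinations come out right.
        sign = -1.0 if cols[3].startswith('-') else 1.0
        dec = sign * (abs(float(cols[3])) + float(cols[4]) / 60.0
                      + float(cols[5]) / 3600.0)
    except (IndexError, ValueError):
        return False          # malformed or short line: drop it
    return RA_MIN <= ra <= RA_MAX and DEC_MIN <= dec <= DEC_MAX

def filter_big_file(src_path, dst_path, skip=4):
    """Stream src_path line by line, writing only in-region data lines."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for n, line in enumerate(src):
            if n >= skip and in_region(line):
                dst.write(line)
```

Calling filter_big_file('/Volumes/Diemos/sgr_2df_big.list',
'sgr_2df_cut.list') would then leave a file small enough for asciitable to
read, since only one line is ever held in memory at a time.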
~Elaina

-- 
PhD Candidate
Department of Physics and Astronomy
Faculty of Science
Macquarie University
North Ryde, NSW 2109, Australia