[Tutor] Reading/dealing/matching with truly huge (ascii) files

Elaina Ann Hyde elainahyde at gmail.com
Fri Feb 24 06:11:36 CET 2012


On Thu, Feb 23, 2012 at 9:07 PM, Alan Gauld <alan.gauld at btinternet.com> wrote:

> On 23/02/12 01:55, Elaina Ann Hyde wrote:
>> ns/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/core.py",
>> line 158, in get_lines
>>     lines = table.splitlines()
>> MemoryError
>> ----------------------
>> So this means I don't have enough memory to run through the large file?
>>
>
> Probably, or the code you are using is doing something extremely
> inefficient.
>
>
>> Even if I just read in with asciitable I get this problem. I looked
>> again and the large file is 1.5GB of text lines, so very large.
>>
>
> How much RAM do you have? Probably only 1-2GB? If so, I'd suggest trying
> another approach.
>
> Peter has suggested a couple of ideas.
>
> The other way is to simply load both files into database tables and use a
> SQL SELECT to pull out the combined lines. This will probably be faster
> than trying to do line by line stitch ups in Python.
>
> You can also use the SQL interactive prompt to experiment with the query
> till you are sure it's right!
>
> Do you know any SQL? If not, it is very easy to learn.
> (See the database topic in my tutorial (v2 only).)
>
>
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
>
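
The database suggestion quoted above could look roughly like the sketch below, using the sqlite3 module from the standard library. Everything specific here is an assumption made for illustration: the table names (small, big), the column names, the sample coordinates and the tolerance are all hypothetical, and the query does a simple box match within a tolerance rather than a true nearest-neighbour search.

```python
import sqlite3

# In-memory database for the sketch; pass a file name instead for real data.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Hypothetical tables: one for the short list, one for the large catalogue.
cur.execute('CREATE TABLE small (name TEXT, ra_deg REAL, dec_deg REAL)')
cur.execute('CREATE TABLE big (id INTEGER, ra_deg REAL, dec_deg REAL)')

# Made-up rows; in practice these would be inserted while reading each file
# line by line, so the whole file never has to sit in memory at once.
cur.executemany('INSERT INTO small VALUES (?, ?, ?)',
                [('A', 274.90, -27.50), ('B', 275.20, -27.10)])
cur.executemany('INSERT INTO big VALUES (?, ?, ?)',
                [(1, 274.9004, -27.5003),
                 (2, 275.0, -27.3),
                 (3, 275.2001, -27.0999)])

# Join rows whose positions agree within a tolerance (in degrees).
tol = 0.001
cur.execute('''SELECT small.name, big.id
               FROM small JOIN big
                 ON abs(small.ra_deg - big.ra_deg) < ?
                AND abs(small.dec_deg - big.dec_deg) < ?
               ORDER BY small.name''', (tol, tol))
matches = cur.fetchall()
conn.close()
```

With real data, an index on the coordinate columns of the large table would probably be needed to keep the join fast.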

Ok, if I use awk I can separate the file into a digestible 240MB chunk and do
my initial sorting there.  Having learned my lesson from last time, I know
that numpy can be faster than looping over an array, so I use it to find the
minimum distance and get the matches.  I cobbled these together, and now the
matching is reasonably fast and seems to be doing quite well.
------------------------------------------------------------------------
#!/usr/bin/python

# import modules used here -- sys is a very standard one
import sys
import asciitable
import matplotlib
import matplotlib.path as mpath
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure, show
from matplotlib.patches import Ellipse
import scipy
import numpy as np
from numpy import *
import math
import pylab
import random
from pylab import *
import astropysics
import astropysics.obstools
import astropysics.coords
from astropysics.coords import ICRSCoordinates,GalacticCoordinates

x=open('Core_rod_name.list')
y=open('2MASS_subsetJKmikesigs18_19_36_27')

dat=asciitable.read(x,Reader=asciitable.CommentedHeader,
fill_values=['','-999.99'])

#first convert from decimal radians to degrees
Radeg=dat['ra-drad']*180./math.pi
Decdeg=dat['dec-drad']*180./math.pi

dat2 = asciitable.read(y,Reader=asciitable.NoHeader,
fill_values=[' nan','-99.9'])

#here are the 2 values for the large file
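#converts sexagesimal columns (RA in h m s, Dec in d m s) to decimal degrees
#the Dec formula subtracts minutes and seconds, so it only holds for negative declinations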
Radeg2 = 15*(dat2['col1']+(dat2['col2']/60.)+(dat2['col3']/(60.*60.)))
Decdeg2 = dat2['col4']-(dat2['col5']/60.)-(dat2['col6']/(60.*60.))

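#compute the distance to every source instead of applying a fixed limit
#numpy works on the whole array at once, which is faster than an inner loop
#note: flat approximation in degrees; the RA difference is not scaled by cos(Dec)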
def distance(Ra1,Dec1,Ra2,Dec2):
    x = Ra1 - Ra2
    y = Dec1 - Dec2
    return np.sqrt(x*x+y*y)

fopen=open('matches.list','w')
for i in xrange(len(Radeg)):
    #distance from source i to every source in the large file at once
    dist2 = np.array(distance(Radeg[i],Decdeg[i],Radeg2,Decdeg2))
    #index of the first minimum, i.e. the nearest source
    best_match = np.argmin(dist2)
    row1 = "     ".join([str(k) for k in list(dat[i])])
    row2 = "     ".join([str(k) for k in list(dat2[best_match])])
    fopen.write(str(dist2[best_match])+"    "+row1+"     "+row2+"\n")

fopen.close()
----------------------------------------------------------
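The distance function in the script uses plain differences in degrees. As a hedged variation, not what the script does, the nearest-match step can also scale the RA difference by cos(Dec), which matters more the further the field is from the celestial equator. The function name nearest_match and the sample coordinates below are made up for illustration, and RA wrap-around at 0/360 degrees is ignored.

```python
import numpy as np

def nearest_match(ra1, dec1, ra2, dec2):
    """Return (index, separation in degrees) of the entry in the arrays
    (ra2, dec2) closest to the single position (ra1, dec1).

    Small-angle approximation: the RA difference is scaled by cos(Dec).
    RA wrap-around at 0/360 degrees is not handled.
    """
    dra = (ra1 - ra2) * np.cos(np.radians(dec1))
    ddec = dec1 - dec2
    dist = np.sqrt(dra * dra + ddec * ddec)
    idx = int(np.argmin(dist))  # first index of the smallest distance
    return idx, float(dist[idx])

# Made-up catalogue positions in decimal degrees.
ra_cat = np.array([10.0, 10.5, 11.0])
dec_cat = np.array([-60.0, -60.0, -60.2])

idx, sep = nearest_match(10.4, -60.0, ra_cat, dec_cat)
```

At Dec = -60 the cos(Dec) factor is 0.5, so an RA offset of 0.1 degrees corresponds to only 0.05 degrees on the sky.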
Thanks everyone!  All your comments were really helpful; I think I might
even be getting the hang of this!
~Elaina

-- 
PhD Candidate
Department of Physics and Astronomy
Faculty of Science
Macquarie University
North Ryde, NSW 2109, Australia