[Tutor] Reading/dealing/matching with truly huge (ascii) files

Wed Feb 22 09:31:23 CET 2012

On 22/02/12 05:44, Elaina Ann Hyde wrote:

> file is enormous, has over 50,000 rows and about 20 columns.

On modern computers it's not that enormous - probably around 10MB?
But there are techniques for this, which we can cover another time if you 
do hit files bigger than will fit in memory.

I didn't go through the code in detail, but...

e = 0.000001
if (i != j and
    Radeg[i] <= (Radeg2[j]+e) and
    Radeg[i] >= (Radeg2[j]-e) and
    Decdeg[i] <= (Decdeg2[j]+e) and
    Decdeg[i] >= (Decdeg2[j]-e)):

Using e helps tune the precision as needed.

That layout style will help you see the logic more easily.

But in Python you can tidy that up even more by rewriting it like

if (i != j and
    (Radeg2[j]-e) <= Radeg[i] <= (Radeg2[j]+e) and
    (Decdeg2[j]-e) <= Decdeg[i] <= (Decdeg2[j]+e)):

And you could put it in a function to further improve the readability of 
the main program and encapsulate the tests.

def rowsEquate(row1, row2, i, j): ...

if (i != j and
    rowsEquate(Radeg, Radeg2, i, j) and
    rowsEquate(Decdeg, Decdeg2, i, j)):
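
As a rough sketch only, here is one possible body for rowsEquate. The 
body, the extra e parameter and the sample values are my assumptions 
for illustration, not taken from your code:

```python
E = 0.000001  # tolerance, same value as e above


def rowsEquate(row1, row2, i, j, e=E):
    """Return True if row1[i] lies within e of row2[j] (assumed meaning)."""
    return (row2[j] - e) <= row1[i] <= (row2[j] + e)


# Made-up sample values, purely for illustration
Radeg = [10.0, 20.0]
Radeg2 = [20.0000005, 10.5]
Decdeg = [-5.0, 3.0]
Decdeg2 = [3.0000002, -5.0]

i, j = 1, 0
match = (i != j and
         rowsEquate(Radeg, Radeg2, i, j) and
         rowsEquate(Decdeg, Decdeg2, i, j))
print(match)
```

Keeping the tolerance as a parameter means you can loosen or tighten 
the match in one place.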

> fopen.write( " ".join([str(k) for k in list(dat[i])])+"
> "+" ".join([str(k) for k in list(dat[j])])+"\n")

I may be wrong, but it looks like there is something wrong with the 
quoting there. Perhaps the last quote on the first line, after the +?
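
If the intent was to write both rows on one line separated by spaces 
(that is my guess, I can't tell from the wrapped text), a sketch that 
avoids the awkward quoting could look like this. The sample rows and 
the use of io.StringIO in place of your real file are assumptions:

```python
import io

# Made-up rows standing in for dat[i] and dat[j]
dat = [(1, 2.5, "a"), (3, 4.5, "b")]
i, j = 0, 1

fopen = io.StringIO()  # stands in for the real output file

# Build one list of strings from both rows, then join once
fields = [str(k) for k in dat[i]] + [str(k) for k in dat[j]]
fopen.write(" ".join(fields) + "\n")

print(fopen.getvalue(), end="")
```

Joining a single list means there is only one separator string to get 
right.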

> -------------------------------------------
> Now this is where I had to stop, this is way, way too long and messy.

It's not really that long or messy, but it could be tidied up a little.

> did a similar approach with smaller (9000 lines each) files and it
> worked but took awhile,

This might be the biggest problem: it will take a long time on big files.

Personally I would tend to tackle a problem like this using a
database and write a query to select the rows that match in one 
operation. Especially if I had to process a lot of files.
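
To give a flavour of the database approach, here is a minimal sketch 
using the sqlite3 module from the standard library. The table names, 
column names and sample values are all invented for illustration; in 
practice you would load the rows from your files first:

```python
import sqlite3

e = 0.000001  # same tolerance as in the comparisons above

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE cat1 (id INTEGER, ra_deg REAL, dec_deg REAL)")
conn.execute("CREATE TABLE cat2 (id INTEGER, ra_deg REAL, dec_deg REAL)")

# Made-up rows; real data would come from the two input files
conn.executemany("INSERT INTO cat1 VALUES (?, ?, ?)",
                 [(0, 10.0, -5.0), (1, 20.0, 3.0)])
conn.executemany("INSERT INTO cat2 VALUES (?, ?, ?)",
                 [(0, 20.0000005, 3.0000002), (1, 10.5, -5.0)])

# One query selects every pair of rows that agree within the tolerance
query = """
    SELECT a.id, b.id
    FROM cat1 AS a JOIN cat2 AS b
      ON a.id != b.id
     AND a.ra_deg  BETWEEN b.ra_deg  - :e AND b.ra_deg  + :e
     AND a.dec_deg BETWEEN b.dec_deg - :e AND b.dec_deg + :e
"""
matches = conn.execute(query, {"e": e}).fetchall()
print(matches)
conn.close()
```

The matching logic lives in the query, so the Python side is reduced 
to loading the data and reading the results.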

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/