[Tutor] Reading/dealing/matching with truly huge (ascii) files
Alan Gauld
alan.gauld at btinternet.com
Wed Feb 22 09:31:23 CET 2012
On 22/02/12 05:44, Elaina Ann Hyde wrote:
> file is enormous, has over 50,000 rows and about 20 columns.
On modern computers it's not that enormous - probably around 10M?
But there are techniques for this, which we can cover another time if you
do hit files bigger than fit in memory.
I didn't go through the code in detail, but...
e = 0.000001
if (i != j and
    Radeg[i] <= (Radeg2[j]+e) and
    Radeg[i] >= (Radeg2[j]-e) and
    Decdeg[i] <= (Decdeg2[j]+e) and
    Decdeg[i] >= (Decdeg2[j]-e)):
Using e helps tune the precision as needed.
That layout style will help you see the logic more easily.
But in Python you can tidy that up even more by rewriting it like
if (i != j and
    (Radeg2[j]-e) <= Radeg[i] <= (Radeg2[j]+e) and
    (Decdeg2[j]-e) <= Decdeg[i] <= (Decdeg2[j]+e)):
And you could put it in a function to further control readability of the
main program and encapsulate the tests.
def rowsEquate(row1,row2, i, j):...
if (i != j and
    rowsEquate(Radeg,Radeg2,i,j) and
    rowsEquate(Decdeg, Decdeg2,i,j)):
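
As a rough sketch, the body of rowsEquate might look something like the following. The body, the default value of the tolerance parameter e, and the sample data are assumptions for illustration only; the real columns would come from your files.

```python
def rowsEquate(row1, row2, i, j, e=0.000001):
    """Return True if row1[i] lies within e of row2[j]."""
    return (row2[j] - e) <= row1[i] <= (row2[j] + e)

# Hypothetical sample data standing in for the real columns.
Radeg = [10.0, 20.5]
Radeg2 = [20.5000005, 10.0]
Decdeg = [-5.0, 3.25]
Decdeg2 = [3.25, -5.0000004]

# Collect the index pairs whose RA and Dec both agree within e.
matches = []
for i in range(len(Radeg)):
    for j in range(len(Radeg2)):
        if (i != j and
                rowsEquate(Radeg, Radeg2, i, j) and
                rowsEquate(Decdeg, Decdeg2, i, j)):
            matches.append((i, j))
```

Keeping e as a parameter means the precision can be tuned in one place rather than in every comparison.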
> fopen.write( " ".join([str(k) for k in list(dat[i])])+"
> "+" ".join([str(k) for k in list(dat[j])])+"\n")
I may be wrong, but it looks like there is something wrong with the quoting there?
The last quote on the first line, after the +?
> -------------------------------------------
> Now this is where I had to stop, this is way, way too long and messy.
It's not really that long or messy, but it could be tidied up a little.
> did a similar approach with smaller (9000 lines each) files and it
> worked but took awhile,
This might be the biggest problem: it will take a long time on big files.
Personally I would tend to tackle a problem like this using a
database and write a query to select the rows that match in one
operation. Especially if I had to process a lot of files.
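
To give a flavour of the database approach, here is a minimal sketch using the sqlite3 module from the standard library. The table names (cat1, cat2), the column names (ra_deg, dec_deg) and the sample rows are all made up for illustration; in practice you would load the rows from your two files instead.

```python
import sqlite3

e = 0.000001  # matching tolerance, same idea as above

# An in-memory database; pass a filename instead to keep it on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cat1 (id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL)")
conn.execute("CREATE TABLE cat2 (id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL)")

# Hypothetical rows standing in for the contents of the two files.
conn.executemany("INSERT INTO cat1 (ra_deg, dec_deg) VALUES (?, ?)",
                 [(10.0, -5.0), (20.5, 3.25), (30.0, 1.0)])
conn.executemany("INSERT INTO cat2 (ra_deg, dec_deg) VALUES (?, ?)",
                 [(20.5000005, 3.25), (10.0, -5.0000004), (99.0, 9.0)])

# An index on the second table lets the database narrow the range search.
conn.execute("CREATE INDEX idx_cat2_ra ON cat2 (ra_deg)")

# One query selects every pair of rows that agree within the tolerance.
query = """
    SELECT a.id, b.id
    FROM cat1 AS a
    JOIN cat2 AS b
      ON (b.ra_deg BETWEEN a.ra_deg - ? AND a.ra_deg + ?)
     AND (b.dec_deg BETWEEN a.dec_deg - ? AND a.dec_deg + ?)
    ORDER BY a.id, b.id
"""
pairs = conn.execute(query, (e, e, e, e)).fetchall()
conn.close()
```

Here pairs holds the ids of the matching rows from each table, so the nested Python loops are replaced by a single query.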
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/