[Tutor] Load Entire File into memory

Steven D'Aprano steve at pearwood.info
Mon Nov 4 18:30:11 CET 2013


On Mon, Nov 04, 2013 at 04:54:16PM +0000, Alan Gauld wrote:
> On 04/11/13 16:34, Amal Thomas wrote:
> >@Joel: The code runs for weeks. The input file which I have to process is
> >very huge (about 50 GB). So it's not a matter of hours, it's a matter of
> >days and weeks.
> 
> OK, but that's not down to reading the file from disk.
> Reading a 50G file will only take a few minutes if you have enough RAM, 
> which seems to be the case.

Not really. There is still some uncertainty (at least in my mind!). For 
instance, I assume that Amal doesn't have sole access to the server. So 
there could be another dozen users all trying to read 50GB files at 
once, on a machine with only 100GB of memory... 

Once the server starts paging, performance will plummet.


> If it's taking days/weeks you must be doing 
> some incredibly time consuming processing.

Well, yes, it's biology :-)



> It's probably worth putting some more timing statements into your code 
> to see where the time is going because it's not the reading from the 
> disk that's the problem.

The first thing I would do is run the code on three smaller sample 
files:

50MB
100MB
200MB

The time taken should approximately double as you double the size of the 
file: if it takes, say, 2 hours to process the 50MB file, 4 hours for the 
100MB file and 8 hours for the 200MB file, that's linear performance 
and isn't too bad.

But if performance isn't linear, say 2 hours, 4 hours, 16 hours, then 
you're in trouble and you *desperately* need to reconsider the algorithm 
being used. Either that, or just accept that this is an inherently slow 
calculation and it will take a week or two.
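For example, here is a rough timing sketch (the sample file names are 
made up, and the body of process_file() is just a stand-in for whatever 
your script actually does):

    import time

    def process_file(name):
        # Placeholder: replace this with your real processing code.
        with open(name) as f:
            for line in f:
                pass

    for name in ("sample_50MB.txt", "sample_100MB.txt", "sample_200MB.txt"):
        start = time.perf_counter()
        process_file(name)
        print(name, "took", time.perf_counter() - start, "seconds")

Compare the three printed times to see whether the growth is roughly 
linear.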

Amal, another thing you should try is to run the Python profiler on your 
code (again, on a smaller sample file). The profiler will show you where 
the time is being spent.

Unfortunately the profiler may slow your code down, so it is important 
to use it on data of a manageable size. The profiler is explained here:

http://docs.python.org/3/library/profile.html
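For instance, you can profile a whole run from the shell like this (the 
script and sample file names below are just placeholders):

    python3 -m cProfile -s cumulative yourscript.py sample_50MB.txt

The "-s cumulative" option sorts the report so the functions accounting 
for the most total time appear first.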

If you need any help, don't hesitate to ask.


> >trying to optimize my code to get the outputs in less time and memory
> >efficiently.
> 
> Memory efficiency is easy, do it line by line off the disk.

This assumes that you can process one line at a time, sequentially. I 
expect that is not the case.
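For the record, the line-by-line pattern Alan is describing looks like 
this (a sketch only, and only useful if each record really does fit on 
one line; the file name is a placeholder):

    with open("huge_input.txt") as f:
        for line in f:
            pass  # replace with whatever per-record work you need

Iterating over the file object this way keeps only one line in memory at 
a time instead of the whole 50GB.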


-- 
Steven

