[Tutor] Load Entire File into memory
Steven D'Aprano
steve at pearwood.info
Mon Nov 4 18:30:11 CET 2013
On Mon, Nov 04, 2013 at 04:54:16PM +0000, Alan Gauld wrote:
> On 04/11/13 16:34, Amal Thomas wrote:
> >@Joel: The code runs for weeks..input file which I have to process in
> >very huge(in 50 gbs). So its not a matter of hours.its matter of days
> >and weeks..
>
> OK, but that's not down to reading the file from disk.
> Reading a 50G file will only take a few minutes if you have enough RAM,
> which seems to be the case.
Not really. There is still some uncertainty (at least in my mind!). For
instance, I assume that Amal doesn't have sole access to the server. So
there could be another dozen users all trying to read 50GB files at
once, in a machine with only 100GB of memory...
Once the server starts paging, performance will plummet.
> If it's taking days/weeks you must be doing
> some incredibly time consuming processing.
Well, yes, it's biology :-)
> It's probably worth putting some more timing statements into your code
> to see where the time is going because it's not the reading from the
> disk that's the problem.
The first thing I would do is run the code on three smaller sample
files:
50MB
100MB
200MB
The time taken should approximately double as you double the size of the
file. If it takes, say, 2 hours to process the 50MB file, 4 hours for the
100MB file and 8 hours for the 200MB file, that's linear performance
and isn't too bad.
But if performance isn't linear, say 2 hours, 4 hours, 16 hours, then
you're in trouble and you *desperately* need to reconsider the algorithm
being used. Either that, or just accept that this is an inherently slow
calculation and it will take a week or two.
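
Here is a rough sketch of that scaling test, in case it helps. The
process() function is only a hypothetical stand-in for the real analysis,
and the sample data is made up. In practice you would time your actual
code on the 50MB, 100MB and 200MB files instead.

```python
import time


def process(lines):
    # Hypothetical stand-in for the real analysis; replace with your own code.
    return sum(len(line) for line in lines)


def time_run(func, data):
    """Return the wall-clock seconds taken by func(data)."""
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


def growth_ratios(timings):
    """Ratio of each timing to the one before it."""
    return [later / earlier for earlier, later in zip(timings, timings[1:])]


if __name__ == "__main__":
    # Made-up samples, each double the size of the previous one.
    base = ["ACGT" * 25] * 10000
    samples = [base, base * 2, base * 4]
    timings = [time_run(process, sample) for sample in samples]
    for sample, seconds in zip(samples, timings):
        print(len(sample), "lines:", round(seconds, 4), "seconds")
    print("growth ratios:", growth_ratios(timings))
```

Ratios close to 2 each time you double the input suggest roughly linear
behaviour. Ratios that keep climbing (2, then 4, ...) suggest something
worse. Very small inputs give noisy timings, so don't read too much into
a single run.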
Amal, another thing you should try is to use the Python profiler on your
code (again, on a smaller sample file). The profiler will show you where
the time is being spent.
Unfortunately the profiler may slow your code down, so it is important
to use it on manageably sized data. The profiler is explained here:
http://docs.python.org/3/library/profile.html
If you need any help, don't hesitate to ask.
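
To give a feel for it, here is a minimal sketch using the standard
library's cProfile and pstats modules. The functions analyse() and
slow_step() are invented for illustration only; they stand in for
whatever your real code does.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Hypothetical expensive calculation.
    return sum(i * i for i in range(n))


def analyse():
    # Hypothetical top-level routine that calls the expensive step repeatedly.
    return [slow_step(20000) for _ in range(20)]


# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
result = analyse()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

You can also profile a whole script without editing it, with something
like "python -m cProfile -s cumulative myscript.py" (the script name is
just an example). Either way, look at the functions near the top of the
report first, since that is where most of the time is going.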
> >trying to optimize my code to get the outputs in less time and memory
> >efficiently.
>
> Memory efficiency is easy, do it line by line off the disk.
This assumes that you can process one line at a time, sequentially. I
expect that is not the case.
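
If it turns out that each line *can* be handled independently, the
line-by-line approach looks something like this. This is only a sketch
under that assumption: tally_characters() and the tiny sample file are
made up for the example, and the real processing will surely be more
involved.

```python
import os
import tempfile
from collections import Counter


def tally_characters(path):
    """Count characters in a text file, holding only one line in memory."""
    totals = Counter()
    with open(path) as handle:
        # Iterating over the file object reads one line at a time.
        for line in handle:
            totals.update(line.rstrip("\n"))
    return totals


# Made-up sample file, standing in for the real (huge) input.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ACGT\nAACC\n")
    sample_path = tmp.name

counts = tally_characters(sample_path)
os.remove(sample_path)
print(counts)
```

Memory use here depends on the longest line and the size of the running
totals, not on the size of the file. If the calculation needs to look at
many lines at once, this pattern won't work as-is.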
--
Steven