[Tutor] Load Entire File into memory

Steven D'Aprano steve at pearwood.info
Mon Nov 4 18:11:52 CET 2013


On Mon, Nov 04, 2013 at 11:27:52AM -0500, Joel Goldstick wrote:

> If you are new to python why are you so concerned about the speed of
> your code.

Amal is new to Python, but he's not new to biology: he's a 4th-year 
student. With a 50GB file, I expect he is analysing something to do with 
DNA sequencing, which, depending on exactly what he is trying to do, 
could involve O(N) or even O(N**2) algorithms. An O(N) algorithm on a 
50GB file, assuming 100,000 steps per second, will take over 5 days to 
complete. An O(N**2) algorithm is all but unthinkable: roughly 
800 million years. You *really* don't want O(N**2) algorithms with big 
data.
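
For anyone who wants to check that arithmetic, here's a quick 
back-of-the-envelope calculation (my assumptions: one "step" per byte, 
and a 365.25-day year):

    N = 50e9        # 50 GB file, one step per byte
    rate = 1e5      # 100,000 steps per second

    print(N / rate / 86400)                # O(N): about 5.8 days
    print(N**2 / rate / (86400 * 365.25))  # O(N**2): about 790 million years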

I would expect that with a big DNA sequencing problem, running time 
would be measured in days rather than minutes or hours. So yes, this is 
probably a case where optimizing for speed is not premature.

We really don't know enough about his problem to advise him on how to 
speed it up. If the data file is guaranteed to contain nothing but GCTA 
bases and newlines, it may be better to read it into memory as a 
bytearray rather than a string, especially if he needs to modify it in 
place. But this is getting into fairly advanced territory, and I 
wouldn't like to predict what will be faster without testing on real 
data.
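
As a minimal sketch of what I mean (untested, Python 3, and the 
filename is made up for illustration):

    # Read the whole file into a mutable bytearray.
    with open('sequences.dat', 'rb') as f:
        data = bytearray(f.read())

    # Unlike strings, bytearrays support item assignment, so an edit
    # touches only the bytes that change instead of copying all 50GB.
    # E.g. transcribe DNA to RNA by replacing each T with U:
    T, U = ord('T'), ord('U')
    for i in range(len(data)):
        if data[i] == T:
            data[i] = U

Of course a pure-Python loop over 50 billion bytes has its own cost, 
which is why I'd want to benchmark this against bytes.translate() and 
friends on real data before recommending anything.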


-- 
Steven

