[Tutor] Load Entire File into memory
Steven D'Aprano
steve at pearwood.info
Mon Nov 4 18:11:52 CET 2013
On Mon, Nov 04, 2013 at 11:27:52AM -0500, Joel Goldstick wrote:
> If you are new to python why are you so concerned about the speed of
> your code.
Amal is new to Python but he's not new to biology; he's a 4th year
student. With a 50GB file, I expect he is analysing something to do with
DNA sequencing, which, depending on exactly what he is trying to do, could
involve O(N) or even O(N**2) algorithms. An O(N) algorithm on a 50GB
file, assuming one step per byte and 100,000 steps per second, will take
over 5 days to complete. An O(N**2) algorithm, well, it's all but
unthinkable: nearly 800 million years. You *really* don't want O(N**2)
algorithms with big data.
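For anyone who wants to check those figures, here is a rough
back-of-the-envelope sketch. It assumes one step per byte (so N is 50
billion) and the same 100,000 steps per second as above. The function
names are made up for illustration, and the rate is a guess, not a
measurement.

```python
# Rough running-time estimates for a 50GB input.
# Assumptions (illustrative only): one step per byte, 100,000 steps/second.

SECONDS_PER_DAY = 60 * 60 * 24
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365.25


def linear_days(n_items, steps_per_second):
    """Days needed by an O(N) algorithm doing one step per item."""
    return n_items / steps_per_second / SECONDS_PER_DAY


def quadratic_years(n_items, steps_per_second):
    """Years needed by an O(N**2) algorithm doing N*N steps."""
    return n_items ** 2 / steps_per_second / SECONDS_PER_YEAR


n = 50 * 10**9   # 50GB, counted as 50 billion bytes
rate = 10**5     # assumed steps per second

print("O(N):    %.1f days" % linear_days(n, rate))
print("O(N**2): %.0f million years" % (quadratic_years(n, rate) / 1e6))
```

The linear case comes out a little under 6 days and the quadratic case
a little under 800 million years, so the two differ by a factor of N.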
I would expect that with a big DNA sequencing problem, running time
would be measured in days rather than minutes or hours. So yes, this is
probably a case where optimizing for speed is not premature.
We really don't know enough about his problem to advise him on how to
speed it up. If the data file is guaranteed to be nothing but GCTA
bases and newlines, it may be better to read the data file into memory
as a bytearray rather than a string, especially if he needs to modify it
in place: strings are immutable, so every change to one builds a new
copy. But this is getting into some fairly advanced territory, and I
wouldn't like to predict what will be faster without testing on real
data.
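As a minimal sketch of the bytearray idea, and nothing more than that:
the helper names load_bases and mask_region are hypothetical, the
masking operation is just a stand-in for whatever in-place change is
actually needed, and the whole thing assumes the file fits in RAM. I
have not timed this against a string-based version.

```python
import os


def load_bases(path):
    """Read a whole file into a mutable bytearray (hypothetical helper).

    The buffer is allocated once and filled with readinto(), which avoids
    the temporary bytes copy that bytearray(f.read()) would create.
    """
    size = os.path.getsize(path)
    buf = bytearray(size)
    with open(path, "rb") as f:
        n = f.readinto(buf)
    if n != size:
        raise IOError("expected %d bytes, read %d" % (size, n))
    return buf


def mask_region(buf, start, stop):
    """Overwrite buf[start:stop] with b'N' in place (hypothetical helper)."""
    # Clamp to the buffer so the assignment can never change its length.
    start, stop, _ = slice(start, stop).indices(len(buf))
    stop = max(start, stop)
    buf[start:stop] = b"N" * (stop - start)
```

Usage would look something like this, with "reads.txt" standing in for
the real file name:

    data = load_bases("reads.txt")
    print(data.count(b"G"))      # count one base without making a copy
    mask_region(data, 0, 100)    # modify the first 100 bytes in place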
--
Steven