[Tutor] Load Entire File into memory

Danny Yoo dyoo at hashcollision.org
Mon Nov 4 22:38:46 CET 2013


On Mon, Nov 4, 2013 at 9:41 AM, Amal Thomas <amalthomas111 at gmail.com> wrote:

> @Steven: Thank you...My input data is basically AUGC and newlines... I
> would like to know about bytearray technique. Please suggest me some links
> or reference.. I will go through the profiler and check whether the code
> maintains linearity with the input files.
>
>
Hi Amal,

I suspect that what's been missing here throughout this thread is more
concrete information about the problem's background.  I would strongly
suggest we make sure that we understand the problem before making more
assumptions.


1.  What is the nature of the operation that you are doing on your data?
Can you briefly discuss its details?  Does it involve random-access, or is
it a sequential operation?  Are the operations independent regardless of
what line you are on, or is there some kind of dependency across lines?
Does it involve pattern matching, or...?  Are you maintaining some
in-memory data structure as you're walking through the file?

The reason we need to know this is that it can affect file access
patterns.  It may provide a hint as to whether you can avoid loading
the whole file into memory.  It may even affect whether you can
distribute your work among several computers.
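
As a rough sketch of the sequential case: if each line can be handled
independently, the file can be streamed one line at a time, keeping only a
small running summary in memory.  The function name count_bases and the
choice of tallying characters are made up for illustration; they are not
your actual operation.

```python
from collections import Counter


def count_bases(path):
    """Tally the characters in a file while holding only one line at a time."""
    counts = Counter()
    # Binary mode: iterating over the handle yields one bytes line per step.
    with open(path, "rb") as handle:
        for line in handle:
            # Iterating over bytes gives integer byte values.
            counts.update(line.rstrip(b"\r\n"))
    # Convert the byte values back to one-character strings for readability.
    return {chr(code): n for code, n in counts.items()}
```

If your operation looks like this, memory use stays roughly constant no
matter how large the file is.  If it needs random access or depends on
distant lines, a different approach may be called for.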

Here's also why it's important to talk more about what problem you are
trying to solve.  Your question has been assuming that the dominating
factor in your program's runtime is the access of your data, and that
loading the entire file into memory will improve performance.  But I see
no evidence to support that assumption yet.  Why should I not believe that
the time is being spent paging in virtual memory, for example, due to
something else in your program's operations?  In that case, trying to
load the file entirely into memory will be counterproductive.
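
One way to gather that evidence is to run the program under the standard
library's profiler and look at where the time actually goes.  A minimal
sketch follows; process is a hypothetical stand-in for whatever your
program does with the file, and profile_call is a helper name I made up.

```python
import cProfile
import io
import pstats


def process(path):
    """Hypothetical stand-in for the real per-file work."""
    total = 0
    with open(path, "rb") as handle:
        for line in handle:
            total += len(line)
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()
```

Note that the profiler reports time per function; it will not tell you
directly whether the machine is paging.  For that you would also want to
watch the operating system's memory statistics while the program runs.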


2.  What is the format of your input data?  You mention it is AUGC and
newlines, but more details would be really helpful.

Why is it line-oriented, for example?  I mean that as a serious question.
Is it significant?  Is it a FASTA file?  Is it some kind of homebrewed
format?

Please be as specific as you can be here: you may be duplicating effort
that folks who have spent _years_ on sequence-reading libraries have
already done for you.  Specifically, you might be able to reuse Biopython's
libraries for sequence IO.

    http://biopython.org/wiki/SeqIO

By trying to cook up file parsing by yourself, you may be making a mistake.
For example, there might be issues in Python 3 due to Unicode encodings:

    http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

which might contribute to an unexpected increase in the size of a string's
memory representation.  Hard to say, since it depends on a host of factors.
That said, other folks have probably encountered and solved this problem
already.  Concretely, I'm pretty sure Biopython's SeqIO does the
Right Thing in terms of reading files in binary mode and reading the line
contents as bytes, as opposed to regular strings, and representing the
sequence in some memory-efficient way.
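
To make the text-versus-bytes distinction concrete, here is a small sketch
that reads the same file both ways and reports the in-memory size of each
object.  The helper name compare_sizes is mine, and it assumes the file is
plain ASCII.  The numbers depend on the Python version and build, so treat
this as a way to measure rather than as a prediction.

```python
import sys


def compare_sizes(path):
    """Read a file as text and as bytes; report each object's size in bytes."""
    with open(path, "r", encoding="ascii") as handle:
        as_text = handle.read()       # str: decoded characters
    with open(path, "rb") as handle:
        as_bytes = handle.read()      # bytes: raw file contents
    as_array = bytearray(as_bytes)    # mutable counterpart of bytes
    return {
        "str": sys.getsizeof(as_text),
        "bytes": sys.getsizeof(as_bytes),
        "bytearray": sys.getsizeof(as_array),
    }
```

Running this on a small sample of your data should show whether the choice
of representation matters for your Python version before you commit to one.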

At the very least, I know that they think about these kinds of problems a
lot:

    http://web.archiveorange.com/archive/v/5dAwXDMfufikePQqtPgx

Probably a lot more than us.  :P

So if it's possible, try to leverage what's already out there.  You should
almost certainly not be writing your own sequence-reading code.