Creating Long Lists

Dan Stromberg drsalists at
Mon Feb 21 23:29:47 EST 2011

On Mon, Feb 21, 2011 at 7:24 PM, Dan Stromberg <drsalists at> wrote:

> On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack <
> zawackkfb at> wrote:
>> I have a large (10gb) data file for which I want to parse each line into
>> an object and then append this object to a list for sorting and further
>> processing.  I have noticed however that as the length of the list increases
>> the rate at which objects are added to it decreases dramatically.  My first
>> thought was that  I was nearing the memory capacity of the machine and the
>> decrease in performance was due to the os swapping things in and out of
>> memory.  When I looked at the memory usage this was not the case.  My
>> process was the only job running and was consuming 40gb of the the total
>> 130gb and no swapping processes were running.  To make sure there was not
>> some problem with the rest of my code, or the servers file system, I ran my
>> program again as it was but without the line that was appending items to the
>> list and it completed without problem indicating that the decrease in
>> performance is the result of some part of the process of appending to the
>> list.  Since other people have observed this problem as well (
>> I did not bother to further analyze or benchmark it.  Since the answers in
>> the above forums do not seem very definitive  I thought  I would inquire
>> here about what the reason for this decrease in performance is, and if there
>> is a way, or another data structure, that would avoid this problem.<>
> Do you have 130G of physical RAM, or 130G of virtual memory?  That makes a
> big difference.  (Yeah, I know, 130G of physical RAM is probably pretty rare
> today)
> Disabling garbage collection is a good idea, but if you don't have well
> over 10G of physical RAM, you'd probably better also use a (partially)
> disk-based sort.  To do otherwise would pretty much beg for swapping and a
> large slowdown.
> Merge sort works very well for very large datasets.
>  Just make your sublists be disk
> files, not in-memory lists - until you get down to a small enough sublist
> that you can sort it in memory, without thrashing.  Timsort (list_.sort())
> is excellent for in memory sorting.
> Actually, GNU sort is very good at sorting huge datasets - you could
> probably just open a subprocess to it, as long as you can make your data fit
> the line-oriented model GNU sort expects, and you have enough temporary disk
> space.

Depending on what you're doing after the sort, you might also look at
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the Python-list mailing list