[Python-Dev] Python3 regret about deleting list.sort(cmp=...)
Terry Reedy
tjreedy at udel.edu
Sun Mar 13 23:45:27 CET 2011
On 3/13/2011 2:05 PM, Daniel Stutzbach wrote:
> On Sat, Mar 12, 2011 at 3:44 PM, Guido van Rossum <guido at python.org
> <mailto:guido at python.org>> wrote:
>
> I recently advised a Googler who was sorting a large dataset and
> running out of memory. My analysis of the situation was that he was
> sorting a huge list of short lines of the form "shortstring,integer"
> with a key function that returned a tuple of the form ("shortstring",
> integer).
>
>
> As Raymond pointed out, a change I made for 3.2 significantly shrinks
> the memory footprint of sorting with a key (although it's still more
> memory-intensive than sorting with cmp).
>
> He could reduce the memory footprint further by sorting in two passes
> instead of using a tuple, leveraging the fact that Python guarantees a
> stable sort. In 3.2 or later, this technique will require roughly twice
> as much memory as just storing the list:
>
> biglist.sort(key=lambda s: int(s.split(',')[1])) # Sort by the integer
> biglist.sort(key=lambda s: s.split(',')[0]) # Sort by the shortstring
>
> I think the use cases are pretty narrow where there's plenty of memory
> for storing the list but not enough to store two copies.
Here are two variations on the idea of two pass sorting, both of which
only sort ints for duplicate shortstrings and neither of which use key=.
1. If duplication of shortstrings is (known to be) sparse, sort the
lines as they are, then scan to detect runs (slices) of duplicate
shortstrings and sort and replace those. (If sort had start and stop
index keywords, this could be done in place.)
2. If duplication of shortstrings is (known to be) heavy, read the
dataset into a shortstring-list_of_ints dict rathar than a list. Sort
the keys in a separate (much shorted) list and sort each value (list of
ints) separately. For some post-sort usages, this would be more useful
than a flat sorted list.
A third idea is to presort the dataset into chunks (a-m, n-z) or however
appropriate, perhaps as it is gathered; sort each separately; and then
concatenate. This will handle cases where even the flat list is too big
for memory.
--
Terry Jan Reedy
More information about the Python-Dev
mailing list