Complex sort on big files

sturlamolden sturlamolden at yahoo.no
Fri Aug 5 21:31:02 EDT 2011


On Aug 1, 5:33 pm, aliman <aliman... at googlemail.com> wrote:

> I understand that sorts are stable, so I could just repeat the whole
> sort process once for each key in turn, but that would involve going
> to and from disk once for each step in the sort, and I'm wondering if
> there is a better way.

I would consider using memory mapping the file and sorting it inline.
Sorting a binary file of bytes with NumPy is as easy as this:

import numpy as np
f = np.memmap(filename, mode='rwb', dtype=np.uint8)
f.sort(kind='quicksort')
del f

(You can define dtype for any C data type or struct.)

If the file is really big, use 64-bit Python.

With memory mapping you don't have to worry about processing the file
in chunks, because the operating
systems will take care of those details.

I am not sure how to achieve this (inline file sort) with standard
library mmap and timsort, so I'll leave that out.


Sturla



More information about the Python-list mailing list