Complex sort on big files
sturlamolden at yahoo.no
Fri Aug 5 21:31:02 EDT 2011
On Aug 1, 5:33 pm, aliman <aliman... at googlemail.com> wrote:
> I understand that sorts are stable, so I could just repeat the whole
> sort process once for each key in turn, but that would involve going
> to and from disk once for each step in the sort, and I'm wondering if
> there is a better way.
I would consider using memory mapping the file and sorting it inline.
Sorting a binary file of bytes with NumPy is as easy as this:
import numpy as np
f = np.memmap(filename, mode='rwb', dtype=np.uint8)
(You can define dtype for any C data type or struct.)
If the file is really big, use 64-bit Python.
With memory mapping you don't have to worry about processing the file
in chunks, because the operating
systems will take care of those details.
I am not sure how to achieve this (inline file sort) with standard
library mmap and timsort, so I'll leave that out.
More information about the Python-list