[Neuroimaging] indexed access to gziped files

Nathaniel Smith njs at pobox.com
Fri Mar 11 19:55:32 EST 2016

On Fri, Mar 11, 2016 at 2:20 PM, paul mccarthy <pauldmccarthy at gmail.com> wrote:
> Hi all,
> Sorry for the delay in my joining the conversation.
> Brendan is correct - this is not a memmap solution. The approach that I've
> implemented (which I have to emphasise is not my idea - I've just got it
> working in Python) just improves random seek/read time of the uncompressed
> data stream, while keeping the compressed data on disk. This is achieved by
> building an index of mappings between locations in the compressed and
> uncompressed data streams. The index can be fully built when the file is
> initially opened, or can be built on-demand as the file handle is used.
> So once an index is built, the IndexedGzipFile class can be used to read in
> parts of the compressed data, without having to decompress the entire file
> every time you seek to a new location. This is what is typically required
> when reading GZIP files, and is a fundamental limitation in the GZIP format.
> As Gael (and others) pointed out, using a different compression format would
> remove the need for silly indexing techniques like the one that I have
> implemented. But I figured that having something like indexed_gzip would
> make life a bit easier for those of us who have to work with large amounts
> of existing .nii.gz files, at least until a new file format is adopted.
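
(For anyone following along who hasn't seen the trick before, here is a
minimal sketch of the checkpoint idea described above. It is not the
indexed_gzip API: the class name CheckpointedGzip, the method read_at,
and the spacing/chunk parameters are all made up for illustration. It
keeps snapshots of Python's zlib decompressor in memory rather than a
compact on-disk index, and assumes a single-member gzip stream.)

```python
import bisect
import gzip
import io
import zlib


class CheckpointedGzip:
    """Toy random-access reader for a gzip stream (hypothetical sketch)."""

    def __init__(self, fileobj, spacing=1 << 20, chunk=1 << 14):
        self.f = fileobj
        self.chunk = chunk
        # Each entry: (uncompressed offset, compressed offset, decompressor snapshot)
        self.points = []
        self._build(spacing)

    def _build(self, spacing):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self.f.seek(0)
        upos = cpos = 0
        next_mark = 0
        while True:
            if upos >= next_mark:
                # Snapshot the decompressor state at this chunk boundary.
                self.points.append((upos, cpos, d.copy()))
                next_mark = upos + spacing
            buf = self.f.read(self.chunk)
            if not buf:
                break
            upos += len(d.decompress(buf))
            cpos += len(buf)
            if d.eof:
                break
        self.size = upos  # total uncompressed length

    def read_at(self, offset, n):
        # Find the last checkpoint at or before the requested offset.
        starts = [p[0] for p in self.points]
        i = bisect.bisect_right(starts, offset) - 1
        upos, cpos, snap = self.points[i]
        d = snap.copy()  # leave the stored snapshot untouched
        self.f.seek(cpos)
        out = bytearray()
        skip = offset - upos
        # Decompress forward from the checkpoint only, not from byte 0.
        while len(out) < skip + n:
            buf = self.f.read(self.chunk)
            if not buf:
                break
            out += d.decompress(buf)
            if d.eof:
                break
        return bytes(out[skip:skip + n])


# Usage: compress some deterministic data, then read a slice from the middle.
data = b"".join(i.to_bytes(4, "little") for i in range(300000))
reader = CheckpointedGzip(io.BytesIO(gzip.compress(data)), spacing=65536, chunk=4096)
middle = reader.read_at(700001, 50)
```

(The trade-off is the usual one: closer checkpoints mean faster seeks
but a bigger index.)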

It's possible to create .gz files that allow seeking but are still
compliant with all the usual standards (e.g. regular gunzip still
works).


It sounds like the biopython folks are on top of this...

The excellent xz tool suite has similar features:


> Going back to the topic of memory-mapping - I'm pretty sure that it is
> completely impossible to achieve true memory-mapping of compressed data,
> unless you're working at the OS kernel level.

100% pedantic and impractical correction: technically it is totally
possible; the Dato folks did it for their numpy/SArray wrappers. The
solution is to implement your own VM mapping system by registering
your page fault routine as a SIGSEGV handler, and have it call mmap to
manipulate the page tables. (If the previous sentence doesn't mean
anything to you, then that's probably a good thing ...there's a
difference between whether you *can* do something and whether you
*should* ;-).)

(Also, the result is unlikely to be particularly fast, and you still
need some way to actually do the fast random access to the compressed
disk file.)


Nathaniel J. Smith -- https://vorpus.org
