[Neuroimaging] indexed access to gzipped files

Matthew Brett matthew.brett at gmail.com
Fri Mar 11 17:30:25 EST 2016


Hi,

On Fri, Mar 11, 2016 at 2:20 PM, paul mccarthy <pauldmccarthy at gmail.com> wrote:
> Hi all,
>
> Sorry for the delay in my joining the conversation.
>
> Brendan is correct - this is not a memmap solution. The approach that I've
> implemented (which I have to emphasise is not my idea - I've just got it
> working in Python) just improves random seek/read time of the uncompressed
> data stream, while keeping the compressed data on disk. This is achieved by
> building an index of mappings between locations in the compressed and
> uncompressed data streams. The index can be fully built when the file is
> initially opened, or can be built on-demand as the file handle is used.
>
> So once an index is built, the IndexedGzipFile class can be used to read
> in parts of the compressed data without having to decompress the entire
> file every time you seek to a new location. Normally, seeking requires
> decompressing everything up to the target offset - a fundamental
> limitation of the GZIP format.
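The indexing scheme described above can be illustrated with a toy sketch. The real indexed_gzip snapshots zlib decompressor state at regular seek points; the simplified version below instead writes the data as independent concatenated gzip members and records the (compressed offset, uncompressed offset) of each member, so a seek only ever decompresses one small member rather than the whole stream. The function names and chunk size are illustrative, not indexed_gzip's API.

```python
import gzip
import io

# Toy illustration of the indexing idea (NOT the real indexed_gzip
# implementation, which snapshots zlib stream state): write the data as
# independent gzip members and record, for each member, its
# (compressed offset, uncompressed offset). A seek then decompresses
# only the one member covering the target offset.

CHUNK = 1024  # uncompressed bytes per gzip member (illustrative choice)

def write_indexed(data, fileobj):
    """Compress `data` as concatenated gzip members; return the index."""
    index = []  # list of (compressed_offset, uncompressed_offset)
    for u_off in range(0, len(data), CHUNK):
        index.append((fileobj.tell(), u_off))
        fileobj.write(gzip.compress(data[u_off:u_off + CHUNK]))
    return index

def read_at(fileobj, index, u_off, size):
    """Read `size` uncompressed bytes starting at uncompressed `u_off`."""
    # Find the last index point at or before the requested offset
    c_start, u_start = max(p for p in index if p[1] <= u_off)
    fileobj.seek(c_start)
    member = gzip.GzipFile(fileobj=fileobj).read(CHUNK)
    return member[u_off - u_start:u_off - u_start + size]

data = bytes(range(256)) * 40          # 10240 bytes of test data
buf = io.BytesIO()
index = write_indexed(data, buf)
assert read_at(buf, index, 5000, 16) == data[5000:5016]
```

Building the full index up front corresponds to scanning the whole file once at open time; building it on demand corresponds to appending index points as reads progress past them.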
>
> As Gael (and others) pointed out, using a different compression format would
> remove the need for silly indexing techniques like the one that I have
> implemented. But I figured that having something like indexed_gzip would
> make life a bit easier for those of us who have to work with large amounts
> of existing .nii.gz files, at least until a new file format is adopted.
>
> Going back to the topic of memory-mapping - I'm pretty sure that it is
> completely impossible to achieve true memory-mapping of compressed data,
> unless you're working at the OS kernel level. This means that it is not
> possible to wrap compressed data with a numpy array, because numpy arrays
> require access to a raw chunk of memory (which itself could be memory
> mapped, but must provide access to the raw array data). Gael pointed this
> out to me during the Brainhack, and I discovered it myself about an hour
> later :)
>
> In order to use indexed_gzip in nibabel, the best that we would be able to
> achieve is an ArrayProxy-like wrapper. For my requirements (visualisation),
> this is perfectly acceptable. All I want to do is to pull out arbitrary 3D
> volumes, and/or to pull out the time courses from individual voxels, from
> arbitrarily large 4D data sets.
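The ArrayProxy-like wrapper mentioned above amounts to the following pattern (a sketch only, not nibabel's actual ArrayProxy class): the data stays on disk, and reading one 3D volume of a 4D series is just a seek to that volume's byte offset plus a single read. With an IndexedGzipFile underneath, the same seek/read pattern would apply to .nii.gz data.

```python
import io
import numpy as np

# Minimal ArrayProxy-style sketch (not nibabel's real ArrayProxy): only
# the requested 3D volume of a 4D series is read from the file object,
# by seeking to its byte offset. VolumeProxy is a hypothetical name.

class VolumeProxy:
    def __init__(self, fileobj, vol_shape, nvols, dtype=np.float32, offset=0):
        self.fileobj = fileobj
        self.vol_shape = vol_shape            # (x, y, z)
        self.nvols = nvols
        self.dtype = np.dtype(dtype)
        self.offset = offset                  # e.g. the header size
        self.vol_bytes = int(np.prod(vol_shape)) * self.dtype.itemsize

    def get_volume(self, t):
        """Read only the t-th volume, leaving the rest of the file untouched."""
        self.fileobj.seek(self.offset + t * self.vol_bytes)
        raw = self.fileobj.read(self.vol_bytes)
        return np.frombuffer(raw, dtype=self.dtype).reshape(self.vol_shape)

# Usage: an in-memory buffer stands in for a large on-disk 4D image
vols = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(5, 2, 3, 4)
proxy = VolumeProxy(io.BytesIO(vols.tobytes()), (2, 3, 4), nvols=5)
assert np.array_equal(proxy.get_volume(3), vols[3])
```

Per-voxel time courses work the same way, except with one small seek/read per time point instead of one per volume.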
>
> But, while experimenting with patching nibabel to use my IndexedGzipFile
> class (instead of the GzipFile or nibabel.openers.BufferedGzipFile classes),
> I discovered that instances of the nibabel Nifti1Image class do not seem to
> keep file handles open once they have been created - they appear to re-open
> the file (and re-create an IndexedGzipFile instance) every time the image
> data is accessed through the ArrayProxy dataobj attribute.
>
>  So some discussion would be needed regarding how we could go about
> allowing nibabel to use indexed_gzip. Do we modify nibabel? Or can we
> build some sort of index cache, which allows IndexedGzipFile instances
> to be created/destroyed but preserves existing index mappings, so that
> the index does not have to be re-created every time a new
> IndexedGzipFile is created?
>
> Honestly, given the current state of indexed_gzip, we're probably still
> some way off from there being any point in having such a discussion. But
> I'm keen to pursue this if the Nibabel guys are, as it would make my
> life easier if I could keep using the nibabel interface while getting
> the speed improvements offered by indexed_gzip.
>
> As for Python 2 vs 3 support, I'm not an expert in writing Python extensions
> - this is the first non-trivial extension that I've written. So I'm not sure
> of what would be required to write an extension which would work under both
> Python 2 and 3. If anybody is willing to help out, I would really appreciate
> it!
>
> Thanks, and apologies for the rant-ish nature of this email!

Please don't worry about rantishness, I didn't detect it myself :)

Yes, nibabel drops the file handles.  It could cache them, but it's
fairly easy to hit a situation where you're opening hundreds or
thousands of little image files, and that exhausts file handles.  In
fact, Gael hit this problem a few years ago, and we had to add a test
to make sure we were dropping them.

This isn't the case if you create an image from a file object directly.
I can also imagine a non-default flag to the image loading routine that
preserves the file objects, or a default that keeps compressed file
objects open while dropping uncompressed ones.
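One way to keep compressed file objects open without hitting the handle-exhaustion problem described above would be a small LRU cache that closes the least recently used handle when it overflows. This is a sketch of the idea only, not nibabel's actual behaviour; HandleCache and its opener callback are hypothetical names.

```python
import io
from collections import OrderedDict

# Sketch of a bounded file-handle cache: at most `maxsize` handles stay
# open, and the least recently used one is closed on overflow. Not
# nibabel's actual behaviour; all names here are hypothetical.

class HandleCache:
    def __init__(self, opener, maxsize=8):
        self.opener = opener          # callable: filename -> open file object
        self.maxsize = maxsize
        self.handles = OrderedDict()  # filename -> open handle, LRU order

    def get(self, filename):
        if filename in self.handles:
            self.handles.move_to_end(filename)      # mark as recently used
        else:
            if len(self.handles) >= self.maxsize:
                _, oldest = self.handles.popitem(last=False)
                oldest.close()                      # evict the LRU handle
            self.handles[filename] = self.opener(filename)
        return self.handles[filename]

# Usage: BytesIO objects stand in for opened compressed images
cache = HandleCache(lambda name: io.BytesIO(name.encode()), maxsize=2)
a = cache.get('a.nii.gz')
b = cache.get('b.nii.gz')
c = cache.get('c.nii.gz')   # evicts and closes the handle for a.nii.gz
assert a.closed and not b.closed and not c.closed
```

The cache bounds the number of open descriptors, which is what makes it safe for the hundreds-of-small-files case while still helping the few-large-compressed-files case.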

Did you consider Cython for your bindings?  It's very good for
cross-Python compatibility and readability, if the wrapping problem is
reasonably simple.

Cheers,

Matthew

