[Neuroimaging] indexed access to gziped files

Fri Mar 11 17:20:11 EST 2016

Hi all,

Sorry for the delay in my joining the conversation.

Brendan is correct - this is not a memmap solution. The approach that I've
implemented (which I have to emphasise is not my idea - I've just got it
working in Python) just improves random seek/read time of the uncompressed
data stream, while keeping the compressed data on disk. This is achieved by
building an index of mappings between locations in the compressed and
uncompressed data streams. The index can be fully built when the file is
initially opened, or can be built on-demand as the file handle is used.
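
As a rough illustration of the indexing idea (a sketch only, not the
indexed_gzip code): the snippet below uses the standard zlib module and
records a "seek point" after every CHUNK compressed bytes. The names
build_index, read_at and CHUNK are made up for this example, and storing a
full copy of the decompressor state at each seek point is an assumption
made for simplicity; it costs memory for every point that is kept.

```python
import bisect
import gzip
import hashlib
import zlib

CHUNK = 4096  # compressed bytes consumed between two seek points


def build_index(compressed, chunk=CHUNK):
    """Return a list of (uncompressed_offset, compressed_offset, snapshot)."""
    decomp = zlib.decompressobj(31)  # wbits=31: expect a gzip header/trailer
    index = [(0, 0, decomp.copy())]
    out_pos = 0
    for in_pos in range(0, len(compressed), chunk):
        out_pos += len(decomp.decompress(compressed[in_pos:in_pos + chunk]))
        if decomp.eof:
            break
        # Snapshot the decompressor so reading can resume from this point.
        index.append((out_pos, in_pos + chunk, decomp.copy()))
    return index


def read_at(compressed, index, offset, size, chunk=CHUNK):
    """Read `size` uncompressed bytes starting at uncompressed `offset`."""
    starts = [entry[0] for entry in index]
    # Last seek point located at or before the requested offset.
    out_pos, in_pos, snapshot = index[bisect.bisect_right(starts, offset) - 1]
    decomp = snapshot.copy()  # leave the stored snapshot untouched
    skip = offset - out_pos
    pieces, have = [], 0
    while have < skip + size and in_pos < len(compressed) and not decomp.eof:
        data = decomp.decompress(compressed[in_pos:in_pos + chunk])
        in_pos += chunk
        pieces.append(data)
        have += len(data)
    return b"".join(pieces)[skip:skip + size]


if __name__ == "__main__":
    # Poorly compressible test data, so that several seek points are created.
    payload = b"".join(
        hashlib.sha256(str(i).encode()).digest() for i in range(3000))
    compressed = gzip.compress(payload)
    index = build_index(compressed)
    assert read_at(compressed, index, 50000, 64) == payload[50000:50064]
```

Here the whole index is built up front; building it on demand would mean
appending seek points as the stream is read for the first time.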

So once an index is built, the IndexedGzipFile class can be used to read in
parts of the compressed data, without having to decompress the entire file
every time you seek to a new location. Without an index, that kind of full
decompression is what is typically required when reading GZIP files, and
is a fundamental limitation of the GZIP format.

As Gael (and others) pointed out, using a different compression format
would remove the need for silly indexing techniques like the one that I
have implemented. But I figured that having something like indexed_gzip
would make life a bit easier for those of us who have to work with large
amounts of existing .nii.gz files, at least until a new file format is
adopted.

Going back to the topic of memory-mapping - I'm pretty sure that it is
completely impossible to achieve true memory-mapping of compressed data,
unless you're working at the OS kernel level. This means that it is not
possible to wrap compressed data with a numpy array, because numpy arrays
require access to a raw chunk of memory (which itself could be memory
mapped, but must provide access to the raw array data). Gael pointed this
out to me during the Brainhack, and I discovered it myself about an hour
later :)

In order to use indexed_gzip in nibabel, the best that we would be able to
achieve is an ArrayProxy-like wrapper. For my requirements (visualisation),
this is perfectly acceptable. All I want to do is to pull out arbitrary 3D
volumes, and/or to pull out the time courses from individual voxels, from
arbitrarily large 4D data sets.
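
To make the access pattern concrete, here is a minimal sketch of such a
wrapper. It is not nibabel's ArrayProxy; the class name VolumeProxy and its
arguments are hypothetical. It assumes the voxel data are stored with x
varying fastest, then y, z and t (as in NIfTI), starting data_offset bytes
into the uncompressed stream, and that the file object supports seek() and
read() on uncompressed offsets.

```python
import io
import struct


class VolumeProxy:
    """Read pieces of a 4D image from any seekable binary file object."""

    def __init__(self, fileobj, shape, data_offset, byteorder="<", code="f"):
        self.fileobj = fileobj
        self.nx, self.ny, self.nz, self.nt = shape
        self.data_offset = data_offset  # where the voxel data start
        self.byteorder = byteorder      # struct byte-order character
        self.code = code                # struct type code of one voxel
        self.itemsize = struct.calcsize(byteorder + code)

    def _read_values(self, byte_offset, count):
        self.fileobj.seek(self.data_offset + byte_offset)
        raw = self.fileobj.read(count * self.itemsize)
        return struct.unpack(f"{self.byteorder}{count}{self.code}", raw)

    def volume(self, t):
        """Return the 3D volume at time point t as a flat tuple."""
        nvox = self.nx * self.ny * self.nz
        # One volume is a single contiguous run of bytes.
        return self._read_values(t * nvox * self.itemsize, nvox)

    def timecourse(self, x, y, z):
        """Return the values of voxel (x, y, z) at every time point."""
        nvox = self.nx * self.ny * self.nz
        first = x + self.nx * (y + self.ny * z)
        # One seek and one small read per time point.
        return tuple(
            self._read_values((t * nvox + first) * self.itemsize, 1)[0]
            for t in range(self.nt))


if __name__ == "__main__":
    shape = (2, 3, 4, 5)
    values = [float(i) for i in range(2 * 3 * 4 * 5)]
    fake_file = io.BytesIO(b"\0" * 8 + struct.pack("<120f", *values))
    proxy = VolumeProxy(fake_file, shape, data_offset=8)
    assert proxy.timecourse(1, 2, 3) == (23.0, 47.0, 71.0, 95.0, 119.0)
```

A time course needs one seek per time point, which is the case where fast
random access into the compressed stream matters most.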

But, while experimenting with patching nibabel to use my IndexedGzipFile
class (instead of the GzipFile or nibabel.openers.BufferedGzipFile
classes), I discovered that instances of the nibabel Nifti1Image class do
not seem to keep file handles open once they have been created - they
appear to re-open the file (and re-create an IndexedGzipFile instance)
every time the image data is accessed through the ArrayProxy dataobj
attribute.

So some discussion would be needed regarding how we could go about
allowing nibabel to use indexed_gzip. Do we modify nibabel? Or can we build
some sort of index cache which allows IndexedGzipFile instances to be
created/destroyed while existing index mappings are re-used (without
having to re-create the index every time a new IndexedGzipFile is created)?
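
One possible shape for such a cache, as a sketch only: indexes are kept in
a dictionary keyed by file path, and are rebuilt when the file's size or
modification time changes. The IndexCache class and the build_index
callable are hypothetical; nothing like this exists in nibabel or
indexed_gzip as far as this thread describes.

```python
import os


class IndexCache:
    """Keep built indexes so that re-opened files need not rebuild them."""

    def __init__(self, build_index):
        self._build_index = build_index  # callable: path -> index object
        self._entries = {}               # absolute path -> (stamp, index)

    def get(self, path):
        """Return the index for path, building it only when necessary."""
        stat = os.stat(path)
        key = os.path.abspath(path)
        stamp = (stat.st_mtime_ns, stat.st_size)
        cached = self._entries.get(key)
        if cached is None or cached[0] != stamp:
            # First request, or the file changed since the index was built.
            cached = (stamp, self._build_index(path))
            self._entries[key] = cached
        return cached[1]
```

Each newly created file object would then ask the cache for its index
instead of building one itself.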

Honestly, with the current state of indexed_gzip, we're probably still a
way off before there's even any point in having such a discussion. But I'm
keen to pursue this if the Nibabel guys are, as it would make my life
easier if I could keep using the nibabel interface, but get the speed
improvements offered by indexed_gzip.
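
Matthew's suggestion quoted below (an optional package, used for gzip files
only if importable) could look roughly like this. The helper name open_gz
is hypothetical, and both the import path and the assumption that
IndexedGzipFile can be constructed from a file name are guesses that would
need checking against the actual package.

```python
import gzip

try:
    # Assumption: the package is importable under this name and exposes
    # the IndexedGzipFile class described in this thread.
    from indexed_gzip import IndexedGzipFile
    HAVE_INDEXED_GZIP = True
except ImportError:
    IndexedGzipFile = None
    HAVE_INDEXED_GZIP = False


def open_gz(path):
    """Return a readable, seekable file object for a .gz file."""
    if HAVE_INDEXED_GZIP:
        # Assumption: the constructor accepts a file name.
        return IndexedGzipFile(path)
    # Fallback: the standard library class (seeking works, but is slow).
    return gzip.GzipFile(path, "rb")
```

Code using the returned object only relies on seek() and read(), so it
behaves the same way with or without the optional package, just at
different speeds.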

As for Python 2 vs 3 support, I'm not an expert in writing Python
extensions - this is the first non-trivial extension that I've written. So
I'm not sure of what would be required to write an extension which would
work under both Python 2 and 3. If anybody is willing to help out, I would
really appreciate it!

Thanks, and apologies for the rant-ish nature of this email!

Paul

On 11 March 2016 at 17:46, Brendan Moloney <moloney at ohsu.edu> wrote:

> I don't see any mention of memmaps on the github page. It seems like the
> code is just storing extra bits of info for different "seek points" that
> allow you to access random parts of the file without decompressing
> everything before it.  So I think this could help with doing partial
> loading of the dataobj even without memmaps. Unless I am missing
> something...
>
> - Brendan
>
>
> ------------------------------
> *From:* Neuroimaging [neuroimaging-bounces+moloney=ohsu.edu at python.org]
> on behalf of Samuel St-Jean [stjeansam at gmail.com]
> *Sent:* Friday, March 11, 2016 12:39 AM
> *To:* Neuroimaging analysis in Python
> *Subject:* Re: [Neuroimaging] indexed access to gziped files
>
> If you ever go for all memmaped files, please also provide an easy way to
> return plain numpy arrays (like the unload arg for nibabel.load). Memmaps
> don't support all the kwargs in subfunctions, which can lead to weird
> broadcasting behavior.
>
> 2016-03-11 9:11 GMT+01:00 Chris Filo Gorgolewski <
> krzysztof.gorgolewski at gmail.com>:
>
>>
>> On Mar 10, 2016 11:00 PM, "Matthew Brett" <matthew.brett at gmail.com>
>> wrote:
>> >
>> > On Thu, Mar 10, 2016 at 10:33 PM, Gael Varoquaux
>> > <gael.varoquaux at normalesup.org> wrote:
>> > > Indeed, Paul did that at the brain hack. It's a great initiative.
>> > >
>> > > Two caveats. First it needs compiled code (which means that it cannot
>> > > just be copied in nibabel).
>> >
>> > No, but it could be an optional package used for gzip files if
>> importable.
>> +1! It would be a great optional feature.
>> >
>> > Cheers,
>> >
>> > Matthew
>> > _______________________________________________
>> > Neuroimaging mailing list
>> > Neuroimaging at python.org
>> > https://mail.python.org/mailman/listinfo/neuroimaging