[Neuroimaging] indexed access to gzipped files
paul mccarthy
pauldmccarthy at gmail.com
Mon Mar 14 06:51:31 EDT 2016
Hi all,
> This isn't so if you create an image via the fileobject itself.
Matthew, is this currently possible in nibabel? I had a quick play, and
poke through the code, but I couldn't get anything to work - it looks like
there is no "from_fileobj" method defined in the Nifti1Image class (or any
of its bases).
If this is (or will be) possible, then the problem is solved, isn't
it? Users of nibabel can just create IndexedGzipFile instances themselves,
and pass the handle to nibabel. No need for nibabel to be dependent upon
indexed_gzip - the choice would be up to the caller. Or am I missing
something here?
> It's possible to create .gz files that allow seeking but are still
> compliant with all the usual standards (e.g. regular gunzip still
> works):
Nathaniel, this is definitely a possibility - I did read through those blog
posts before going down the indexed_gzip route. But I wanted a solution for
existing data, which is already in unseekable .gz format, and not burden
the owners/users/researchers with having to re-encode all of their image
data.
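
For existing files, the indexing idea can be sketched with nothing but the
standard library. The snippet below is only an illustration of the concept,
not how indexed_gzip is actually implemented: it snapshots a zlib decompressor
with copy() every CHUNK compressed bytes, and build_index, read_at and CHUNK
are names made up for this example. It also keeps the whole compressed stream
in memory, which a real implementation would avoid.

```python
import gzip
import random
import zlib

CHUNK = 16 * 1024  # compressed bytes between checkpoints (arbitrary choice)


def build_index(compressed):
    """Return checkpoints: (compressed_offset, uncompressed_offset, state)."""
    decomp = zlib.decompressobj(31)  # 31 = gzip container, 32 KiB window
    index = []
    uncompressed_pos = 0
    for compressed_pos in range(0, len(compressed), CHUNK):
        # Snapshot the decompressor state *before* feeding this chunk.
        index.append((compressed_pos, uncompressed_pos, decomp.copy()))
        piece = compressed[compressed_pos:compressed_pos + CHUNK]
        uncompressed_pos += len(decomp.decompress(piece))
    return index


def read_at(compressed, index, offset, size):
    """Read `size` uncompressed bytes at `offset`, resuming from a checkpoint."""
    # Last checkpoint whose uncompressed offset is at or before the target.
    comp_pos, uncomp_pos, state = [e for e in index if e[1] <= offset][-1]
    decomp = state.copy()  # leave the stored checkpoint reusable
    skip = offset - uncomp_pos
    out = bytearray()
    while len(out) < skip + size and comp_pos < len(compressed):
        out += decomp.decompress(compressed[comp_pos:comp_pos + CHUNK])
        comp_pos += CHUNK
    return bytes(out[skip:skip + size])


# Poorly compressible test data, so the stream spans several checkpoints.
rng = random.Random(0)
raw = bytes(rng.getrandbits(8) for _ in range(200_000))
compressed = gzip.compress(raw)
index = build_index(compressed)
assert read_at(compressed, index, 150_000, 64) == raw[150_000:150_064]
```

The price is memory: every checkpoint carries a copy of the decompressor
state, so the checkpoint spacing trades index size against seek time.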
Having said this, I think it would be a good thing if all of our code which
writes out nifti files would use a better compression scheme, be it
seekable gzip, xz, bz2, or whatever.
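
As a rough sketch of what "seekable gzip" means for newly written files: a
.gz file may consist of several concatenated gzip members, and each member can
be decompressed on its own. The snippet below is not BGZF itself (BGZF records
the block size inside each gzip header, whereas this keeps the offsets in a
separate list), and write_blocked, read_block and BLOCK are hypothetical names
used only for this example.

```python
import gzip
import io

BLOCK = 64 * 1024  # uncompressed bytes per gzip member (arbitrary choice)


def write_blocked(raw):
    """Compress `raw` as concatenated gzip members; return (data, offsets)."""
    out = io.BytesIO()
    offsets = []
    for start in range(0, len(raw), BLOCK):
        offsets.append(out.tell())  # where this member starts in the file
        out.write(gzip.compress(raw[start:start + BLOCK]))
    return out.getvalue(), offsets


def read_block(data, offsets, block_no):
    """Decompress only the requested member, leaving the rest untouched."""
    begin = offsets[block_no]
    end = offsets[block_no + 1] if block_no + 1 < len(offsets) else len(data)
    return gzip.decompress(data[begin:end])


raw = bytes(range(256)) * 1000  # 256,000 bytes of sample data
data, offsets = write_blocked(raw)

# An ordinary gzip reader still sees one valid stream...
assert gzip.decompress(data) == raw
# ...while a reader that knows the offsets can jump straight to one block.
assert read_block(data, offsets, 2) == raw[2 * BLOCK:3 * BLOCK]
```

Smaller blocks give finer-grained seeking at the cost of a worse compression
ratio, since each member starts with an empty dictionary.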
> The solution is to implement your own VM mapping system by registering
> your page fault routine as a SIGSEGV handler, and have it call mmap to
> manipulate the page tables.
A valid point, but I think I'll leave this one to you!
Cheers,
Paul
On 12 March 2016 at 00:55, Nathaniel Smith <njs at pobox.com> wrote:
> On Fri, Mar 11, 2016 at 2:20 PM, paul mccarthy <pauldmccarthy at gmail.com> wrote:
> > Hi all,
> >
> > Sorry for the delay in my joining the conversation.
> >
> > Brendan is correct - this is not a memmap solution. The approach that I've
> > implemented (which I have to emphasise is not my idea - I've just got it
> > working in Python) just improves random seek/read time of the uncompressed
> > data stream, while keeping the compressed data on disk. This is achieved by
> > building an index of mappings between locations in the compressed and
> > uncompressed data streams. The index can be fully built when the file is
> > initially opened, or can be built on-demand as the file handle is used.
> >
> > So once an index is built, the IndexedGzipFile class can be used to read in
> > parts of the compressed data, without having to decompress the entire file
> > every time you seek to a new location. This is what is typically required
> > when reading GZIP files, and is a fundamental limitation in the GZIP format.
> >
> > As Gael (and others) pointed out, using a different compression format would
> > remove the need for silly indexing techniques like the one that I have
> > implemented. But I figured that having something like indexed_gzip would
> > make life a bit easier for those of us who have to work with large amounts
> > of existing .nii.gz files, at least until a new file format is adopted.
>
> It's possible to create .gz files that allow seeking but are still
> compliant with all the usual standards (e.g. regular gunzip still
> works):
>
>
> http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
>
> It sounds like the biopython folks are on top of this...
>
> The excellent xz tool suite has similar features:
>
>
> http://blastedbio.blogspot.com/2013/04/random-access-to-blocked-xz-format-bxzf.html
>
> > Going back to the topic of memory-mapping - I'm pretty sure that it is
> > completely impossible to achieve true memory-mapping of compressed data,
> > unless you're working at the OS kernel level.
>
> 100% pedantic and impractical correction: technically it is totally
> possible; the Dato folks did it for their numpy/SArray wrappers. The
> solution is to implement your own VM mapping system by registering
> your page fault routine as a SIGSEGV handler, and have it call mmap to
> manipulate the page tables. (If the previous sentence doesn't mean
> anything to you, then that's probably a good thing ...there's a
> difference between whether you *can* do something and whether you
> *should* ;-).)
>
> (Also, the result is unlikely to be particularly fast, and you still
> need some way to actually do the fast random access to the compressed
> disk file.)
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org
> _______________________________________________
> Neuroimaging mailing list
> Neuroimaging at python.org
> https://mail.python.org/mailman/listinfo/neuroimaging
>