[Numpy-discussion] I/O documentation and code

Sat Jun 20 18:24:41 EDT 2009

On Sat, Jun 20, 2009 at 5:33 PM, Ralf
Gommers <ralf.gommers at googlemail.com> wrote:
> Hi,
>
> I'm working on the I/O documentation, and have a bunch of questions.
>
> 1. The npy/npz formats are documented in lib.format and in the NEP
> (http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt). Is
> lib.format the right place to add relevant parts of the NEP, or would doc.io
> be better? Or create a separate page (maybe doc.npy_format)? And is the .npz
> format fixed or still in flux?
>
> 2. Is the .npy format version number (now at 1.0) independent of the numpy
> version numbering, when is it incremented, and will it be backwards
> compatible?
>
> 3. For a longer coherent overview of I/O, does that go in doc.io or
> routines.io.rst?
>
> 4. This page http://www.scipy.org/Data_sets_and_examples talks about
> including data sets with scipy, has this happened? Would it be possible to
> include a single small dataset in numpy for use in examples?
>
> 5. DataSource contains a lot of TODOs and behavior that is documented as a
> bug in the docstring. Is anyone working on this? If not, I can give it a go.

This was proposed as a GSoC project and I went through it, but that's
about all I know.  I can't find my notes now, but here are some
thoughts off the top of my head.  For the record, the code is here:
<http://svn.scipy.org/svn/numpy/trunk/numpy/lib/_datasource.py>

> TODOs that need work, or at least a yes/no decision:
> 5a. .zip and .tar support (is .tar needed?)

Would these be trivial to implement?  Since the import overhead is
deferred until the support is actually needed, I don't see the harm in
including it.
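
As a rough sketch of the deferred-import idea (this is not the actual
DataSource code; `read_member` is a made-up helper and the list of
recognized extensions is an assumption), the archive modules can be
imported only inside the branch that needs them:

```python
def read_member(path, member):
    """Return the raw bytes of `member` from the .zip or .tar archive at `path`."""
    lower = path.lower()
    if lower.endswith(".zip"):
        import zipfile  # deferred: loaded only when a .zip is actually opened
        with zipfile.ZipFile(path) as archive:
            return archive.read(member)
    if lower.endswith((".tar", ".tar.gz", ".tar.bz2")):
        import tarfile  # deferred for the same reason
        with tarfile.open(path) as archive:  # default mode detects compression
            fobj = archive.extractfile(member)
            if fobj is None:  # member is a directory or other non-regular file
                raise ValueError("%r is not a regular file in %r" % (member, path))
            return fobj.read()
    raise ValueError("unsupported archive type: %r" % path)
```

Code that never touches an archive never pays for importing zipfile or
tarfile.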

> 5b. URLs only work if they include 'http://' (currently documented as a bug,
> which it isn't necessarily. fix or document?)

I would say document it.  We might support any number of protocols, so
it may not make sense to just default to http://.
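
To illustrate why requiring an explicit scheme is defensible, here is a
minimal sketch (the `is_url` name and the `KNOWN_SCHEMES` tuple are
assumptions for the example, not what DataSource does) that treats a
path as a URL only when it names a protocol:

```python
from urllib.parse import urlparse

# Assumed set of protocols; a real implementation might accept more.
KNOWN_SCHEMES = ("http", "https", "ftp")

def is_url(path):
    """Return True only if `path` carries an explicit, known scheme and a host."""
    parsed = urlparse(path)
    return parsed.scheme in KNOWN_SCHEMES and bool(parsed.netloc)
```

With this check, 'www.example.com/data.txt' is treated as a local path
rather than silently being turned into an http:// request, while
'ftp://host/data.txt' is recognized as remote.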

> 5c. _cache() does not handle compressed files, and should use
> shutil.copyfile

I never understood what this meant, but maybe I'm missing something.
If path is a compressed file, then it is written to a local directory
as a compressed file.  What else does it need to handle?  Should the
sequence be: fetch the archive, extract it (single file or whole
archive), then cache the result locally?
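
The two behaviors in question can be sketched as follows.  This is only
an illustration under assumptions: `cache_file` and its `decompress`
flag are invented for the example, it handles local .gz files only, and
it is not how _cache() is written.

```python
import gzip
import os
import shutil

def cache_file(src, cache_dir, decompress=False):
    """Copy `src` into `cache_dir`; optionally gunzip it on the way in."""
    os.makedirs(cache_dir, exist_ok=True)
    name = os.path.basename(src)
    if decompress and name.endswith(".gz"):
        # Store the extracted contents under the name without ".gz".
        dest = os.path.join(cache_dir, name[:-3])
        with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    else:
        # Keep the file byte-for-byte, compressed or not.
        dest = os.path.join(cache_dir, name)
        shutil.copyfile(src, dest)
    return dest
```

With decompress=False the cache holds the compressed file as-is; with
decompress=True it holds the extracted contents instead.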

> 5d. make abspath() more robust
> 5e. in open(), support for creating files and adding a 'subdir' parameter
> (needed?)
>

I would think there should be support for both of these.  I have some
rough scripts that I use for remote data fetching; I like them to
create a ./tmp directory, cache the file there, and then clean up when
I'm done.
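
A minimal sketch of that create/cache/clean-up pattern, assuming a
hypothetical `scratch_cache` helper (not part of DataSource) built on
the standard tempfile and shutil modules:

```python
import contextlib
import os
import shutil
import tempfile

@contextlib.contextmanager
def scratch_cache(parent="."):
    """Create a temporary cache directory under `parent`, remove it on exit."""
    path = tempfile.mkdtemp(prefix="tmp_", dir=parent)
    try:
        yield path
    finally:
        # Runs even if the body raises, so nothing is left behind.
        shutil.rmtree(path, ignore_errors=True)
```

Used as `with scratch_cache() as cache_dir: ...`, any files fetched into
`cache_dir` disappear along with the directory when the block ends.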

> Does anyone have (self-contained) code using DataSource, or a suggestion for
> data on the web that can be used in examples?
>

I'm not sure if this is what you're after, but I've been using some of
these "classic published results", and some compressed archives are
available there:

http://www.stanford.edu/~clint/bench/

> Cheers,
> Ralf
>

Skipper