[Numpy-discussion] I/O documentation and code

Sat Jun 20 20:28:14 EDT 2009

On Sat, Jun 20, 2009 at 6:24 PM, Skipper Seabold <jsseabold at gmail.com> wrote:

> On Sat, Jun 20, 2009 at 5:33 PM, Ralf Gommers
> <ralf.gommers at googlemail.com> wrote:
> > Hi,
> >
> > I'm working on the I/O documentation, and have a bunch of questions.
> >
> > 1. The npy/npz formats are documented in lib.format and in the NEP
> > (http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt). Is
> > lib.format the right place to add relevant parts of the NEP, or would
> > doc.io be better? Or create a separate page (maybe doc.npy_format)? And
> > is the .npz format fixed or still in flux?
> >
> > 2. Is the .npy format version number (now at 1.0) independent of the
> > numpy version numbering, when is it incremented, and will it be
> > backwards compatible?
> >
> > 3. For a longer coherent overview of I/O, does that go in doc.io or
> > routines.io.rst?
> >
> > 4. This page http://www.scipy.org/Data_sets_and_examples talks about
> > including data sets with scipy. Has this happened? Would it be possible
> > to include a single small dataset in numpy for use in examples?
> >
> > 5. DataSource contains a lot of TODOs and behavior that is documented as
> > a bug in the docstring. Is anyone working on this? If not, I can give it
> > a go.
>
> This was proposed as a GSoC project and I went through it, but that's
> about all I know.  I can't find my notes now, but here are some
> thoughts off the top of my head.  The code is here for the record
> <http://svn.scipy.org/svn/numpy/trunk/numpy/lib/_datasource.py>
>
> > TODOs that need work, or at least a yes/no decision:
> > 5a. .zip and .tar support (is .tar needed?)
>
> Would these be trivial to implement?  And since the import overhead is
> deferred until it's needed I don't see the harm in including the
> support...
>

.zip would be similar to .gz and .bz2. These are all assumed to be single
files, .tar is usually a file archive which needs a different approach.
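
To illustrate the single-file case, here is a rough sketch using only the standard library. The helper name read_single_file is hypothetical and is not part of DataSource; the rule that a .zip must hold exactly one member is my assumption about how "single file" would be enforced. .tar is left out on purpose, since it needs the different approach mentioned above.

```python
import bz2
import gzip
import os
import zipfile


def read_single_file(path):
    """Return the raw bytes of a plain, .gz, .bz2 or single-member .zip file.

    Hypothetical helper for illustration; not part of numpy's DataSource.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext == ".gz":
        with gzip.open(path, "rb") as f:
            return f.read()
    if ext == ".bz2":
        with bz2.open(path, "rb") as f:
            return f.read()
    if ext == ".zip":
        with zipfile.ZipFile(path) as archive:
            names = archive.namelist()
            # Assumption: treat the .zip like .gz/.bz2, i.e. one file inside.
            if len(names) != 1:
                raise ValueError(
                    "expected exactly one member, found %d" % len(names))
            return archive.read(names[0])
    # Anything else is read as an uncompressed file.
    with open(path, "rb") as f:
        return f.read()
```

A .zip with several members raises ValueError here instead of guessing which one was meant.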

>
> > 5b. URLs only work if they include 'http://' (currently documented as a
> > bug, which it is not necessarily. Fix or document?)
>
> I would say document, since we might have any number of protocols, so
> it might not make sense to just default to http://

agreed.
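
As a sketch of what "document" could mean in practice: require an explicit scheme and never guess one. The function name is_url and the list of schemes below are assumptions for illustration, not the current DataSource code.

```python
from urllib.parse import urlparse

# Assumed set of accepted schemes, for illustration only.
KNOWN_SCHEMES = ("http", "https", "ftp")


def is_url(path):
    """Return True only if path has an explicit known scheme and a host."""
    parts = urlparse(path)
    return parts.scheme in KNOWN_SCHEMES and bool(parts.netloc)
```

With this rule 'http://www.scipy.org/data.txt' counts as a URL, while 'www.scipy.org/data.txt' is treated as a local path, which is the behavior the docstring would then describe.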

>
>
> > 5c. _cache() does not handle compressed files, and should use
> > shutil.copyfile
>
> I never understood what this meant, but maybe I'm missing something.
> If path is a compressed file then it is written to a local directory
> as a compressed file.  What else does it need to handle?  Should it be
> fetch archive, extract (single file or archive), cache locally?

Maybe it's about fetching data with gzip-compression, as described here
http://diveintopython.org/http_web_services/gzip_compression.html. I agree
normal read/write should work with compressed data.
For local files, a file copy operation would make more sense than reading
the file and then writing it to a new file anyway.
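
For the local-file case, a minimal sketch of the copy I have in mind. The name cache_local_file and the destdir argument are hypothetical; this is not how _cache() is written today.

```python
import os
import shutil


def cache_local_file(path, destdir):
    """Copy a local file into destdir unchanged and return the new path.

    Hypothetical helper for illustration; not part of numpy's DataSource.
    """
    os.makedirs(destdir, exist_ok=True)
    target = os.path.join(destdir, os.path.basename(path))
    # copyfile copies the bytes as they are, so a compressed file stays
    # compressed in the cache and can be decompressed when it is opened.
    shutil.copyfile(path, target)
    return target
```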

>
>
> > 5d. make abspath() more robust
> > 5e. in open(), support for creating files and adding a 'subdir' parameter
> > (needed?)
> >
>
> I would think there should be support for both of these.  I have some
> rough scripts that I used for remote data fetching and I like it to
> create a ./tmp directory and cache the file there and then clean up
> after myself when I'm done.
>
> > Does anyone have (self-contained) code using DataSource, or a suggestion
> > for data on the web that can be used in examples?
> >
>
> I'm not sure if this is what you're after, but I've been using some of
> these "classic published results" and there are some compressed
> archives.
>
> http://www.stanford.edu/~clint/bench/

Something like this, but we cannot rely on such a site to stay where it is
forever. Maybe it makes sense to put some files on scipy.org and use those.
We could for example use data from http://data.un.org/ , the usage terms
allow that and it's high-quality data. For example data on energy usage
(total, fossil and alternative sources for different countries).

Longer term the data sets from the learn scikit that Robert pointed out may
make it into numpy.