<br><br><div class="gmail_quote">On Sat, Jun 20, 2009 at 6:24 PM, Skipper Seabold <span dir="ltr"><<a href="mailto:jsseabold@gmail.com">jsseabold@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

On Sat, Jun 20, 2009 at 5:33 PM, Ralf<br>

<div class="im">Gommers<<a href="mailto:ralf.gommers@googlemail.com">ralf.gommers@googlemail.com</a>> wrote:<br>

</div><div class="im">> Hi,<br>

><br>

> I'm working on the I/O documentation, and have a bunch of questions.<br>

><br>

> 1. The npy/npz formats are documented in lib.format and in the NEP<br>

> (<a href="http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt" target="_blank">http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt</a>). Is<br>

> lib.format the right place to add relevant parts of the NEP, or would <a href="http://doc.io" target="_blank">doc.io</a><br>

> be better? Or create a separate page (maybe doc.npy_format)? And is the .npz<br>

> format fixed or still in flux?<br>

><br>

> 2. Is the .npy format version number (now at 1.0) independent of the numpy<br>

> version numbering, when is it incremented, and will it be backwards<br>

> compatible?<br>

><br>

> 3. For a longer coherent overview of I/O, does that go in <a href="http://doc.io" target="_blank">doc.io</a> or<br>

> routines.io.rst?<br>

><br>

> 4. This page <a href="http://www.scipy.org/Data_sets_and_examples" target="_blank">http://www.scipy.org/Data_sets_and_examples</a> talks about<br>

> including data sets with scipy, has this happened? Would it be possible to<br>

> include a single small dataset in numpy for use in examples?<br>

><br>

> 5. DataSource contains a lot of TODOs and behavior that is documented as a<br>

> bug in the docstring. Is anyone working on this? If not, I can give it a go.<br>

<br>

</div>This was proposed as a GSoC project and I went through it, but that's<br>

about all I know.  I can't find my notes now, but here are some<br>

thoughts off the top of my head.  The code is here for the record<br>

<<a href="http://svn.scipy.org/svn/numpy/trunk/numpy/lib/_datasource.py" target="_blank">http://svn.scipy.org/svn/numpy/trunk/numpy/lib/_datasource.py</a>><br>

<div class="im"><br>

> TODOs that need work, or at least a yes/no decision:<br>

> 5a. .zip and .tar support (is .tar needed?)<br>

<br>

</div>Would these be trivial to implement?  And since the import overhead is<br>

deferred until it's needed I don't see the harm in including the<br>

support...<br>

<div class="im"></div></blockquote><div><br>.zip would be similar to .gz and .bz2, which are all assumed to be single files. .tar is usually a file archive, which needs a different approach.<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im"><br>

> 5b. URLs only work if they include 'http://' (currently documented as a bug,<br>

> which it is not necessarily. Fix or document?)<br>

<br>

</div>I would say document, since we might have any number of protocols, so<br>

it might not make sense to just default to http://</blockquote><div><br>agreed.  <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<br>

<div class="im"><br>

> 5c. _cache() does not handle compressed files, and should use<br>

> shutil.copyfile<br>

<br>

</div>I never understood what this meant, but maybe I'm missing something.<br>

If path is a compressed file then it is written to a local directory<br>

as a compressed file.  What else does it need to handle?  Should it be<br>

fetch archive, extract (single file or archive), cache locally?</blockquote><div><br>Maybe it's about fetching data with gzip-compression, as described here <a href="http://diveintopython.org/http_web_services/gzip_compression.html">http://diveintopython.org/http_web_services/gzip_compression.html</a>. I agree normal read/write should work with compressed data.<br>
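<br>For single-file .gz and .bz2 data, reading through the compression could be sketched roughly as below. This assumes the opener is picked from the file extension; open_maybe_compressed and the _OPENERS table are hypothetical names for illustration, not DataSource's actual code:<br>
<pre>
```python
import bz2
import gzip
import os

# Hypothetical extension table for illustration; not DataSource's real one.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_maybe_compressed(path, mode="rt"):
    """Open path, decompressing on the fly when the extension is known."""
    ext = os.path.splitext(path)[1]
    # Fall back to the builtin open() for anything that is not .gz or .bz2.
    opener = _OPENERS.get(ext, open)
    return opener(path, mode)
```
</pre>
With this, callers read text the same way whether the file on disk is plain, gzip-compressed or bzip2-compressed.<br>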

For local files, a file copy operation would make more sense than reading the file and then writing it to a new file anyway.<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<br>

<div class="im"><br>

> 5d. make abspath() more robust<br>

> 5e. in open(), support for creating files and adding a 'subdir' parameter<br>

> (needed?)<br>

><br>

<br>

</div>I would think there should be support for both of these.  I have some<br>

rough scripts that I used for remote data fetching and I like it to<br>

create a ./tmp directory and cache the file there and then clean up<br>

after myself when I'm done.<br>
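<br>
A minimal sketch of that cache-then-clean-up pattern, assuming a local file copy stands in for the remote fetch and a fresh temporary directory stands in for ./tmp. temp_cache is a hypothetical helper, not part of DataSource:<br>
<pre>
```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_cache(source_path):
    """Copy source_path into a fresh temp directory and yield the cached path.

    The directory and everything in it are removed when the block exits.
    """
    tmpdir = tempfile.mkdtemp(prefix="datasource_")
    try:
        cached = os.path.join(tmpdir, os.path.basename(source_path))
        # Plain file copy; a real version would fetch from the URL here.
        shutil.copyfile(source_path, cached)
        yield cached
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)
```
</pre>
Used as "with temp_cache(path) as cached:", the cached copy only exists for the duration of the block.<br>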

<div class="im"><br>

> Does anyone have (self-contained) code using DataSource, or a suggestion for<br>

> data on the web that can be used in examples?<br>

><br>

<br>

</div>I'm not sure if this is what you're after, but I've been using some of<br>

these "classic published results" and there are some compressed<br>

archives.<br>

<br>

<a href="http://www.stanford.edu/%7Eclint/bench/" target="_blank">http://www.stanford.edu/~clint/bench/</a></blockquote><div><br>Something like this, but we cannot rely on such a site staying where it is forever. Maybe it makes sense to put some files on <a href="http://scipy.org">scipy.org</a> and use those. We could, for example, use data from <a href="http://data.un.org/">http://data.un.org/</a>; the usage terms allow that and it's high-quality data. One candidate is the data on energy usage (total, fossil and alternative sources for different countries).<br>

<br>Longer term, the data sets from the learn scikit that Robert pointed out may make it into numpy.<br><br></div></div><br>