[AstroPy] reading one line from many small fits files

Fri Aug 3 18:06:17 EDT 2012

Hi :)  Inline replies below:

On 08/03/2012 05:25 PM, Erin Sheldon wrote:
> On Fri, Aug 3, 2012 at 1:48 PM, Erik Bray<embray at stsci.edu>  wrote:
>> A few additional trials gave roughly the same results.  I'm also less
>> astonished by the speed differences, simply in that fitsio wraps
>> CFITSIO, a C library, while much of PyFITS is pure Python.  Looking at
>> the profile, it spends about 2/5th of the time just opening the file and
>> creating objects for the Header and HDU structures.  There are some more
>> micro-optimizations to be made there, but not much.  PyFITS provides a
>> very flexible and extensible object-oriented interface that simply isn't
>> possible with CFITSIO, but there's a tradeoff there in terms of raw
>> performance, since it's all in pure Python.  For example, in this
>> benchmark, PyFITS spends over half a second (cumulatively, under the
>> profiler) just on the routine for determining which HDU subclass to
>> initialize based on the header keywords--CFITSIO has no equivalent
>> routine because it doesn't even care what the HDU type is until you try
>> to read some data.  And even then the only real distinction it tries to
>> make is, "Is this an image or a table?"
>
> Hi Erik -
>
> Actually I *do* run through all the HDUs and gather all the metadata,
> even in John's case of reading a single row and column from a particular
> HDU.  It just turns out that CFITSIO provides efficient routines to
> access this information, so it doesn't incur significant overhead.  It
> might be that the efficiency in the CFITSIO routines could be translated
> to python and used in PyFits.

Okay, I wasn't entirely sure about that. Or rather, I haven't delved too 
deeply into fitsio itself, so I was just going by what I know about 
CFITSIO.  That said, fitsio has a much simpler object interface--there 
is a single FitsHDU class that covers images and binary and ASCII 
tables.  This is because, to my knowledge, those are the only major 
distinctions CFITSIO makes, aside from special cases like group HDUs 
(which are just a special case of binary tables).

PyFITS on the other hand has a rather complex class hierarchy for 
different HDU types.  This has some advantages, in that it keeps the 
code fairly simple and reusable.  It has also made it easy to support 
special HDU types in a seamless manner.  For example, it was easy to add 
HDU subclasses to support constant value image arrays used for HST data. 
  And more recently, it made it easy to add new classes for special HDUs 
that encapsulate FITS files, which could be further subclassed and 
specialized for "Headerlet" HDUs meant to encapsulate WCS data as a FITS 
file encapsulated in a FITS file.

All of the above *could* be done by tacking pure Python frontends to 
CFITSIO, and would probably still be a little bit faster.  But not 
*much* faster for the same degree of flexibility.

There are several other high-level abstractions that PyFITS provides. 
For example, support for WCS record-valued cards.  This feature has a 
surprising amount of overhead, and although I think it can be improved a 
little bit, it falls into the category of "micro-optimizations".

>> Where PyFITS really takes a big hit performance-wise is in the handling
>> of table columns, and, as Perry mentioned, the conversion from the raw
>> data to Python data types like bools and strings.  As I wrote earlier in
>> this thread, the biggest problem is that PyFITS' design has always been
>> optimized for column-based access, and is horribly inefficient for
>> row-based access, since the latter usually involves reading entire
>> columns into memory anyways.  The reason for this is mostly
>> historical--PyFITS' table interface is built on top of Numpy's recarray
>> object, which I think is pretty flawed to begin with.  At the time this
>> was necessary because PyFITS did not yet support compound dtypes in its
>> normal ndarrays.  At least I think that was the issue.  But now it seems
>> to be more of a hindrance.
>
> This is a fundamental limitation for structured numpy arrays and the
> memmap interface.

Indeed--exactly my point.  If were to write PyFITS from scratch I would 
probably not rely on Numpy as my primary I/O interface, but would 
instead serve up Numpy arrays as a high-level abstraction.  Under the 
hood I would probably do something more closely resembling CFITSIO's 
buffer ring (the returned Numpy arrays would of course not be contiguous 
by default).

> I developed a C code to get around this limitation and added it to
> numpy.  This can be used by pyfits for tables (but not variable length
> columns). Unfortunately that project got a bit hijacked by other
> developers because they want the ascii reading portion of the code to do
> type inference. This has dramatically increased the scope of the project
> and delayed things indefinitely.  I am no longer involved, but it is my
> understanding that this functionality should be available in an upcoming
> version of numpy.
>
> -e

You'll have to tell me more about exactly what this Numpy development 
is--if it's something I can still incorporate into PyFITS (nevermind 
what the Numpy people are doing with it) it could be very helpful.  We 
can take that discussion offlist if you want.

Thanks,

Erik