[AstroPy] reading one line from many small fits files
Erik Bray
embray at stsci.edu
Fri Aug 3 13:48:15 EDT 2012
On 08/02/2012 11:40 PM, John K. Parejko wrote:
> Follow up on this:
>
> Erin's suggestion to use fitsio gave me a factor of more than 10
> improvement in speed. I was quite astonished at how much faster it was,
> so I've written up a short example, and attached it. On my laptop (13"
> macbook pro, OS X 10.6.8, regular HDD), the code produces the following:
>
> $ python fits_tester.py
> fitsio version: 0.9.0
> pyfits version: 3.0.6
> Single pass: fitsio took 1.14109 seconds.
> Single pass: pyfits took 14.64361 seconds.
>
> One of the problems with the pyfits version is that I don't know how to
> efficiently get at row(n) of a pyfits object in a form that can be
> directly ingested into an ndarray. If there is a way to make the pyfits
> version significantly faster just by calling pyfits differently, I'm all
> ears.
>
> Looking at the profiles for the runs (output to .prof files), it looks
> like pyfits is doing a lot of object creation and destruction in the
> background, which may be what's killing it.
>
> Anyway, there does seem to be a major difference in speed here, even in
> what is probably the most favorable configuration for pyfits, with it
> running last and thus having files potentially cached.
>
> Assuming this difference isn't just me, is there a way to get these speed
> improvements merged into pyfits?
>
> John
Thanks John for this benchmarking--this is very helpful. For what it's
worth, a lot of improvements have been made since PyFITS 3.0.6, and
these are the results I'm getting on my end:
fitsio version: 0.9.0
pyfits version: 3.1.dev
Single pass: fitsio took 1.62691 seconds.
Single pass: pyfits took 7.50556 seconds.
A few additional trials gave roughly the same results. I'm also less
astonished by the speed difference, simply because fitsio wraps
CFITSIO, a C library, while much of PyFITS is pure Python. According to
the profile, PyFITS spends about two-fifths of its time just opening the
file and creating objects for the Header and HDU structures. There are
some more micro-optimizations to be made there, but not many. PyFITS provides a
very flexible and extensible object-oriented interface that simply isn't
possible with CFITSIO, but there's a tradeoff there in terms of raw
performance, since it's all in pure Python. For example, in this
benchmark, PyFITS spends over half a second (cumulatively, under the
profiler) just on the routine for determining which HDU subclass to
initialize based on the header keywords--CFITSIO has no equivalent
routine because it doesn't even care what the HDU type is until you try
to read some data. And even then the only real distinction it tries to
make is, "Is this an image or a table?"
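To make the cost concrete, here is a minimal sketch of what choosing an HDU type from header keywords involves. It is not PyFITS's actual routine: the function name `guess_hdu_kind` and the returned labels are made up for illustration, and a plain dict stands in for a parsed header. Only the keywords themselves (SIMPLE, GROUPS, XTENSION, ZIMAGE) come from the FITS conventions.

```python
# Hypothetical sketch: not PyFITS's real dispatch code. A plain dict
# stands in for a parsed header, and the labels are illustrative only.
def guess_hdu_kind(header, is_primary):
    """Return a coarse HDU kind based on a few header keywords."""
    if is_primary:
        # Random-groups files set GROUPS = T and NAXIS1 = 0.
        if header.get("GROUPS") is True and header.get("NAXIS1") == 0:
            return "groups"
        return "primary"

    xtension = str(header.get("XTENSION", "")).strip().upper()
    if xtension == "IMAGE":
        return "image"
    if xtension == "TABLE":
        return "ascii_table"
    if xtension == "BINTABLE":
        # Tile-compressed images are stored inside binary tables and
        # flagged with ZIMAGE = T, so one more keyword must be checked.
        if header.get("ZIMAGE") is True:
            return "compressed_image"
        return "binary_table"
    return "unknown"


kinds = [
    guess_hdu_kind({"SIMPLE": True, "NAXIS": 0}, is_primary=True),
    guess_hdu_kind({"XTENSION": "BINTABLE"}, is_primary=False),
    guess_hdu_kind({"XTENSION": "BINTABLE", "ZIMAGE": True}, is_primary=False),
]
```

Each check is cheap on its own, but in pure Python the lookups and comparisons are repeated for every HDU of every file, which is how they add up under the profiler.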
So in simply opening files you'll always get better performance with
fitsio. That said, when I amend the benchmark to just open files and
read the headers (without touching the data) fitsio is only about three
times faster. Still a big difference when dealing with a lot of files,
but far less dramatic.
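For anyone who wants to repeat the headers-only comparison, a rough harness might look like the following. This is a sketch, not the amended benchmark itself: `read_primary_header`, `time_headers_only` and `make_header_bytes` are hypothetical names, and the pure-Python reader is only a stand-in baseline. To compare libraries, one would pass in a callable that wraps whichever header-reading call each library provides.

```python
import io
import time

BLOCK = 2880  # FITS files are laid out in 2880-byte blocks
CARD = 80     # each header card is 80 ASCII characters


def read_primary_header(fileobj):
    """Return the raw cards of the first header, stopping at END."""
    cards = []
    while True:
        block = fileobj.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("truncated FITS header")
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            if card.rstrip() == "END":
                return cards
            cards.append(card)


def time_headers_only(sources, reader=read_primary_header):
    """Time one pass of `reader` over all sources; data is never touched."""
    start = time.perf_counter()
    for src in sources:
        reader(src)
    return time.perf_counter() - start


def make_header_bytes(cards):
    """Build a padded in-memory header so the demo needs no files on disk."""
    text = "".join(c.ljust(CARD) for c in cards + ["END"])
    text += " " * (-len(text) % BLOCK)
    return text.encode("ascii")


demo = make_header_bytes(["SIMPLE  =                    T",
                          "BITPIX  =                    8",
                          "NAXIS   =                    0"])
elapsed = time_headers_only([io.BytesIO(demo) for _ in range(1000)])
print("Headers only: stand-in reader took %.5f seconds." % elapsed)
```

With real files, the sources would be open file objects or paths, depending on what the reader callable expects.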
Where PyFITS really takes a big hit performance-wise is in the handling
of table columns, and, as Perry mentioned, the conversion from the raw
data to Python data types like bools and strings. As I wrote earlier in
this thread, the biggest problem is that PyFITS' design has always been
optimized for column-based access, and is horribly inefficient for
row-based access, since the latter usually involves reading entire
columns into memory anyway. The reason for this is mostly
historical--PyFITS' table interface is built on top of Numpy's recarray
object, which I think is pretty flawed to begin with. At the time this
was necessary because Numpy did not yet support compound dtypes in its
normal ndarrays. At least I think that was the issue. But now it seems
to be more of a hindrance.
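The row-versus-column distinction can be sketched with nothing but the standard library. FITS binary tables store rows contiguously in big-endian byte order, so one row can in principle be fetched with a single seek, while one column requires walking every row. The code below is an illustration under simplifying assumptions (fixed-width numeric fields, a known data offset, and a made-up two-column layout); it is not how PyFITS or fitsio implement table access, and `read_row` and `read_column` are hypothetical helpers.

```python
import io
import struct


def read_row(fileobj, data_offset, row_format, row_index):
    """Read a single fixed-width row by seeking straight to it."""
    row = struct.Struct(row_format)
    fileobj.seek(data_offset + row_index * row.size)
    return row.unpack(fileobj.read(row.size))


def read_column(fileobj, data_offset, row_format, n_rows, col_index):
    """Read one field from every row; this touches the whole table."""
    row = struct.Struct(row_format)
    fileobj.seek(data_offset)
    buf = fileobj.read(row.size * n_rows)
    return [fields[col_index] for fields in row.iter_unpack(buf)]


# Made-up table: five rows of (int32, float64), big-endian, placed
# after one 2880-byte block standing in for the header.
ROW_FORMAT = ">id"
table = io.BytesIO(
    b" " * 2880
    + b"".join(struct.pack(ROW_FORMAT, i, i * 0.5) for i in range(5))
)

third_row = read_row(table, 2880, ROW_FORMAT, 3)
first_col = read_column(table, 2880, ROW_FORMAT, 5, 0)
```

Reading row 3 here costs one seek and 12 bytes no matter how long the table is. An interface that materializes whole columns first pays for every row even when only one is wanted.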
In any case, I'm glad fitsio is available too. It's clear from this
experiment that in cases where reading and parsing FITS files is the
major bottleneck, it's probably the way to go for now. I don't know how
much time there will be going forward to devote to improving PyFITS in
this regard.
Erik