[Numpy-discussion] Single precision equivalents of missing C99 functions

Tue Jun 2 04:58:01 EDT 2009

On Monday 01 June 2009 20:26:27 Charles R Harris wrote:
> > I suppose that the NumPy crew already experienced this divergence and
> > finally used the cast approach for computing the single precision
> > functions.
>
> It was inherited and was no doubt the simplest approach at the time. It has
> always bothered me a bit, however, and if you have good single/long double
> routines we should look at including them. It will affect the build so
> David needs to weigh in here.

Well, writing those routines is mostly a matter of copying the double 
precision code and replacing 'double' with 'float' and 'long double'.  
That's all.

> However,
>
> > this is effectively preventing the use of optimized functions for single
> > precision (i.e. double precision 'exp' and 'log' are used instead of
> > single precision specific 'expf' and 'logf'), which could perform
> > potentially better.
>
> That depends on the architecture and how fast single vs double computations
> are. I don't know how the timings compare on current machines.

I've run some benchmarks myself (using `expm1f()`), and the speedup from 
using the native (but naive) single precision implementation is a mere 6% on 
Linux.  However, I don't expect any speedup at all on Windows on Intel 
processors, as the single precision functions in this scenario are simply 
defined as macros that cast to and from the double precision ones (i.e. the 
current NumPy approach).  Looking at the math.h header file for MSVC 9, it 
seems that some architectures like AMD64 may have an advantage here (the 
single precision functions are not simply #define wrappers), but I don't have 
access to this architecture/OS combination.
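
For readers unfamiliar with the cast approach, a minimal sketch of what such 
a wrapper macro could look like is given here.  The macro names cast_expf and 
cast_logf are hypothetical and the exact text in any vendor's math.h may 
differ; the point is only that the work is still done by the double precision 
routine.

```c
#include <math.h>

/* Hypothetical wrapper macros: promote the argument to double, call the
 * double precision routine, then round the result back to float. */
#define cast_expf(x) ((float)exp((double)(x)))
#define cast_logf(x) ((float)log((double)(x)))
```

With a definition like this, calling cast_expf(x) costs at least as much as 
calling exp(x), so no speedup over the double precision path can be expected.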

> > So, I'm wondering if it would not be better to use a native
> > implementation instead.  Thoughts?
>
> Some benchmarks would be interesting. Could this be part of the corepy GSOC
> project?

From a performance point of view, and given that the speedups are not very 
noticeable (at least on the architecture I tested, namely Intel Core2), I 
don't think this would be too interesting.  A much better avenue for people 
who really want high speed is to link against Intel MKL or AMD ACML.  For 
comparison, the MKL implementation of expm1f() takes just around 33 
cycles/item and is 2x faster than the Linux/GCC implementation, and 5x faster 
than the simple (and naive :) implementation that NumPy uses on non-POSIX 
platforms that do not provide an `expm1()` function (like Windows/MSVC 9).

All in all, I don't think that bothering about this would be worth the effort.  
So, I'll let Numexpr behave exactly as NumPy does in this matter (if users 
need speed on Intel platforms they can always link it with MKL).

Thanks for the feedback anyway,

-- 
Francesc Alted