[Numpy-discussion] Fwd: Numexpr-3.0 proposal

Francesc Alted faltet at gmail.com
Tue Feb 16 04:52:00 EST 2016


2016-02-16 10:04 GMT+01:00 Robert McLeod <robbmcleod at gmail.com>:

> On Mon, Feb 15, 2016 at 10:43 AM, Gregor Thalhammer <
> gregor.thalhammer at gmail.com> wrote:
>
>>
>> Dear Robert,
>>
>> thanks for your effort on improving numexpr. Indeed, vectorized math
>> libraries (VML) can give a large boost in performance (~5x), except for a
>> couple of basic operations (add, mul, div), which current compilers are
>> able to vectorize automatically. With recent gcc even more functions are
>> vectorized, see https://sourceware.org/glibc/wiki/libmvec. But you need
>> special flags depending on the platform (SSE, AVX present?); runtime
>> detection of processor capabilities would be nice for distributing
>> binaries. Some time ago, after I lost access to Intel's MKL, I patched
>> numexpr to use Accelerate/vecLib on OS X, which comes preinstalled on
>> every Mac; see the veclib_support branch of
>> https://github.com/geggo/numexpr.git.
>>
>> As you increased the opcode size, I could imagine providing a bit to
>> switch (at runtime) between the internal functions and the vectorized
>> ones; that would be handy for tests and benchmarks.
>>
>
> Dear Gregor,
>
> Your suggestion to separate the opcode signature from the library used to
> execute it is very clever. Based on your suggestion, I think that the
> natural evolution of the opcodes is to specify them by function signature
> and library, using a two-level dict, i.e.
>
> numexpr.interpreter.opcodes['exp_f8f8f8'][gnu] = some_enum
> numexpr.interpreter.opcodes['exp_f8f8f8'][msvc] = some_enum + 1
> numexpr.interpreter.opcodes['exp_f8f8f8'][vml] = some_enum + 2
> numexpr.interpreter.opcodes['exp_f8f8f8'][yeppp] = some_enum + 3
>

Yes, by using a two-level dictionary you can access the functions
implementing the opcodes much faster, and hence add many more opcodes
without too much slowdown.
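The proposed two-level table can be sketched in plain Python; the `lookup`
helper and the concrete enum values below are illustrative assumptions, not
the actual numexpr internals:

```python
from itertools import count

_enum = count()

# First key: function signature; second key: backing C math library.
opcodes = {
    'exp_f8f8f8': {
        'gnu':   next(_enum),   # some_enum
        'msvc':  next(_enum),   # some_enum + 1
        'vml':   next(_enum),   # some_enum + 2
        'yeppp': next(_enum),   # some_enum + 3
    },
}

def lookup(signature, lib='gnu'):
    """Resolve an opcode enum from (signature, library) in two O(1) hops."""
    return opcodes[signature][lib]

print(lookup('exp_f8f8f8', 'vml'))  # -> 2
```

Adding a new opcode or a new backend library is then just another dict
entry, rather than a new #define in the interpreter.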


>
> I want to procedurally generate opcodes.cpp and interpreter_body.cpp.  If
> I do it the way you suggested, funccodes.hpp and the many #defines for
> function codes in the interpreter can hopefully be removed, simplifying
> the overall codebase. One could potentially take it a step further and
> plan (optimize) each expression, similar to what FFTW does with regard to
> transform shape. That is, the basic way to control the library would be
> with a singleton library argument, i.e.:
>
> result = ne.evaluate( "A*log(foo**2 / bar**2)", lib=vml )
>
> However, we could also permit a tuple to be passed in, where each element
> of the tuple reflects the library to use for each operation in the AST tree:
>
> result = ne.evaluate( "A*log(foo**2 / bar**2)",
>                       lib=(gnu, gnu, gnu, yeppp, gnu) )
>
> In this case the ops are (mul, mul, div, log, mul).  The opcode picking is
> done on the Python side, and this tuple could potentially be optimized by
> numexpr rather than hand-optimized, by trying various permutations of the
> linked C math libraries. The wisdom from the planning could be pickled and
> saved in a wisdom file.  Currently numexpr has a cacheDict in util.py, but
> there's no reason it can't be pickled and saved to disk. I've already done
> something similar with wrappers for PyFFTW.
>
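A minimal sketch of persisting such planning "wisdom" to disk, assuming a
plain dict keyed by (expression, per-op library tuple); the file name and
the timing value are made up for illustration and are not numexpr API:

```python
import os
import pickle
import tempfile

# Hypothetical wisdom cache: best measured runtime per plan (illustrative).
wisdom = {
    ("A*log(foo**2 / bar**2)", ('gnu', 'gnu', 'gnu', 'yeppp', 'gnu')): 1.7e-4,
}

path = os.path.join(tempfile.mkdtemp(), "numexpr_wisdom.pkl")
with open(path, "wb") as f:
    pickle.dump(wisdom, f)        # persist planning results

with open(path, "rb") as f:
    restored = pickle.load(f)     # reload on the next interpreter start

assert restored == wisdom
print("wisdom entries:", len(restored))  # -> wisdom entries: 1
```

This mirrors FFTW's export/import wisdom mechanism: pay the planning cost
once, then reuse the plan across runs.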

I like the idea of having numexpr probe various permutations of the linked
C math libraries during the initial iteration and then cache the result
somehow. That will probably require run-time detection of the available C
math libraries (a numexpr binary should be able to run on different
machines with different libraries and computing capabilities), but in
exchange it will allow the fastest execution paths regardless of the
machine that runs the code.
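A hedged sketch of that run-time detection, probing for candidate shared
libraries with ctypes.util.find_library; the candidate names below are
guesses for illustration, not what numexpr actually links against:

```python
import ctypes.util

# Illustrative candidates only: MKL runtime (VML), glibc libmvec,
# and Accelerate/vecLib on OS X.
CANDIDATES = {
    'vml':     'mkl_rt',
    'libmvec': 'mvec',
    'veclib':  'vecLib',
}

def detect_libs():
    """Return the subset of candidate math libraries found on this system."""
    found = {}
    for name, soname in CANDIDATES.items():
        path = ctypes.util.find_library(soname)
        if path is not None:
            found[name] = path
    return found

print(sorted(detect_libs()))  # machine-dependent
```

The result could seed the set of libraries the planner is allowed to try,
so a single binary adapts to whatever is installed on the host.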

-- 
Francesc Alted

