Mailman 3 ANN: NumExpr3 Alpha - NumPy-Discussion

Feb. 17, 2017

Hi everyone,

I'm pleased to announce that a new branch of NumExpr has been developed
that will hopefully lead to a new major version release in the future.  You
can find the branch on the PyData GitHub repository, and installation is as
follows:

git clone https://github.com/pydata/numexpr.git
cd numexpr
git checkout numexpr-3.0
python setup.py install

What's new?
==========

Faster
---------

The operations were rewritten so that gcc can auto-vectorize the loops to
use SIMD instructions. Each operation now has a strided and an aligned
branch, which improves performance on aligned arrays by ~40%. The setup
time for threads has been reduced by removing an unnecessary abstraction
layer and by various other minor refactorings, resulting in improved
thread scaling.
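
To see which of your own arrays are contiguous and aligned versus strided, you can inspect the NumPy flags. This is only a sketch of the distinction: how NE3 actually selects between its two branches is not shown here, and `describe_layout` is a hypothetical helper, not part of NumExpr.

```python
import numpy as np


def describe_layout(arr):
    """Hypothetical helper: label an array by its memory layout."""
    # Assumption: a C-contiguous, aligned array is the favourable case
    # for auto-vectorized loops; everything else is treated as strided.
    if arr.flags.c_contiguous and arr.flags.aligned:
        return "contiguous+aligned"
    return "strided"


a = np.arange(1024, dtype=np.float64)

layouts = {
    "a": describe_layout(a),                        # freshly allocated array
    "a[::2]": describe_layout(a[::2]),              # every second element
    "a.reshape(32, 32).T": describe_layout(a.reshape(32, 32).T),  # transposed view
}

for name, layout in layouts.items():
    print(f"{name:>20}: {layout}")
```

Views such as slices with a step or transposes share memory with the original array but are no longer C-contiguous, so they land in the second category.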

The combination of speed-ups means that NumExpr3 often runs 200-500%
faster than NumExpr 2.6 on a machine with AVX2 support. The break-even
point with NumPy is now at arrays of roughly 64k elements, compared to
256-512k elements for NE2.

Plots of comparative performance for NumPy versus NE2 versus NE3 over a
range of array sizes are available at:

http://entropyproduction.blogspot.ch/2017/02/introduction-to-numexpr-3-alpha.html

More NumPy Datatypes
--------------------------------

The program was refactored from an ASCII-encoded byte code to a struct
array, so that the operation space is now 65535 instead of 128.  As such,
support for uint8, int8, uint16, int16, uint32, uint64, and complex64 data
types was added.

NumExpr3 now uses NumPy 'safe' casting rules. If an operation doesn't
return the same result as NumPy, it's a bug.  In the future other casting
styles will be added if there is a demand for them.
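
For reference, NumPy itself can tell you what 'safe' casting permits. The sketch below uses only `np.can_cast` and `np.result_type` from NumPy; it does not call NumExpr, and the dtype pairs are picked purely for illustration.

```python
import numpy as np

# 'safe' casting only allows conversions that cannot lose information.
pairs = [
    (np.uint8, np.int16),       # widening an unsigned int into a larger signed int
    (np.int8, np.uint8),        # negative values would be lost
    (np.int32, np.float64),     # every int32 is exactly representable
    (np.float64, np.float32),   # precision would be lost
    (np.complex64, np.float32), # the imaginary part would be lost
]

safe = {
    (src.__name__, dst.__name__): np.can_cast(src, dst, casting="safe")
    for src, dst in pairs
}

for (src, dst), ok in safe.items():
    print(f"{src:>10} -> {dst:<8} safe: {ok}")

# Mixing dtypes in one operation promotes to a type that can hold both.
mixed_ints = np.result_type(np.uint8, np.int8)
mixed_float = np.result_type(np.int32, np.float32)
print(mixed_ints, mixed_float)
```

Comparing the result dtype of a NumPy expression against the dtype NE3 returns is a quick way to check for the kind of mismatch described above.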

More complete function set
------------------------------------

With the enhanced operation space, almost the entire C++11 cmath function
set is supported (if the compiler library has them; only C99 is expected).
Also bitwise operations were added for all integer datatypes. There are now
436 operations/functions in NE3, with more to come, compared to 190 in NE2.

Also a library-enum has been added to the op keys which allows multiple
backend libraries to be linked to the interpreter, and changed on a
per-expression basis, rather than picking between GNU std and Intel VML at
compile time, for example.

More complete Python language support
------------------------------------------------------

The Python compiler was re-written from scratch to use the CPython `ast`
module and a functional programming approach. As such, NE3 now compiles a
wider subset of the Python language. It supports multi-line evaluation, and
assignment with named temporaries.  The new compiler spends considerably
less time in Python to compile expressions, about 200 us for 'a*b' compared
to 550 us for NE2.
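
As a rough picture of what an `ast`-based front end works with, here is a minimal sketch using only the standard-library `ast` module. It is not the NE3 compiler; `summarize` is a hypothetical helper that just lists the operands, binary operators, and function calls found in an expression.

```python
import ast


def summarize(expr):
    """Hypothetical helper: list operands, binary ops and calls in expr."""
    tree = ast.parse(expr, mode="eval")
    nodes = list(ast.walk(tree))

    # Names of called functions, e.g. 'exp' in 'exp(a)'.
    calls = sorted({
        n.func.id for n in nodes
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
    })
    # Every other bare name is treated as an array operand.
    names = sorted({
        n.id for n in nodes if isinstance(n, ast.Name)
    } - set(calls))
    # Binary operators by their AST class name, e.g. 'Mult' for '*'.
    ops = [type(n.op).__name__ for n in nodes if isinstance(n, ast.BinOp)]
    return names, ops, calls


print(summarize("a*b"))
print(summarize("exp(a) + b"))
```

A compiler built this way walks the same tree, but emits an operation for each node instead of collecting names.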

Compare for example:

    out_ne2 = ne2.evaluate( 'exp( -sin(2*a**2) - cos(2*b**2) - 2*a**2*b**2 )' )

to:

    neObj = NumExpr( '''a2 = a*a; b2 = b*b
    out_magic = exp( -sin(2*a2) - cos(2*b2) - 2*a2*b2 )''' )

This is a contrived example but the multi-line approach will allow for
cleaner code and more sophisticated algorithms to be encapsulated in a
single NumExpr call. The convention is that intermediate assignment targets
are named temporaries if they do not exist in the calling frame, and full
assignment targets if they do, which provides a method for multiple
returns. Single-level de-referencing (e.g. `self.data`) is also supported
for increased convenience and cleaner code. Slicing still needs to be
performed above the ne3.evaluate() or ne3.NumExpr() call.
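
The temporary-versus-output convention can be sketched with the standard-library `ast` module. This is an illustration of the rule as stated above, not NE3's implementation; `classify_targets` and the `frame_names` set (standing in for the calling frame) are hypothetical.

```python
import ast


def classify_targets(source, frame_names):
    """Hypothetical helper: split assignment targets into temporaries/outputs.

    A target that already exists in the calling frame is an output;
    any other target is a named temporary.
    """
    tree = ast.parse(source)  # default mode accepts multiple statements
    temporaries, outputs = [], []
    for stmt in tree.body:
        if not isinstance(stmt, ast.Assign):
            continue
        for target in stmt.targets:
            if not isinstance(target, ast.Name):
                continue  # attribute targets etc. are ignored in this sketch
            if target.id in frame_names:
                outputs.append(target.id)
            else:
                temporaries.append(target.id)
    return temporaries, outputs


source = (
    "a2 = a*a; b2 = b*b\n"
    "out_magic = exp( -sin(2*a2) - cos(2*b2) - 2*a2*b2 )"
)

# Pretend 'a', 'b' and 'out_magic' exist in the caller's namespace.
temporaries, outputs = classify_targets(source, {"a", "b", "out_magic"})
print("temporaries:", temporaries)
print("outputs:", outputs)
```

If `out_magic` were not pre-allocated in the calling frame, this sketch would report it as a temporary as well.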

More maintainable
-------------------------

The code base was generally refactored to increase the prevalence of
single-point declarations, such that modifications don't require extensive
knowledge of the code. In NE2 a lot of code was generated by the
pre-processor using nested #defines.  That has been replaced by an
object-oriented Python code generator called by setup.py, which generates
about 15k lines of C code with 1k lines of Python. The use of generated
code with defined line numbers makes debugging threaded code simpler.

The generator also builds the autotest portion of the test submodule, for
checking equivalence between NumPy and NumExpr3 operations and functions.

What's TODO compared to NE2?
------------------------------------------

* strided complex functions
* Intel VML support (less necessary now with gcc auto-vectorization)
* bytes and unicode support
* reductions (mean, sum, prod, std)

What I'm looking for feedback on
--------------------------------------------

* String arrays: How do you use them?  How would unicode differ from bytes
  strings?
* Interface: We now have a more object-oriented interface underneath the
  familiar evaluate() interface. How would you like to use this interface?
  Francesc suggested generator support, as currently it's more difficult to
  use NumExpr within a loop than it should be.

Ideas for the future
-------------------------

* Vectorize real functions (such as exp, sqrt, log), similar to the
  complex_functions.hpp vectorization.
* Add a keyword (likely 'yield') to indicate that a token is intended to be
  changed by a generator inside a loop with each call to NumExpr.run().

If you have any thoughts or find any issues, please don't hesitate to open
an issue at the GitHub repo. Although unit tests have been run over the
operation space, there are undoubtedly a number of bugs to squash.

Sincerely,

Robert

-- 
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
robert.mcleod@unibas.ch
robert.mcleod@bsse.ethz.ch <robert.mcleod@ethz.ch>
robbmcleod@gmail.com
