parallel distutils extension builds? use gcc -flto
hi,

I have been playing around a bit with gcc's link time optimization (LTO) feature and found that using it actually speeds up a from-scratch build of numpy, thanks to its ability to perform optimization and linking in parallel. As a bonus you should also get faster binaries due to the better optimizations LTO allows. As compiling with LTO requires some possibly lesser-known details, I wanted to share them.

Prerequisites are a working gcc toolchain of at least gcc 4.8 and binutils > 2.21; gcc 4.9 is better as it is faster.

First of all, numpy checks the long double representation by compiling a file and looking at the binary. This won't work here, as the od -b reimplementation used there (pyod) does not understand LTO objects, so on x86 we must short-circuit that check:

--- a/numpy/core/setup_common.py
+++ b/numpy/core/setup_common.py
@@ -174,6 +174,7 @@ def check_long_double_representation(cmd):
     # We need to use _compile because we need the object filename
     src, object = cmd._compile(body, None, None, 'c')
     try:
+        return 'IEEE_DOUBLE_LE'
         type = long_double_representation(pyod(object))
         return type
     finally:

Next we build numpy as usual but override the compiler, linker and ar to add our custom flags. The setup.py call would look like this:

CC='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3' \
LDSHARED='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -shared -O3' \
AR=gcc-ar \
python setup.py build_ext

Some explanation:

The AR override is needed because numpy builds a static library, and ar needs to know about LTO objects; gcc-ar does exactly that.

-flto=4 is the main flag: it tells gcc to perform link time optimization using 4 parallel processes.

-fno-fat-lto-objects tells gcc to build only LTO objects. Normally it builds both an LTO object and a normal object for toolchain compatibility; if our toolchain can handle LTO objects, this is just a waste of time and we skip it. (The flag is the default in gcc 4.9, but not in 4.8.)

-fuse-linker-plugin directs gcc to run its link time optimizer plugin in the linking step. The linker must support plugins, which both the bfd linker (> 2.21) and the gold linker do. This allows for more optimizations.

-O3 has to be passed to the linker too, as that is where the optimization actually occurs.

In general, a pitfall with LTO is that the compiler options of all compilation steps must match the flags used for linking (see the standalone sketches at the end of this message). If you are using C++ or gfortran you also have to override those to use LTO (CXX and, presumably, F90).

See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for a lot more details.

For some numbers: on my machine a from-scratch numpy build with no caching takes 1min55s; with LTO on 4 processes it only takes 55s. Pretty neat for a much more involved optimization process.

Concerning the speed gain of the resulting binaries, I ran our benchmark suite with this build. There were no really significant gains, which is somewhat expected, as numpy is simple C code with most function bottlenecks already inlined.

So, conclusion: -flto seems to work well with recent gccs and allows for faster builds using the limited distutils. While probably not useful for development, where compiler caching (ccache) is of utmost importance, it is still interesting for projects doing one-shot uncached builds (Travis-like CI) that have huge objects (e.g. from swig or cython) and don't want to change to proper parallel build systems like bento.

PS: As far as I know clang also supports LTO, but I have never used it.

PPS: Using NPY_SEPARATE_COMPILATION=0 crashes gcc 4.9; time for a bug report.

Cheers,
Julian
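For completeness, here is the same recipe as a by-hand build, a minimal sketch that makes the flags-must-match rule visible. a.c and b.c are hypothetical stand-in sources, the flag values are the ones discussed above, and -fPIC is what distutils would add for extension objects anyway:

# compile: emit slim LTO objects (GIMPLE bytecode instead of machine code)
gcc -O3 -flto=4 -fno-fat-lto-objects -fPIC -c a.c
gcc -O3 -flto=4 -fno-fat-lto-objects -fPIC -c b.c

# link: the optimization flags are repeated here, since this is where the
# actual cross-module optimization and code generation happen
gcc -O3 -flto=4 -fuse-linker-plugin -shared a.o b.o -o ext.so

If the two steps disagree (say, -O2 at compile time and -O3 at link time), gcc may behave differently than you expect, so deriving both command lines from a single flag set is the safe pattern.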
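And since keeping several environment variables in sync by hand is error prone, a small wrapper along these lines helps. This is only a sketch: the CXX and F90 overrides for C++/gfortran builds are my assumption of the variables distutils/numpy.distutils honor, and gcc-ranlib is an addition the recipe above did not need:

#!/bin/sh
# one flag set, reused for every step so compile and link options match
LTO='-fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3'

CC="gcc $LTO" \
CXX="g++ $LTO" \
F90="gfortran $LTO" \
LDSHARED="gcc $LTO -shared" \
AR=gcc-ar \
RANLIB=gcc-ranlib \
python setup.py build_ext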