Status update on the NumPy & SciPy vs SSE problem?
While the manylinux PEP brings Linux up to comparable standing with Windows and Mac OS X in terms of distributing wheel files through PyPI, that does mean it still suffers from the same problem Windows does in relation to NumPy and SciPy wheels: no standardisation of the SSE capabilities of the machines.
I figured that was independent of the manylinux PEP (since it affects Windows as well), but I'm also curious as to the current status (I found a couple of apparently relevant threads on the NumPy list, but figured it made more sense to just ask for an update rather than trusting my Google-fu).
Cheers, Nick.
P.S. I'm assuming the existing ability to publish NumPy & SciPy wheels for Mac OS X is based on Apple's tighter control of their hardware platform.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Thu, 4 Feb 2016 21:22:32 +1000
Nick Coghlan
I figured that was independent of the manylinux PEP (since it affects Windows as well), but I'm also curious as to the current status (I found a couple of apparently relevant threads on the NumPy list, but figured it made more sense to just ask for an update rather than trusting my Google-fu)
While I'm not a Numpy maintainer, I don't think you can go much further than SSE2 (which is standard under the x86-64 definition).
One factor is support by the kernel. The CentOS 5 kernel doesn't seem to support AVX, so you can't use AVX there even if your processor supports it (as the registers aren't preserved across context switches). And one design point of manylinux is to support old Linux setups... (*)
There are intermediate ISA additions between SSE2 and AVX (additions that don't require OS support), but I'm not sure they help much on compiler-vectorized code as opposed to hand-written assembly. Numpy's pre-compiled loops are typically quite straightforward as far as I've seen.
One mitigation is to delegate some operations to an optimized library implementing the appropriate runtime switches: for example linear algebra is delegated by Numpy and Scipy to optimized BLAS and LAPACK libraries (which exist in various implementations such as OpenBLAS or Intel's MKL).
(*) (this is an issue a JIT compiler helps circumvent: it generates optimal code for the current CPU ;-))
Regards
Antoine.
On Thu, Feb 4, 2016 at 11:42 AM, Antoine Pitrou
On Thu, 4 Feb 2016 21:22:32 +1000 Nick Coghlan
wrote: I figured that was independent of the manylinux PEP (since it affects Windows as well), but I'm also curious as to the current status (I found a couple of apparently relevant threads on the NumPy list, but figured it made more sense to just ask for an update rather than trusting my Google-fu)
While I'm not a Numpy maintainer, I don't think you can go much further than SSE2 (which is standard under the x86-64 definition).
One factor is support by the kernel. The CentOS 5 kernel doesn't seem to support AVX, so you can't use AVX there even if your processor supports it (as the registers aren't preserved across context switches). And one design point of manylinux is to support old Linux setups... (*)
I don't have precise numbers, but I can confirm that from time to time we get customer reports related to AVX not being supported (because of the CPU or the OS).
There are intermediate ISA additions between SSE2 and AVX (additions that don't require OS support), but I'm not sure they help much on compiler-vectorized code as opposed to hand-written assembly. Numpy's pre-compiled loops are typically quite straightforward as far as I've seen.
One mitigation is to delegate some operations to an optimized library implementing the appropriate runtime switches: for example linear algebra is delegated by Numpy and Scipy to optimized BLAS and LAPACK libraries (which exist in various implementations such as OpenBLAS or Intel's MKL).
(*) (this is an issue a JIT compiler helps circumvent: it generates optimal code for the current CPU ;-))
Regards
Antoine.
_______________________________________________ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
On Feb 4, 2016 3:22 AM, "Nick Coghlan"
While the manylinux PEP brings Linux up to comparable standing with Windows and Mac OS X in terms of distributing wheel files through PyPI, that does mean it still suffers from the same problem Windows does in relation to NumPy and SciPy wheels: no standardisation of the SSE capabilities of the machines.
I figured that was independent of the manylinux PEP (since it affects Windows as well), but I'm also curious as to the current status (I found a couple of apparently relevant threads on the NumPy list, but figured it made more sense to just ask for an update rather than trusting my Google-fu)
I'm not entirely sure what the SSE status is of the numpy OSX wheels. I think that they may be just following Apple's guidance on this (in the sense of: we tell their compiler to target a certain OS version and then use default options beyond that), but I'm not sure. It may even differ between the 32- and 64-bit "parts" of the fat binaries. Asking on numpy-discussion might net more details. Otherwise, yeah, the current plan is to jump to SSE2 as the minimum required version as the new wheels become usable, since all the evidence seems to say that it's ubiquitous now.
P.S. I'm assuming the existing ability to publish NumPy & SciPy wheels for Mac OS X is based on Apple's tighter control of their hardware platform.
Not particularly. It's based on (a) Linux wheels aren't allowed on PyPI (modulo bugs -- see pypi issue #385), (b) Windows wheels are impossible because on that platform there's no F/OSS-compatible toolchain that can build CPython-ABI-compatible BLAS or SciPy. So OSX is what's left after all the competitors shot themselves in the foot :-) (manylinux is the solution to (a); mingwpy.github.io is the solution to (b).)
Once that stuff is solved then yeah, it might be nice to have some better solution to the problem of ISA variations. But the most important part is already handled by runtime CPU sniffing in the BLAS library, and for the rest there are a variety of possible solutions (doing our own CPU sniffing, adding something to wheels, etc.), and figuring out which solution is best is just not anywhere near the top of our priority list when we still can't distribute binaries to most users at all :-)
-n
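For readers unfamiliar with "runtime CPU sniffing", here is a minimal Python sketch of the idea: read the CPU feature flags once, then pick the most capable code path the machine supports. It is not the actual logic of OpenBLAS, NumPy, or any other library. The kernel names and the `KERNEL_PREFERENCE` table are hypothetical, and the parser assumes the x86 Linux `/proc/cpuinfo` layout with a `flags` line.

```python
# Hypothetical sketch of runtime CPU dispatch. This is NOT how OpenBLAS or
# NumPy implement it; the kernel names below are made up for illustration.

# Candidate code paths, ordered from most to least demanding.
KERNEL_PREFERENCE = [
    ("avx2", "avx2_kernel"),
    ("avx", "avx_kernel"),
    ("sse2", "sse2_kernel"),
]


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # The first "flags" line is enough for this sketch.
            return set(value.split())
    return set()


def select_kernel(flags, fallback="generic_kernel"):
    """Pick the first (most capable) kernel whose required flag is present."""
    for flag, kernel in KERNEL_PREFERENCE:
        if flag in flags:
            return kernel
    return fallback


def read_host_flags(path="/proc/cpuinfo"):
    """Read flags from the running machine; empty set if the file is absent."""
    try:
        with open(path) as handle:
            return parse_cpu_flags(handle.read())
    except OSError:
        return set()
```

Calling `select_kernel(read_host_flags())` on a Linux x86 machine returns one of the hypothetical kernel names. On platforms without `/proc/cpuinfo` it falls back to the generic path, which is the safe default for a binary that must not crash on older hardware.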
On Thu, Feb 4, 2016 at 9:01 AM, Nathaniel Smith
On Feb 4, 2016 3:22 AM, "Nick Coghlan"
wrote: While the manylinux PEP brings Linux up to comparable standing with Windows and Mac OS X in terms of distributing wheel files through PyPI, that does mean it still suffers from the same problem Windows does in relation to NumPy and SciPy wheels: no standardisation of the SSE capabilities of the machines.
I figured that was independent of the manylinux PEP (since it affects Windows as well), but I'm also curious as to the current status (I found a couple of apparently relevant threads on the NumPy list, but figured it made more sense to just ask for an update rather than trusting my Google-fu)
I'm not entirely sure what the SSE status is of the numpy OSX wheels. I think that they may be just following Apple's guidance on this (in the sense of: we tell their compiler to target a certain OS version and then use default options beyond that), but I'm not sure. It may even differ between the 32- and 64-bit "parts" of the fat binaries. Asking on numpy-discussion might net more details.
I'm more or less responsible for the numpy and scipy OSX wheels. The compiler flags for building come from the compiler flags for Python.org Python via distutils. As Nathaniel says, the big speed problem and opportunity is in the BLAS / LAPACK libraries, and we link against the Accelerate library for this, which comes installed on OSX. This seems to be well-tuned to the underlying hardware. Another option for BLAS / LAPACK is OpenBLAS which can do run-time CPU detection to select the fastest (and not-crashing) code-paths.
Otherwise, yeah, the current plan is to jump to SSE2 as the minimum required version as the new wheels become usable, since all the evidence seems to say that it's ubiquitous now.
Some of that evidence for Windows is listed at https://github.com/numpy/numpy/wiki/Windows-versions
Also, SSE2 instructions are part of the specification of the AMD64 architecture [1] and so, quoting from [2] "The SSE2 instruction set is supported on all 64-bit CPUs and operating systems".
Cheers, Matthew
[1] https://courses.cs.washington.edu/courses/cse351/12wi/supp-docs/abi.pdf
[2] http://www.agner.org/optimize/optimizing_cpp.pdf
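Since SSE2 is part of the AMD64 baseline, a binary can decide from the architecture name alone whether SSE2 is guaranteed. The sketch below illustrates that. The helper name `sse2_guaranteed` is hypothetical, and it relies only on `platform.machine()`, which typically reports `x86_64` on Linux and OS X and `AMD64` on Windows. A 32-bit x86 machine returns False here, meaning "not guaranteed, check at runtime", not "unsupported".

```python
import platform

# Names under which 64-bit x86 is commonly reported by platform.machine().
X86_64_NAMES = {"x86_64", "amd64"}


def sse2_guaranteed(machine=None):
    """Return True when the architecture itself guarantees SSE2.

    Hypothetical helper: on 32-bit x86 (e.g. "i686") it returns False,
    which only means a runtime check is still needed.
    """
    name = (machine or platform.machine()).lower()
    return name in X86_64_NAMES
```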
On 4 February 2016 at 21:22, Nick Coghlan
While the manylinux PEP brings Linux up to comparable standing with Windows and Mac OS X in terms of distributing wheel files through PyPI, that does mean it still suffers from the same problem Windows does in relation to NumPy and SciPy wheels: no standardisation of the SSE capabilities of the machines.
Thanks for the replies, folks!
Checking I've understood the respective updates correctly:
- x86_64 implies SSE2 capability
- most i686 machines still in use are also SSE2 capable
- Accelerate provides native BLAS/LAPACK APIs for Mac OS X
- (ATLAS SSE2 or OpenBLAS) + manylinux should handle Linux
- (ATLAS SSE2 or OpenBLAS) + mingwpy.github.io should handle Windows
- Numba can optimise at runtime to use newer instructions when available
The choice between an SSE2 build of ATLAS and OpenBLAS as the default BLAS/LAPACK implementation doesn't appear to have been made yet, but also shouldn't significantly impact the user experience of the resulting wheels.
Does that sound right?
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, 5 Feb 2016 21:46:54 +1000
Nick Coghlan
Thanks for the replies, folks!
Checking I've understood the respective updates correctly:
- x86_64 implies SSE2 capability
- most i686 machines still in use are also SSE2 capable
- Accelerate provides native BLAS/LAPACK APIs for Mac OS X
- (ATLAS SSE2 or OpenBLAS) + manylinux should handle Linux
- (ATLAS SSE2 or OpenBLAS) + mingwpy.github.io should handle Windows
- Numba can optimise at runtime to use newer instructions when available
The choice between an SSE2 build of ATLAS and OpenBLAS as the default BLAS/LAPACK implementation doesn't appear to have been made yet, but also shouldn't significantly impact the user experience of the resulting wheels.
I'm not sure that's what you're implying, but the choice of a specific BLAS or LAPACK implementation needn't (and shouldn't) be part of manylinux, it's just a choice left to the packager. Bottom line is that the BLAS/LAPACK implementation comes linked into the specific package (or as a separate package dependency, up to the packager's preference). Regards Antoine.
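To illustrate "the BLAS/LAPACK implementation comes linked into the specific package", here is a rough Python sketch that lists shared libraries inside a wheel whose file names suggest a BLAS/LAPACK implementation. It assumes the packager ships the library as a separate file inside the wheel, so a statically linked BLAS would not show up. The name hints, the `bundled_blas_libraries` helper and the demo wheel layout are illustrative assumptions, not a description of how any real NumPy or SciPy wheel is laid out.

```python
import io
import zipfile

# Illustrative substrings only; real library names vary by packager.
BLAS_HINTS = ("blas", "lapack", "mkl")
LIB_SUFFIXES = (".so", ".dylib", ".dll")


def bundled_blas_libraries(wheel_file):
    """List archive members that look like bundled BLAS/LAPACK libraries.

    wheel_file is a path or file-like object; a wheel is a zip archive.
    """
    with zipfile.ZipFile(wheel_file) as wheel:
        names = wheel.namelist()
    found = []
    for name in names:
        base = name.rsplit("/", 1)[-1].lower()
        # Also match versioned names such as "libfoo.so.0".
        is_library = base.endswith(LIB_SUFFIXES) or ".so." in base
        if is_library and any(hint in base for hint in BLAS_HINTS):
            found.append(name)
    return sorted(found)


def make_demo_wheel():
    """Build a tiny in-memory archive with a made-up layout for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as wheel:
        wheel.writestr("pkg/__init__.py", "")
        wheel.writestr("pkg/core.so", b"")
        wheel.writestr("pkg/.libs/libopenblas.so.0", b"")
    buffer.seek(0)
    return buffer
```

For example, `bundled_blas_libraries(make_demo_wheel())` reports only the member under `pkg/.libs/`, since `pkg/core.so` is a library but its name carries no BLAS hint.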
On 6 February 2016 at 22:28, Antoine Pitrou
I'm not sure that's what you're implying, but the choice of a specific BLAS or LAPACK implementation needn't (and shouldn't) be part of manylinux, it's just a choice left to the packager. Bottom line is that the BLAS/LAPACK implementation comes linked into the specific package (or as a separate package dependency, up to the packager's preference).
Oh, nice - I wasn't sure if that was part of the set of external libraries that extension modules needed to agree on (I've never actually built NumPy et al from source myself, I've always used the distro packages or conda).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (5)
- Antoine Pitrou
- David Cournapeau
- Matthew Brett
- Nathaniel Smith
- Nick Coghlan