mixed-mode arithmetic
I've been browsing the numpy source, and I'm wondering about mixed-mode arithmetic on arrays. As I understand it, numpy never does mixed arithmetic directly; instead it converts the operands to a common type. Arguably that might be acceptable for a mix of, say, double and float. But for a mix of complex and a real scalar type (say, CDouble * Double), it is clearly suboptimal: the real operand is widened to complex, and the multiply then does roughly twice the floating-point work it needs to. Do I understand this correctly? If so, is that something we should improve?
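(For concreteness, a quick sketch of the promotion behavior being described; `np.result_type` exposes the common type that numpy converts to:)

    import numpy as np

    a = np.ones(4, dtype=np.complex128)
    b = np.ones(4, dtype=np.float64)
    print(np.result_type(a, b))  # complex128 -- the common type both operands become
    print((a * b).dtype)         # complex128: the real operand is widened to complex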
Neal Becker wrote:
> [...] for a mix of complex and a real scalar type (say, CDouble * Double), it is clearly suboptimal. Do I understand this correctly? If so, is that something we should improve?
Reviving this old thread - I note that numpy.dot supports in-place computation for performance reasons, like this:

    c = np.empty_like(a, order='C')
    np.dot(a, b, out=c)

However, the data type of the pre-allocated c array must match the result dtype of a times b. Now, with some accelerator hardware (e.g. tensor cores or matrix-multiplication engines in GPUs), mixed-precision arithmetic with relaxed floating-point semantics (i.e. not necessarily IEEE 754 conformant) but faster performance is possible, and could be supported in downstream libraries such as cupy.

Case in point: a mixed-precision calculation may take half-precision inputs but accumulate in, and return, full-precision outputs. Due to the above-mentioned type consistency requirement, the outputs would be unnecessarily demoted (truncated) to half precision again. The current numpy API does not expose mixed-precision concepts. It would therefore be nice if it were possible to build in support for hardware-accelerated linear algebra, even if that is not available on the standard (CPU) platforms numpy is typically compiled for.

I'd be happy to flesh out some API concepts, but would be curious to first get an opinion from others. It may be necessary to weigh the complexity of adding such support explicitly against providing minimal hooks for add-on libraries in the style of JMP (for jax.numpy) or AMP (for torch).

Jens
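(For concreteness, a minimal sketch of the restriction being described; the exact error text varies across NumPy versions:)

    import numpy as np

    a = np.ones((4, 4), dtype=np.float16)
    b = np.ones((4, 4), dtype=np.float16)
    c = np.empty((4, 4), dtype=np.float32)  # hoping to accumulate in full precision
    np.dot(a, b, out=c)  # raises ValueError: out must have the dtype that
                         # dot(a, b) would return (float16 here), so the
                         # fp32 accumulator is rejected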
On 9/7/23 23:34, glaserj--- via NumPy-Discussion wrote:
> Case in point: a mixed-precision calculation may take half-precision inputs but accumulate in, and return, full-precision outputs. Due to the above-mentioned type consistency requirement, the outputs would be unnecessarily demoted (truncated) to half precision again. [...]
If your goal is "accumulate in and return full precision outputs", then you can allocate c as the full-precision type and NumPy should do the right thing. Note that it may convert the entire input array to the final dtype, rather than doing it "on the fly", which could be expensive in terms of memory.

Matti
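(The explicit-upcast equivalent, illustrating the memory cost Matti mentions — both inputs are materialized as float32 copies, each twice the size of its fp16 original, before the multiply:)

    import numpy as np

    a = np.ones((1024, 1024), dtype=np.float16)
    b = np.ones((1024, 1024), dtype=np.float16)
    # Upcast first, then multiply: accumulates in float32, at the cost of
    # two temporary float32 copies of the inputs.
    c = np.dot(a.astype(np.float32), b.astype(np.float32))
    print(c.dtype)  # float32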
Hi Matti,

The documentation for numpy.dot currently states:

"""
out : ndarray, optional
    Output argument. This must have the exact kind that would be returned
    if it was not used. In particular, it must have the right type, must be
    C-contiguous, and its dtype must be the dtype that would be returned
    for dot(a,b). This is a performance feature. Therefore, if these
    conditions are not met, an exception is raised, instead of attempting
    to be flexible.
"""

I think this means that if dot(a, b) returned FP32 for FP16 inputs, it would be consistent with this API to supply a full-precision output array. All that would be needed in an actual implementation is a mixed_precision flag (or an output_dtype option) for this op to override the usual type-promotion rules. Do you agree?

Jens
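(A hypothetical sketch of that proposal — note that neither a dtype= argument to np.dot nor a mixed_precision flag exists today; the names here are illustrative only:)

    import numpy as np

    a = np.ones((128, 128), dtype=np.float16)
    b = np.ones((128, 128), dtype=np.float16)
    c = np.empty((128, 128), dtype=np.float32)
    # Hypothetical: override the promotion rules so that fp16 inputs are
    # accumulated in, and returned as, fp32 without upcasting the operands.
    np.dot(a, b, out=c, dtype=np.float32)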
On 10/7/23 16:13, Jens Glaser via NumPy-Discussion wrote:
> I think this means that if dot(a, b) returned FP32 for FP16 inputs, it would be consistent with this API to supply a full-precision output array. [...] Do you agree?
`np.dot` is strange. Could you use `np.matmul` instead, which is a real ufunc and (I think) already does this?

Matti
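(A quick check of that suggestion — `np.matmul` is a (g)ufunc and accepts a dtype= override, though as Robert explains below this upcasts the operands rather than running a genuinely mixed-precision loop:)

    import numpy as np

    a = np.ones((2, 2), dtype=np.float16)
    b = np.ones((2, 2), dtype=np.float16)
    c = np.matmul(a, b, dtype=np.float32)  # accepted: inputs are cast up to float32
    print(c.dtype)                         # float32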
On Tue, Jul 11, 2023 at 10:11 AM Matti Picus <matti.picus@gmail.com> wrote:
> On 10/7/23 16:13, Jens Glaser via NumPy-Discussion wrote:
> > I think this means that if dot(a, b) returned FP32 for FP16 inputs, it would be consistent with this API to supply a full-precision output array. [...] Do you agree?
>
> `np.dot` is strange. Could you use `np.matmul` instead, which is a real ufunc and (I think) already does this?
Sort of. As currently implemented, no, it won't do what Jens wants, because there is no `ee->f` loop implemented (`e` is the typecode for float16, `f` for float32). Passing float16 operands and requesting a float32 output (either with `dtype=np.float32` or by providing such an `out=`) will fall down to the `ff->f` loop and cause upcasting of the operands, which is not what they want. But notionally one could add an `ee->f` loop between those two that would catch this case when `dtype=np.float32` is requested.

-- Robert Kern
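(The registered loops can be inspected directly; the exact list varies by NumPy version and platform, but no mixed `ee->f` entry appears:)

    import numpy as np

    print(np.matmul.types)
    # e.g. [..., 'ee->e', 'ff->f', 'dd->d', ...] -- no 'ee->f' loop is registered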
On Mon, Jul 10, 2023 at 1:49 AM Matti Picus <matti.picus@gmail.com> wrote:
> On 9/7/23 23:34, glaserj--- via NumPy-Discussion wrote:
> > Case in point: a mixed-precision calculation may take half-precision inputs but accumulate in, and return, full-precision outputs. [...]
>
> If your goal is "accumulate in and return full precision outputs", then you can allocate c as the full-precision type and NumPy should do the right thing. Note that it may convert the entire input array to the final dtype, rather than doing it "on the fly", which could be expensive in terms of memory.
In this case, no, `np.dot()` will raise an exception if the output is not the dtype that `np.dot(a, b)` would naturally produce. It's different from the ufuncs, which do indeed behave as you describe. `np.dot()` implements its own dtype-coercion logic.

Jens, there's nothing that really prevents adding mixed-precision operations. ufuncs let you provide loops with mixed dtypes, mostly to support special functions that take both integer and real arguments (or real and complex). But one could conceivably put in mixed-precision loops. For non-ufuncs like `np.dot()` that implement their own dtype-coercion logic, it's just a matter of coding it up. The main reasons we don't are that it's a lot of work for possibly marginal gain to put in all of the relevant permutations, and that it complicates an already over-complicated and not-fully-coherent type-promotion scheme. `np.dot()` is kind of an oddball already, and "half-precision inputs -> full-precision outputs" might be a worthwhile use case given hardware accelerators.

Given that this largely affects non-numpy implementations of the Array API, you probably want to raise it with that group. numpy can implement that logic if the Array API requires it.

-- Robert Kern
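(An existing instance of the mixed-dtype loops Robert mentions — `np.ldexp` takes a float argument and an integer argument; the exact type list varies by platform:)

    import numpy as np

    print(np.ldexp.types)                          # e.g. ['ei->e', 'fi->f', 'di->d', ...]
    print(np.ldexp(np.float32(0.5), np.int32(3)))  # 4.0 == 0.5 * 2**3, float32 result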