Would it be possible to add a method that performs a dot product and adds the result to an existing matrix in a single operation?

For example, C = dot_add(A, B, C) would be equivalent to C += A @ B. This behavior is provided natively by the BLAS *gemm primitive.

The goal is to reduce peak memory consumption. During the computation of C += A @ B, the product A @ B is first stored in a temporary array of the same size as C, so the maximum allocated memory is twice the size of C. Using *gemm to accumulate the result directly into C, the maximum memory consumption is less than 1.5x the size of C.
This difference is significant for large matrices.
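
To illustrate the idea, here is a minimal sketch of what dot_add could look like, assuming SciPy is available and using its low-level scipy.linalg.blas.dgemm wrapper. The name dot_add is only the hypothetical name from this request, not an existing function, and the sketch assumes float64 arrays in Fortran (column-major) order so that the wrapper can write into C without copying it.

```python
import numpy as np
from scipy.linalg.blas import dgemm


def dot_add(A, B, C):
    """Hypothetical helper: compute C += A @ B with a single BLAS dgemm call."""
    # dgemm computes alpha * A @ B + beta * C. With overwrite_c=True the
    # wrapper may reuse C's buffer, provided C is float64 and Fortran-ordered.
    out = dgemm(1.0, A, B, beta=1.0, c=C, overwrite_c=True)
    if not np.shares_memory(out, C):
        # The wrapper had to copy C (wrong dtype or layout): write the result back.
        C[...] = out
    return C


rng = np.random.default_rng(0)
A = np.asfortranarray(rng.standard_normal((4, 3)))
B = np.asfortranarray(rng.standard_normal((3, 5)))
C = np.asfortranarray(rng.standard_normal((4, 5)))

expected = C + A @ B  # reference result, computed with a temporary
dot_add(A, B, C)      # C is updated in place
```

For C-ordered arrays the same call can be applied to the transposes, since C.T += B.T @ A.T and C.T is Fortran-ordered when C is C-contiguous. A built-in method would avoid having to deal with these layout details by hand.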

Would anyone else be interested in this?