In any case, we should always include a single-threaded version, because sometimes the computations are already parallelised at a higher level.
Is there a nice way to ship both versions? After all, most implementations of BLAS and friends do spawn OpenMP threads, so I don't think it would be outrageous to take advantage of OpenMP in more places, provided there is a clean way to fall back to a serial version when OpenMP is not available.
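
To make the question concrete, here is a minimal sketch of one possible way to ship both from a single source file. It relies on the standard `_OPENMP` macro, which compilers define only when OpenMP is enabled, so the same code builds as a plain serial loop otherwise. The function name `sum_squares` and the `allow_threads` flag are hypothetical, made up purely for illustration; this is not a proposal for the actual API.

```c
#include <stddef.h>

/* Hypothetical kernel, used only for illustration.
 * allow_threads == 0 forces the serial path even in an OpenMP build,
 * for callers that already parallelise at a higher level. */
double sum_squares(const double *x, size_t n, int allow_threads)
{
    double total = 0.0;
    long i;
    long len = (long)n;  /* signed index keeps older OpenMP compilers happy */

#ifdef _OPENMP
    /* Compiled only when OpenMP is enabled (e.g. -fopenmp).
     * The if() clause decides at run time whether threads are used. */
    #pragma omp parallel for reduction(+:total) if(allow_threads)
#else
    (void)allow_threads;  /* serial build: the flag has no effect */
#endif
    for (i = 0; i < len; ++i)
        total += x[i] * x[i];

    return total;
}
```

With something like this, a build without OpenMP gets the serial loop for free, and in an OpenMP build the caller can still opt out per call, or globally by setting `OMP_NUM_THREADS=1`. Whether a per-call flag or a global setting is the nicer interface is exactly the part I am unsure about.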