I propose a multi-step process for implementation:
1. Write a correct, working version of the core __array_function__ machinery in pure Python, adapted from the example implementation in the NEP.
2. Rewrite bottlenecks in C.
3. Implement overrides for various NumPy functions.
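
To make step 1 concrete, here is a minimal sketch of what the pure-Python dispatch machinery could look like, loosely following the example in the NEP. It is simplified and is not the actual NumPy code: the helper names are illustrative, and the `total` function and `MyArray` class are hypothetical stand-ins for a NumPy function and a duck array.

```python
import functools


def get_overloaded_types_and_args(relevant_args):
    """Collect the unique argument types that define __array_function__."""
    overloaded_types = []
    overloaded_args = []
    for arg in relevant_args:
        arg_type = type(arg)
        if (arg_type not in overloaded_types
                and hasattr(arg_type, '__array_function__')):
            overloaded_types.append(arg_type)
            # Try subclasses before their superclasses; otherwise keep
            # left-to-right order.
            index = len(overloaded_args)
            for i, old_arg in enumerate(overloaded_args):
                if issubclass(arg_type, type(old_arg)):
                    index = i
                    break
            overloaded_args.insert(index, arg)
    return overloaded_types, overloaded_args


def array_function_dispatch(dispatcher):
    """Wrap a function so that its arguments may override it."""
    def decorator(implementation):
        @functools.wraps(implementation)
        def public_api(*args, **kwargs):
            relevant_args = dispatcher(*args, **kwargs)
            types, overloaded_args = get_overloaded_types_and_args(
                relevant_args)
            for arg in overloaded_args:
                result = arg.__array_function__(
                    public_api, types, args, kwargs)
                if result is not NotImplemented:
                    return result
            if overloaded_args:
                # Every override declined to handle this call.
                raise TypeError('no implementation found for {}'.format(
                    implementation.__name__))
            return implementation(*args, **kwargs)
        return public_api
    return decorator


# Hypothetical function standing in for a NumPy function (step 3).
def _total_dispatcher(values):
    return (values,)


@array_function_dispatch(_total_dispatcher)
def total(values):
    return sum(values)


# Hypothetical duck array that overrides `total` only.
class MyArray:
    def __init__(self, data):
        self.data = list(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is total:
            return 'MyArray total: {}'.format(sum(self.data))
        return NotImplemented
```

With this sketch, `total([1, 2, 3])` falls through to the default implementation, while `total(MyArray([1, 2, 3]))` is handled by the override. The per-call work of collecting types and ordering arguments is the kind of overhead that step 2 would move into C.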
I think the first step is mostly complete. I'm particularly looking for help with the second and third steps, which can proceed incrementally and in parallel.
Cheers,
Stephan