>Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.
So those are the two mallocs mentioned in the project introduction. I had got that wrong.
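
To make sure I now read the suggestion correctly, here is a rough sketch of the single-allocation idea. It uses a toy struct, not NumPy's real array object: `toy_array`, `round_up`, `toy_array_new` and `toy_array_free` are made-up names, and the 16-byte alignment is an assumption about what malloc provides on common 64-bit platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the array object header (not NumPy's struct). */
typedef struct {
    char *data;     /* points into the same allocation as the header */
    size_t nbytes;  /* size of the data buffer in bytes */
    int owndata;    /* 0: the buffer must not be freed on its own */
} toy_array;

/* Round n up to a multiple of align; align must be a power of two. */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Allocate header and buffer with one malloc instead of two. */
static toy_array *toy_array_new(size_t nbytes)
{
    /* Assumed alignment of 16 bytes for the data buffer. */
    size_t offset = round_up(sizeof(toy_array), 16);
    toy_array *arr;

    if (nbytes > SIZE_MAX - offset) {
        return NULL;  /* total size would overflow */
    }
    arr = (toy_array *)malloc(offset + nbytes);
    if (arr == NULL) {
        return NULL;
    }
    /* Pointer arithmetic: the buffer starts right after the padded header. */
    arr->data = (char *)arr + offset;
    arr->nbytes = nbytes;
    /* The buffer is part of the header's block, so it is not "owned". */
    arr->owndata = 0;
    return arr;
}

/* A single free releases both the header and the buffer. */
static void toy_array_free(toy_array *arr)
{
    free(arr);
}
```

With this layout, creating an array costs one malloc and destroying it one free. Clearing the ownership flag is what stops the deallocation path from trying to free the buffer a second time.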

>magnitude more time in inefficient loop selection and unnecessary writes to the FP control word?
Loop selection contributes around 2-3% of the time. I implemented a cache with PyThreadState_GetDict(), but it didn't help.
Even generating a prepopulated dict/list in code_generator/generate_umath.py does not help.
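
For reference, the kind of caching meant here can be sketched in plain C as below. This is only an illustration, not the code I tried: it uses a one-entry static cache instead of the per-thread dict (so it is not thread-safe), and every name in it (`TOY_INT`, `loop_fn`, `add_loops`, `find_loop_slow`, `find_loop_cached`, `n_table_scans`) is hypothetical.

```c
#include <stddef.h>

/* Hypothetical type codes; not NumPy's type numbers. */
enum { TOY_INT = 0, TOY_DOUBLE = 1 };

typedef void (*loop_fn)(const void *a, const void *b, void *out, size_t n);

static void add_int(const void *a, const void *b, void *out, size_t n)
{
    const int *x = (const int *)a;
    const int *y = (const int *)b;
    int *z = (int *)out;
    size_t i;
    for (i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}

static void add_double(const void *a, const void *b, void *out, size_t n)
{
    const double *x = (const double *)a;
    const double *y = (const double *)b;
    double *z = (double *)out;
    size_t i;
    for (i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}

typedef struct {
    int type_a;
    int type_b;
    loop_fn fn;
} loop_entry;

/* Table of registered loops for "add", searched linearly. */
static const loop_entry add_loops[] = {
    {TOY_INT, TOY_INT, add_int},
    {TOY_DOUBLE, TOY_DOUBLE, add_double},
};

/* Counts full table scans so the effect of the cache can be observed. */
static size_t n_table_scans = 0;

/* Uncached selection: scan the table for a matching type pair. */
static loop_fn find_loop_slow(int type_a, int type_b)
{
    size_t i;
    n_table_scans++;
    for (i = 0; i < sizeof(add_loops) / sizeof(add_loops[0]); i++) {
        if (add_loops[i].type_a == type_a && add_loops[i].type_b == type_b) {
            return add_loops[i].fn;
        }
    }
    return NULL;
}

/* Cached selection: remember the last successful lookup. */
static loop_fn find_loop_cached(int type_a, int type_b)
{
    static loop_entry last = {-1, -1, NULL};
    loop_fn fn;

    if (last.fn != NULL && last.type_a == type_a && last.type_b == type_b) {
        return last.fn;  /* cache hit: no table scan */
    }
    fn = find_loop_slow(type_a, type_b);
    if (fn != NULL) {
        last.type_a = type_a;
        last.type_b = type_b;
        last.fn = fn;
    }
    return fn;
}
```

A cache like this only removes the table scan. If selection is just 2-3% of the total, the best case is a saving of that size, which fits with the cache not giving a visible improvement.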

Here is the distribution of time for addition operations. All memory-related and BuildValue operations cost more than 7%; the remaining looping ones are around 2-3%:

Arink Verma