
> Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.

So the two mallocs were already mentioned in the project introduction. I got that wrong.
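To make the combined-allocation idea concrete, here is a rough sketch in Python rather than C. It is only an analogy: the two-field header layout and the name alloc_combined are made up for illustration and do not match the real ndarray object. One block holds both the header and the data, and the data view is obtained by offsetting past the header.

```python
import struct

# Hypothetical header: element count and item size, 8 bytes each.
# The real ndarray object header is larger and laid out differently.
HEADER = struct.Struct("=QQ")


def alloc_combined(n_items, itemsize=8):
    """Allocate header and data as one block; return (block, header, data)."""
    total = HEADER.size + n_items * itemsize
    block = bytearray(total)              # the single allocation
    HEADER.pack_into(block, 0, n_items, itemsize)
    view = memoryview(block)
    header = view[:HEADER.size]
    data = view[HEADER.size:]             # "pointer arithmetic": base + header size
    return block, header, data


block, header, data = alloc_combined(4)
doubles = data.cast("d")                  # treat the data region as 4 doubles
doubles[0] = 1.5                          # writes into the shared block
```

The data view does not own its memory (data.obj is the shared block), which is roughly what clearing OWNDATA expresses: the buffer must not be freed on its own.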
> magnitude more time in inefficient loop selection and unnecessary writes to the FP control word?

Loop selection contributes around 2-3% of the time. I implemented a cache with PyThreadState_GetDict(), but it didn't help. Even generating a prepopulated dict/list in code_generator/generate_umath.py does not help.
Here is the distribution of time for addition operations. All memory-related and BuildValue operations cost more than 7% each; the remaining looping ones are around 2-3%:

- PyUFunc_AdditionTypeResolver (7.6%)
- *SimpleBinaryOperationTypeResolver (6.2%)*
- *execute_legacy_ufunc_loop (20.7%)*
- trivial_three_operand_loop (8.6%); this will be around 3.4% when PR #3521 <https://github.com/numpy/numpy/pull/3521> gets merged
- *PyArray_NewFromDescr (7.3%)*
- PyUFunc_DefaultLegacyInnerLoopSelector (2.5%)
- PyUFunc_GetPyValues (12.0%)
- *_extract_pyvals (9.2%)*
- *PyArray_Return (14.3%)*

-- 
Arink Verma
www.arinkverma.in