Help speeding up altered groupby.value_counts
All,

I'm a new contributor to pandas and have been working to fix a couple of bugs in the value_counts methods (pull request https://github.com/pandas-dev/pandas/pull/33652). I'm looking for a bit of help in maintaining speedy performance for https://github.com/DataInformer/pandas-1/blob/value_counts_normalize/pandas/core/groupby/generic.py

The SeriesGroupBy.value_counts method required a significant rewrite in order to achieve correct behavior with dropna and normalize. After fixing that, I was asked to run performance tests, which unfortunately show a significant performance hit for that method. I have been looking at how to close that gap as much as possible, but I've found only a few minor tweaks.

When I profile with cProfile, I don't notice any clear offenders: numpy array functions take a lot of time in total (numpy.core._multiarray_umath.implement_array_function), but I don't see any particular function that is slow. Similarly, timeit experiments suggest that array concatenation is relatively slow, but not much different from alternatives such as appending inside the next function call. For example, I can write either np.diff(np.nonzero(np.r_[changes, True])) or np.diff(np.nonzero(changes), append=len(changes)), and there's not much of a timing difference between the two. I have tried to do as little as possible with the MultiIndex, rebuilding it only at the end.

I would welcome any help or suggestions for how to make SeriesGroupBy.value_counts faster.

Thanks,
Evan Fuller
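P.S. For concreteness, here is a minimal sketch of the two run-length variants compared above. It assumes `changes` is a boolean array that is True at the first element of each run; the function names and the sample arrays are made up for illustration and are not taken from the pandas code. It also indexes the np.nonzero result with [0] to get a 1-D array, which the one-liners above omit.

```python
import timeit

import numpy as np


def run_lengths_concat(changes):
    """Run lengths via concatenating a sentinel True with np.r_ (hypothetical helper)."""
    return np.diff(np.nonzero(np.r_[changes, True])[0])


def run_lengths_append(changes):
    """Run lengths via np.diff's `append` argument (hypothetical helper)."""
    return np.diff(np.nonzero(changes)[0], append=len(changes))


# Small made-up example: runs start at positions 0, 3 and 4 in an array of length 6.
changes = np.array([True, False, False, True, True, False])
print(run_lengths_concat(changes))  # [3 1 2]
print(run_lengths_append(changes))  # [3 1 2]

# Rough timing on a larger made-up array; absolute numbers depend on the machine.
rng = np.random.default_rng(0)
big = rng.random(100_000) < 0.1
big[0] = True  # the first element always starts a run
for fn in (run_lengths_concat, run_lengths_append):
    elapsed = timeit.timeit(lambda: fn(big), number=100)
    print(f"{fn.__name__}: {elapsed:.4f} s for 100 calls")
```

Both helpers return the same array for any input whose first element is True, so the choice between them is purely a matter of speed.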