pandas microperformance, do we care?
I've noticed there's been a slow degradation in pandas microperformance as time has gone by. I looked into this when I found that df.icol(i) has been deprecated in favor of df.iloc[:, i].

    df = pd.DataFrame(np.random.randn(10, 5))

So here we go, pandas v0.12:

    %timeit df.icol(2)
    100000 loops, best of 3: 13.5 µs per loop

pandas v0.18:

    %timeit df.icol(2)
    10000 loops, best of 3: 25.4 µs per loop

    In [6]: timeit df.iloc[:, 2]
    10000 loops, best of 3: 60.8 µs per loop

Once upon a time, I spent a lot of time shaving microseconds off some of these data accessor methods. For example, pandas v0.12 again:

    In [17]: s = df[2]

    In [18]: timeit s.get_value(5)
    1000000 loops, best of 3: 609 ns per loop

    In [21]: timeit s[5]
    1000000 loops, best of 3: 860 ns per loop

And pandas v0.18:

    In [15]: timeit s.get_value(5)
    100000 loops, best of 3: 7.17 µs per loop

    In [16]: timeit s[5]
    100000 loops, best of 3: 9.31 µs per loop

I understand that the performance was made worse by adding layers of indirection to enable new features (and fix bugs). I'm hoping that, as part of looking at revamping pandas's internals (and closing the gap to "the metal"), we are able to tighten up some of these "inner loop" methods, preferably back to pandas 0.12-level performance.

It's true that writing a lot of Python for-loops isn't optimal for lots of reasons, but we should avoid overly penalizing users when this does happen.

Thanks,
Wes
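[Editor's note: for anyone who wants to rerun the column-access comparison on a current pandas (where icol() has since been removed), here is a minimal sketch using the stdlib timeit module. It times the surviving accessors only; absolute numbers will of course differ by machine and pandas version.]

    # Sketch of the column-access microbenchmark from the post above,
    # restricted to accessors that still exist in modern pandas.
    import timeit

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(10, 5))

    for stmt in ["df.iloc[:, 2]", "df[2]"]:
        n = 100_000
        t = timeit.timeit(stmt, globals={"df": df}, number=n)
        print(f"{stmt}: {t / n * 1e6:.2f} µs per call")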
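[Editor's note: get_value() was likewise deprecated and later removed. On a modern pandas the supported scalar fast paths are .at and .iat, and dropping to the underlying numpy array avoids pandas dispatch entirely. A hedged sketch of the present-day equivalents, not code from the original post:]

    # Scalar access options on a modern pandas, roughly ordered from
    # most to least per-call overhead.
    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(10))

    v1 = s[5]             # generic __getitem__, most indirection
    v2 = s.at[5]          # label-based scalar accessor
    v3 = s.iat[5]         # positional scalar accessor
    v4 = s.to_numpy()[5]  # raw numpy element access, cheapest per call
    assert v1 == v2 == v3 == v4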
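[Editor's note: on the closing point about Python for-loops, the per-call accessor overhead measured above is exactly what such loops multiply by the element count, while a vectorized call pays the pandas dispatch cost once. An illustrative sketch, not from the original post:]

    # Why per-element accessor cost matters: the loop below makes one
    # pandas accessor call per element; the vectorized sum makes one
    # call total and does the iteration in compiled code.
    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(100_000))

    total = 0.0
    for i in range(len(s)):  # "inner loop" style the post worries about
        total += s.iat[i]

    assert np.isclose(total, s.sum())  # vectorized equivalent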