On Wed, 19 Jul 2017 14:59:52 +0200 Victor Stinner email@example.com wrote:
On Twitter, Raymond Hettinger wrote:
"The decision making process on Python-dev is an anti-pattern, governed by anecdotal data and ambiguity over what problem is solved."
About "anecdotal data", I would like to discuss the Python startup time.
And I would like to step back and examine the general criticism of "anecdotal data". Large software and hardware companies have the resources to conduct comprehensive surveys of how people use their products. For example, Intel might have accumulated millions of traces of critical production x86 code that they want to keep running efficiently (or even keep running at all). Apple might have thousands of third-party applications which they can simulate running on a newer version of whatever OS, core library or pieces of hardware those applications rely on. Even Google may nowadays have hundreds or thousands of critical services written in Go, and they may be able to assess the effect of further changes of the Go runtime on those services (not sure they do, but they would certainly have the resources to).
CPython is a comparatively small, disorganized and volunteer-based community. It doesn't have the resources or organization required to lead such studies on a regular basis. Chances are it never will. So all we can rely on is 1) our respective individual experiences in the field and 2) anecdotal data.
When we rewrote the Python 3 IO stack in C, we were relying on our intuition that high-performance IO is important, and on anecdotal data (micro-benchmarks) that the pure Python IO stack is slow. When Tim or Raymond tweak the lookup function for dicts, they rely on anecdotal data delivered by a few select micro-benchmarks, and their intuition that some use cases need to be fast (for example dicts with string keys or keys made up of consecutive integers). We don't have any hard data that all those optimizations are necessary for the majority of Python applications. I don't think anybody in the world has statistically sound data about the entire body of Python code, or even a sufficiently large and relevant subset thereof (such as "Python code used in production for critical services").
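To make "anecdotal data delivered by a few select micro-benchmarks" concrete, here is a minimal sketch of the kind of measurement meant. It is not the benchmark anyone actually used for the dict work; the helper names (build_dicts, time_lookups), the dict size and the repeat counts are all assumptions picked for illustration. It times lookups on a dict with string keys and on a dict with consecutive integer keys.

```python
import timeit


def build_dicts(n=1000):
    """Return two dicts of size n: one with string keys, one with consecutive int keys.

    Hypothetical helper; the size is an arbitrary choice for illustration.
    """
    str_keys = {"key%d" % i: i for i in range(n)}
    int_keys = {i: i for i in range(n)}
    return str_keys, int_keys


def time_lookups(d, keys, repeat=5, number=200):
    """Return the best-of-`repeat` time (seconds) to look up every key `number` times."""
    def lookup_all():
        for k in keys:
            d[k]  # the operation being measured
    # Taking the minimum reduces noise from other activity on the machine.
    return min(timeit.repeat(lookup_all, repeat=repeat, number=number))


if __name__ == "__main__":
    str_keys, int_keys = build_dicts()
    print("str keys:", time_lookups(str_keys, list(str_keys)))
    print("int keys:", time_lookups(int_keys, list(int_keys)))
```

The numbers this prints depend on the machine and the Python build, and they say nothing about how often real applications hit these code paths. That is exactly the sense in which such data is anecdotal.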
We aren't scientists. We are engineers and have to make do with whatever anecdotes we are aware of (be they from our own experiences, or users' complaints). We can't just say "yes, there seems to be a performance issue, but I'll wait until we have non-anecdotal data that it's important". Because that day will probably never come, and in the meantime our users will have fled elsewhere.