I've been thinking about this too. I can see the scalability issues with select(), but frankly, poll(), epoll(), and even kqueue() all look similar in O() behavior to me from an API perspective. I guess the differences are in the kernel -- but is it a constant factor or an unfortunate O(N) or worse? To what extent would this be overwhelmed by overhead in the Python code we're writing around it? How bad is it to add extra register()/unregister() (or (modify()) calls per read operation?

The extra system calls add up. The interface of Tornado's IOLoop was based on epoll (where the internal state is roughly a mapping {fd: event_set}), so it requires more register/unregister operations when running on kqueue (where the internal state is roughly a set of (fd, event) pairs). This shows up in benchmarks of the HTTPServer; it's faster on platforms with epoll than platforms with kqueue. In low-concurrency scenarios it's actually faster to use select() even when kqueue is available (or maybe that's a mac-specific quirk).

-Ben