-----Original Message-----
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Guido van Rossum
Sent: 30 October 2012 17:47
To: Kristján Valur Jónsson
Cc: firstname.lastname@example.org
Subject: Re: [Python-ideas] non-blocking buffered I/O
On Tue, Oct 30, 2012 at 9:11 AM, Kristján Valur Jónsson email@example.com wrote:
By the way: We found that acquiring the GIL by a random external thread in response to the IOCP to wake up tasklets was incredibly expensive. I spent a lot of effort figuring out why that is and found no real answer. The mechanism we now use is to let the external worker thread schedule a "pending call" which is serviced by the main thread at the earliest opportunity. Also, the main thread is interrupted if it is doing a sleep. This is much more efficient.
In which Python version? The GIL has been redesigned at least once. Also the latency (not necessarily cost) to acquire the GIL varies by the sys.setswitchinterval setting. (Actually the more responsive you make it, the more it will cost you in overall performance.)
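For reference, a minimal sketch of adjusting the switch interval from Python code. It assumes Python 3.2 or later, where sys.getswitchinterval() and sys.setswitchinterval() exist; in 2.7 the nearest counterpart is sys.setcheckinterval(), which counts bytecode instructions rather than seconds. The helper name run_with_switch_interval is made up for this illustration.

```python
import sys


def run_with_switch_interval(seconds, func):
    """Run func() with a temporary GIL switch interval, then restore the old one."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func()
    finally:
        # Always restore, so the rest of the process keeps its original setting.
        sys.setswitchinterval(previous)


# Hypothetical usage: read back the interval that is in effect inside the call.
inside = run_with_switch_interval(0.001, sys.getswitchinterval)
```

A smaller interval makes a waiting thread get the GIL sooner, at the price of more frequent switching overall.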
I do think that using the pending call mechanism is the right solution here.
I am talking about 2.7, of course, the python of hard working lumberjacks everywhere :)
Anyway, I don't think the issue is much affected by the particular GIL implementation.

Alternative a)
  1. Callback comes on arbitrary thread.
  2. Arbitrary thread calls PyGILState_Ensure. (This causes a _dynamic thread state_ to be generated for the arbitrary thread, and the GIL to be subsequently acquired.)
  3. Arbitrary thread does whatever python gymnastics are required to complete the IO (wake up tasklet).
  4. Arbitrary thread calls PyGILState_Release.
For whatever reason, this approach _increased CPU usage_ on a loaded server. Latency was fine, throughput the same, and the delay in actual GIL acquisition was ok. I suspect that the problem lies with the dynamic acquisition of a thread state, and other initialization that may occur. I did experiment with keeping a cache of unused threadstates at the ready for external threads, but it didn't get me anywhere. This could also be the result of cache thrashing or something that doesn't show up immediately on a multicore CPU.
Alternative b)
  1. Callback comes on arbitrary thread.
  2. External thread calls PyEval_SchedulePendingCall(). This grabs a static lock, puts in a record, and signals to python that something needs to be done immediately.
  3. External thread calls a custom function to interrupt the main thread in the IO bound application, currently most likely sleeping in a WaitForMultipleObjects() with a timeout.
  4. Main thread wakes up from its sleep (if it was sleeping).
  5. Main thread runs python code, causing it to immediately service the scheduled pending call, causing it to perform the wait.
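The shape of alternative b) can be sketched in pure Python, purely as an analogy: the real mechanism lives in C, and the documented CPython C API entry point for scheduling such a call is Py_AddPendingCall. All names below (schedule_pending_call, service_pending_calls, main_loop_once) are hypothetical, and a threading.Event stands in for interrupting WaitForMultipleObjects().

```python
import threading
from collections import deque

_pending = deque()                 # records queued by external threads
_pending_lock = threading.Lock()   # the "static lock" guarding the queue
_wakeup = threading.Event()        # stands in for interrupting the main thread's sleep


def schedule_pending_call(func, arg):
    """Called from any thread: queue work for the main thread and wake it."""
    with _pending_lock:
        _pending.append((func, arg))
    _wakeup.set()


def service_pending_calls():
    """Called on the main thread: run everything queued so far."""
    with _pending_lock:
        calls = list(_pending)
        _pending.clear()
    for func, arg in calls:
        func(arg)
    return len(calls)


def main_loop_once(timeout):
    """Sleep until interrupted or timed out, then service pending calls."""
    _wakeup.wait(timeout)
    _wakeup.clear()
    return service_pending_calls()
```

The external thread only touches the small lock and the event; all Python-level work for the completion happens on the main thread.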
In reality, StacklessIO uses a slight variation of the above:
StacklessIO dispatch system
  1. Callback comes on arbitrary thread.
  2. External thread schedules a completion event in its own "dispatch" buffer to be serviced by the main thread. This is protected by its own lock, and doesn't need the GIL.
  3. External thread calls PyEval_SchedulePendingCall() to "tick" the dispatch buffer.
  4. External thread calls a custom function to interrupt the main thread in the IO bound application, currently most likely sleeping in a WaitForMultipleObjects() with a timeout.
  5. If main thread is sleeping:
     - Main thread wakes up from its sleep.
     - Immediately after sleeping, the main thread will 'tick' the dispatch queue.
     - After ticking, tasklets may have been made runnable, so the main thread may continue out into the main loop of the application to do work. If not, it may continue sleeping.
  6. Main thread runs python code, causing it to immediately service the scheduled pending call, which will tick the dispatch queue. This may be a no-op if the main thread was sleeping and was already ticked.
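As a rough illustration of the dispatch buffer idea, here is a pure-Python sketch. It is not StacklessIO's code: the class DispatchBuffer and its methods are invented for this example, a threading.Event stands in for the interruptible sleep, and "making a tasklet runnable" is reduced to appending to a list.

```python
import threading


class DispatchBuffer:
    """Completion events posted by external threads, ticked by the main thread."""

    def __init__(self):
        self._lock = threading.Lock()   # the buffer's own lock
        self._events = []               # completion events not yet ticked
        self._wake = threading.Event()  # interrupts the main thread's sleep
        self.runnable = []              # stand-in for tasklets made runnable

    def post(self, tasklet_name, result):
        """External thread: record a completion event and wake the main thread."""
        with self._lock:
            self._events.append((tasklet_name, result))
        self._wake.set()

    def tick(self):
        """Main thread: move all buffered events to the runnable list."""
        with self._lock:
            events, self._events = self._events, []
        self.runnable.extend(events)
        return len(events)  # 0 means this tick was a no-op

    def sleep(self, timeout):
        """Main thread: sleep until posted to or timed out, then tick at once."""
        self._wake.wait(timeout)
        self._wake.clear()
        return self.tick()
```

Because tick() swaps the list out under the lock, a second tick right after waking finds nothing to do, which mirrors the no-op case in the last step above.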
The issue we were facing was not with latency (although grabbing the GIL when the main thread is busy is slower than notifying it of a pending call), but with an unexplained increase in CPU usage. A proxy node servicing 2000 clients or upwards would suddenly double or triple its CPU usage.
The reason I'm mentioning this here is that this is important. We have spent quite some time and energy on trying to figure out the most efficient way to complete IOCP from an arbitrary thread and this is the end result. Perhaps things can be done to improve this. Also, it is really important to study these things under real load. Experience has shown me that the most innocuous changes that work well in the lab suddenly start behaving strangely in the field.