Thanks for your thoughtful reply, Dima! I work on embedded systems as a firmware engineer and I've found that all of our products use serial UART and as you say, 99% of the HW is USB dongles, primarily from FTDI or Silabs. It seems that the development of the drivers was focused on using the Windows/Linux serial abstractions to present a comfortable interface to the user. I am not interested in second guessing this decision and would rather leave the USB layer alone and instead work with the drivers as provided.
I believe that I can best explain my desire for an asyncio implementation with a few examples. Then I will work through the asyncio source code to further my understanding of this abstraction (which is perhaps my favorite abstraction in all of programming!)
The ubiquitous implementation of serial comms in Python, pySerial, allows the user to write() bytes to a serial interface. Because the outgoing OS buffer is fairly large (4096 bytes), write() generally blocks only for the time it takes to copy the bytes into that buffer and returns well before the serial device has actually transmitted the signal on its TX line. To obtain a kind of synchronization, the library defines a flush() method that waits until the OS TX buffer is empty. In the Windows implementation this is a polling busy wait at 50 ms intervals, while on Linux it blocks on tcdrain(). My method of waiting on the Windows OS event using loop._proactor.wait_for_handle(overlapped_write.hEvent) allows the Windows implementation to discard the busy wait in favor of event signaling.
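To make the cost concrete, here is the blocking pattern as it stands today (a minimal sketch; the port name and baud rate are placeholders):

```python
# Today's blocking pattern: write() returns once the bytes are copied into
# the OS TX buffer; flush() then blocks until that buffer has drained
# (a 50 ms polling loop on Windows, tcdrain() on POSIX).
import serial

ser = serial.Serial("COM3", baudrate=115200)  # placeholder port and baud rate
ser.write(b"\x01\x02\x03")
ser.flush()  # returns only once the OS reports the TX buffer empty
ser.close()
```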
Since pySerial isn't asyncio-ready, programmers who need concurrency in Python typically reach for threads or asyncio thread pools. As FW engineers working with embedded systems, we would like to have an async generator "task" that reads a byte stream from a serial device, as well as an awaitable write method that completes only when the bytes have actually been put on the transport. It seems to me that Windows and POSIX each already handle the completion of read/write events, so Python implementations that create threads just to wait on them are inelegant. For example, to adapt pySerial to be asyncio friendly, there is a new project, aioserial, that I will be contributing to. So far it uses a thread pool to wrap the old pySerial library, wrapping function calls in loop.run_in_executor() in order to return awaitables (the pattern is sketched below). My hope is that with guidance from the asyncio team I can bring a well-supported async implementation to Python serial IO.
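The executor-based pattern looks roughly like this (a minimal sketch of the approach, not aioserial's actual code; the class name and port are placeholders):

```python
# Wrapping blocking pySerial calls in the default thread-pool executor.
# Every await hands the blocking call off to a worker thread.
import asyncio
import serial

class ExecutorSerial:
    def __init__(self, port: str, baudrate: int = 115200):
        self._ser = serial.Serial(port, baudrate=baudrate)

    async def read(self, size: int = 1) -> bytes:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._ser.read, size)

    async def write(self, data: bytes) -> int:
        loop = asyncio.get_running_loop()
        n = await loop.run_in_executor(None, self._ser.write, data)
        # Still the 50 ms busy wait on Windows, just moved to a worker thread.
        await loop.run_in_executor(None, self._ser.flush)
        return n

async def main():
    dev = ExecutorSerial("COM3")  # placeholder port
    await dev.write(b"ping")
    print(await dev.read(4))

# asyncio.run(main())
```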
At this point, I admit that I may have lost perspective by working on systems with 32K of RAM where each thread is absolutely precious, powerful, and dangerous! Perhaps these days it is OK to spawn new threads as needed at runtime to wait on an OS thread that is itself waiting on a HW event. If the consensus is that a python-thread-based approach is best, then we don't need to look much further than wrapping IO in loop.run_in_executor()! Nevertheless, I will continue to explore the implementation since I am always interested in energy efficiency and beautiful abstraction.

My working implementation uses the _wait_for_handle() method of the IocpProactor class defined here. Let's see how/why the proof of concept is working.
wait_for_handle() receives an overlapped.hEvent created with win32 CreateEvent (note that a total of two would be created: one for reads, one for writes). The event is set up for signaling by calling SetCommMask with the flags EV_RXFLAG | EV_TXEMPTY during initialization and then calling WaitCommEvent with a reference to the overlapped each time new IO begins. This causes the overlapped.hEvent that wait_for_handle() receives to be signaled when the OS completes the IO.

_wait_for_handle() calls RegisterWaitWithQueue(), which wraps the win32 API RegisterWaitForSingleObject. The important bit here is that this API allows registration of a callback function to fire on completion of the event. This callback will be called with the lpParameter argument containing struct PostCallbackData data = {CompletionPort, Overlapped}, *pdata; (line 355 of overlapped.c). And so it gets called with the completion port of self._iocp and a unique address, ov.address, which is NOT the overlapped structure we are originally awaiting, according to the note at line 714: # We only create ov so we can use ov.address as a key for the cache. \ ov = _overlapped.Overlapped(NULL).
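For context, my proof of concept wires these pieces together roughly as follows. This is a sketch only, written here against raw ctypes rather than my actual code: the function name, the single shared event (my working code keeps one per direction), and the assumption of a ProactorEventLoop plus a serial handle opened with FILE_FLAG_OVERLAPPED are all mine.

```python
# Sketch: arm a comm event on an overlapped serial handle and await its
# hEvent through the proactor instead of busy-waiting in flush().
import asyncio
import ctypes
from ctypes import wintypes

kernel32 = ctypes.windll.kernel32
kernel32.CreateEventW.restype = wintypes.HANDLE

EV_RXCHAR = 0x0001   # a character arrived in the RX buffer
EV_TXEMPTY = 0x0004  # the last character left the TX buffer

class OVERLAPPED(ctypes.Structure):
    _fields_ = [
        ("Internal", ctypes.c_void_p),
        ("InternalHigh", ctypes.c_void_p),
        ("Offset", wintypes.DWORD),
        ("OffsetHigh", wintypes.DWORD),
        ("hEvent", wintypes.HANDLE),
    ]

async def wait_comm_event(handle: int) -> int:
    """Await the next masked comm event on `handle`; return the event mask."""
    loop = asyncio.get_running_loop()
    ov = OVERLAPPED()
    ov.hEvent = kernel32.CreateEventW(None, True, False, None)  # manual-reset
    mask = wintypes.DWORD(0)
    try:
        kernel32.SetCommMask(wintypes.HANDLE(handle), EV_RXCHAR | EV_TXEMPTY)
        # For an overlapped handle this typically returns FALSE with
        # ERROR_IO_PENDING; hEvent is signaled when a masked event occurs
        # and `mask` is filled in at completion.
        kernel32.WaitCommEvent(wintypes.HANDLE(handle),
                               ctypes.byref(mask), ctypes.byref(ov))
        # RegisterWaitWithQueue plus the completion-port loop resolve this
        # future; no busy wait and no extra Python thread.
        await loop._proactor.wait_for_handle(ov.hEvent)
        return mask.value
    finally:
        kernel32.CloseHandle(wintypes.HANDLE(ov.hEvent))
```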
So we have seen how a callback is registered by the IocpProactor event loop; now let's understand how this causes the "awaitable future" to complete at the Python layer.
A "future" is created: f = _WaitHandleFuture(ov, handle, wait_handle, self, loop=self._loop). Importantly this calls the Win32 API CreateEvent - for my purposes this seems redundant at first glance, but I am afraid that it may be necessary due to the simple fact the WaitCommEvent does not take a callback! I will have to investigate further. This "future" is an instance of a subclass of _BaseWaitHandleFuture which defines a _poll() method utilizing win32 WaitForSingleObject to poll for signaled state: "If dwMilliseconds is zero, the function does not enter a wait state if the object is not signaled; it always returns immediately."
It's a bit hard to track down, but if I am understanding correctly, the "super loop" of the IocpProactor is its own _poll(). It starts by calling GetQueuedCompletionStatus with an infinite timeout. This may answer one of my main curiosities: is this how the asyncio loop waits for multiple events from multiple threads without creating waiting threads of its own? Anyway, it retrieves from self._cache the "future", the blank overlapped used as the cache key, 0, and the finish_wait_for_handle(trans, key, ov) function created way back in _wait_for_handle().
This callback wraps the default implementation of _BaseWaitHandleFuture._poll(), which wraps WaitForSingleObject, discussed above, and returns True if the event is signaled or False otherwise (I believe False would be an error condition?). Recall that in my implementation, "event" at this stage refers to an EV_TXEMPTY or EV_RXCHAR event, for example, set up by WaitCommEvent and SetCommMask earlier. The future's set_result() will be called with True and the future appended to self._results. Recall that wait_for_handle() returned this very same future to my application layer earlier, so the call to set_result() will cause the application's wait to end.
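Putting my reading of that loop into a standalone sketch (heavily condensed, with error handling and cancellation omitted; the function name and parameters stand in for self._iocp, self._cache, and self._results, so treat it as a paraphrase rather than the actual source):

```python
# Simplified model of what IocpProactor._poll() does with each completion
# packet: drain the completion port, look the packet up in the cache by
# overlapped address, and resolve the matching future via its callback.
import _overlapped  # CPython's internal module backing the proactor

INFINITE = 0xFFFFFFFF

def drain_completions(iocp, cache, results, ms=INFINITE):
    while True:
        status = _overlapped.GetQueuedCompletionStatus(iocp, ms)
        if status is None:
            return  # nothing (more) completed within the timeout
        ms = 0  # after the first packet, only drain what is already queued
        err, transferred, key, address = status
        future, ov, obj, callback = cache.pop(address)
        if not future.done():
            value = callback(transferred, key, ov)  # e.g. finish_wait_for_handle
            future.set_result(value)
            results.append(future)
```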
Although there may be gaps in my understanding of the asyncio IO Completion Ports proactor implementation, by following the code I am confident that my usage of IocpProactor.wait_for_handle() does not create threads in the Python layer. Without this implementation, a programmer wishing to manage concurrency with serial IO must resort to 1) creating and managing an extra thread for each IO direction and device, 2) manually wrapping the serial IO using loop.run_in_executor(), or 3) using the aioserial library that abstracts 2) for them.
I think that creating, managing, and destroying threads only to wait on a few bytes to arrive over a 10KBps transport is overkill.
Is it possible that there is an approach better than using wait_for_handle()? For example, the loop.add_reader(fd, callback, *args) API seems to satisfy my requirements (sketched below for POSIX) but is not supported by IocpProactor. If there is interest, I could look into adding support for that API to IocpProactor. There is also the Streams abstraction, which seems appropriate, but I could not figure out how to hook into it with SetCommMask, WaitCommEvent, and the overlapped structures. Yet another idea is to take what I have learned from the IocpProactor internals and expose them in simplified form in my own implementation, though I'd still need a nice way to get them onto the loop.
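For reference, here is how the read side could look on POSIX with add_reader() and no helper threads (a sketch only; the device path and helper name are placeholders):

```python
# POSIX-only (SelectorEventLoop): register the serial fd with the loop so
# the callback fires whenever bytes become readable.
import asyncio
import serial

def make_reader(loop: asyncio.AbstractEventLoop, ser: serial.Serial, on_data):
    def _readable():
        data = ser.read(ser.in_waiting or 1)
        if data:
            on_data(data)
    ser.timeout = 0                         # non-blocking reads
    loop.add_reader(ser.fileno(), _readable)
    return lambda: loop.remove_reader(ser.fileno())

async def main():
    loop = asyncio.get_running_loop()
    ser = serial.Serial("/dev/ttyUSB0", 115200)  # placeholder device
    stop = make_reader(loop, ser, lambda d: print("rx:", d))
    await asyncio.sleep(5)                  # let data arrive for a while
    stop()

# asyncio.run(main())
```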
A big thanks for following along and aiding my understanding of the asyncio paradigm!
Cheers,
J.P. Hutchins
P.S.: I am focused on Windows because I am not so worried about the POSIX implementation ;). Embedded development always seems to have a Windows machine in the loop anyway.