Add closing and iteration to threading.Queue
Hi!

I originally submitted this as a pull request. Raymond Hettinger suggested it should be given a shakeout in python-ideas first.

https://github.com/python/cpython/pull/10018
https://bugs.python.org/issue35034

------

Briefly:

Add a close() method to Queue, which should simplify many common uses of the class and reduce the space for some easy-to-make errors.

Also add an __iter__() method which in conjunction with close() would further simplify some common use patterns.

------

At eye-watering length:

Apologies in advance for the length of this message. This isn't a PEP in disguise, it's a proposal for a very small, simple and I dare imagine uncontroversial feature. I'm new to contributing to Python and after the BPO/github submission I didn't manage to come up with a better way to present it than this.

The issue

Code using threading.Queue often needs to coordinate a "work is finished as far as I care" state between the producing and consuming side. Not "work" in the task_done() sense of completion of processing of queue items, "work" in the simpler sense of just passing data through the queue.

For example, a producer can be driving the communication by enqueuing e.g. names of files that need to be processed, and once it's enqueued the last filename, it can be useful to inform the consumers that no further names will be coming, so after they've retrieved what's in-flight currently, they don't need to bother waiting for any more.

Alternatively, a consumer can be driving the communication, and may need to let the producers know "I'm not interested in any more, so you can stop wasting resources on producing and enqueuing them".

Also, a third, coordinating component may need to let both sides know that "Our collective work here is done. Start wrapping it up y'all, but don't drop any items that are still in-flight."

In practice it's probably the exception, not the rule, when any piece of code interacting with a Queue _doesn't_ have to either inform another component that its interest in transferring the data has ended, or watch for such information.

In the most common case of a producer letting consumers know that it's done, this is usually implemented (over and over again) with sentinel objects, which is at best needlessly verbose and at worst error-prone. A recipe for multiple consumers making sure nobody misses the sentinel is not complicated, but neither is it obvious the first time one needs to do it. When a generic sentinel (None or similar) isn't adequate, some component needs to create the sentinel object and communicate it to the others, which complicates code, and especially complicates interfaces between components that are not being developed together (e.g. if one of them is part of a library and expects the library-user code to talk to it through a Queue).

In the less common cases where the producers are the ones being notified, there isn't even a typical solution - everything needs to be cooked up from scratch using synchronization primitives.

------

A solution

Adding a close() method to the Queue that simply prohibits all further put()'s (with other methods acting appropriately when the queue is closed) would simplify a lot of this in a clean and safe way - for the most obvious example, multi-consumer code would not have to juggle sentinel objects.

Adding a further __iter__() method (that would block as necessary, and stop its iteration once the queue is closed and exhausted) would especially simplify many unsophisticated consumers.
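To make the intended semantics concrete, here is a rough sketch of that behaviour, written as a wrapper over today's Queue rather than as the actual patch (the Closed exception name and the polling in __iter__() are illustrative stand-ins; a real implementation would hook into the queue's internal condition variables instead of polling):

import queue

class Closed(Exception):
    """Illustrative name only: raised by put() on a closed queue."""

class CloseableQueue(queue.Queue):
    def __init__(self, maxsize=0):
        super().__init__(maxsize)
        self._closed = False

    def close(self):
        # Idempotent: closing an already-closed queue is harmless.
        self._closed = True

    def put(self, item, block=True, timeout=None):
        if self._closed:
            raise Closed("put() on a closed queue")
        super().put(item, block, timeout)

    def __iter__(self):
        # Block while the queue is open; stop once it is closed *and* exhausted.
        while True:
            try:
                yield self.get(timeout=0.1)   # crude stand-in for a proper wait
            except queue.Empty:
                if self._closed:
                    return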
This is a current fairly ordinary pattern:

# Producer:
while some_condition:
    q.put(generate_item())
q.put(sentinel)

# Consumer:
while True:
    item = q.get()
    if item == sentinel:
        q.put(sentinel)
        break
    process(item)

(This consumer could be simplified a little with an assignment expression or an iter(q.get, sentinel), but one of those is super new and the other seems little-known in spite of being nearly old enough to vote.)

With the close() and __iter__(), this would become:

# Producer:
with closing(q):
    while some_condition:
        q.put(generate_item())

# Consumer:
for item in q:
    process(item)

Apart from it being shorter and less error-prone (e.g. if generate_item raises), the implied interface for initializing the two components is also simplified, because there's no sentinel to pass around.

More complex consumers that couldn't take advantage of the __iter__() would still benefit from being able to explicitly and readably find out (via exception or querying) that the queue has been closed and exhausted, without juggling the sentinel.

I honestly think this small addition would be an unqualified win. And it would not change anything for code that doesn't want to use it.

------

I've got a sample implementation ready for Queue and its children. (Link is at the start of this message. It includes documentation updates too, in case those clarify anything further at this stage.)

If this is going in the right direction, I'm happy to do the same for SimpleQueue, but I haven't done it yet. I'm still getting my bearings around the C code base.

------

To immediately answer some of Raymond's initial comments at BPO:

This is completely independent from Queue's existing task-tracking protocol. One is about controlling transport, and the other about tracking the completion of processing after transport. I hesitate to call them "orthogonal", but there is no functional overlap between them at all.

I keep talking about "common cases" and such weaselly concepts above, and really it's just conjecture based on my own limited experience and informal conversations. BUT, I've also done a survey of the standard library itself. There are four packages that use threading.Queue: concurrent, idlelib, logging, multiprocessing. All except idlelib have at least one piece that would have benefited from a Queue.close() if it had been available when they were being written.

I've now had a look at a handful of other languages too:

- Java and C# have nothing like this. They basically start from a deque as a collection, and add synchronization features to it; C++ STL doesn't even go that far. None of them do the Python thing where a Queue is a dedicated communication tool that just uses a collection as part of its implementation.

- Ruby (to my mild surprise) also does nothing like this.

- Clojure does just about the same thing as this proposal, yay.

- Go does approximately the same thing, just more finicky about practical usage. The producer can say `close(q)` and the consumers can say `for item := range q { process(item) }` which do exactly the same thing as the proposed Python equivalents, but it has some harsh limitations that are not applicable to Python. (Can't re-close a closed channel, can't query whether a channel is closed outside of retrieving an item).

------

To anticipate a couple more possible questions:

- What would this proposal do about multiple producers/consumers needing to jointly decide _when_ to close the queue?

Explicitly nothing.

The queue's state is either closed or not, and it doesn't care who closed it. It needs to interact correctly with multiple consumers and multiple producers, but once any one piece of code closes it, the correct interaction is acting like a closed queue for everybody.

When e.g. multiple producers need to arrange among themselves that the queue be closed only after all of them are done producing, then it's not the queue's job to coordinate _that_. They probably need a Semaphore or a more complex coordinator in charge of the closing (a sketch of one such coordinator follows at the end of this message). Yes, this too is a non-trivial-to-get-right situation, but trying to solve this one inside the Queue class seems like bloat.

- Why not make the Queue a context manager while you're at it? Most closeable classes do it.

I don't see that as a clear win in this case. Happy to add it if there's consensus in its favour, of course.

I think `with closing(q):` is much more expressive than `with q:` while still brief. The Queue is more versatile in use than files, database cursors and many other resource-type classes, so the meaning of a `with q:` would not be as immediately suggestive as with them. It also often needs to be passed around between creation and initial use (the whole point is that more than one piece of code has access to it) so the common idiom `with Queue() as q:` would be a slightly less natural fit than with resource-like classes.

- Should SimpleQueue do the same thing as Queue?

Yes, I would propose so and volunteer to implement it. Some details would be adapted to SimpleQueue's existing promises (put() can't fail) but I think this feature is very much in the spirit of SimpleQueue.

- Should multiprocessing.Queue do the same thing?

I think so, though I'm not proposing it.

It already has a close() method, whose meaning is very similar but not identical to (a subset of) the proposed threading.Queue.close's meaning (with resource-management effects not relevant to threading.Queue either way). I'd hope that this "not identical" part would not cause any problems in practice (lots of things aren't identical between those two Queues), but all hoping aside, maybe people more experienced than me can evaluate if that's really the case.

I also have no clear idea if the feature would be as valuable, and if the implementation would be as easy and clean as they are with threading.Queue.

- Should asyncio.Queue do the same thing?

Probably? I'm too unfamiliar with asyncio for my opinion on this to be of value. So I'm not proposing it.

------
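For illustration, the coordinator mentioned above could be as small as a counter that closes the queue once the last registered producer reports in. This is a sketch only (the class name and API are made up, and it assumes the proposed close()):

import threading

class ClosingCoordinator:
    # Closes the queue only after every one of `producer_count`
    # producers has called done().
    def __init__(self, q, producer_count):
        self._q = q
        self._lock = threading.Lock()
        self._remaining = producer_count

    def done(self):
        with self._lock:
            self._remaining -= 1
            if self._remaining == 0:
                self._q.close()

# Each producer would then end with something like:
#
#     try:
#         while some_condition:
#             q.put(generate_item())
#     finally:
#         coordinator.done()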
On 2018-10-21 18:58, Vladimir Filipović wrote:
Hi!
I originally submitted this as a pull request. Raymond Hettinger suggested it should be given a shakeout in python-ideas first.
https://github.com/python/cpython/pull/10018 https://bugs.python.org/issue35034
[snip] FTR, this has been discussed before: [Python-ideas] `__iter__` for queues? https://mail.python.org/pipermail/python-ideas/2010-January/006711.html
On Sun, Oct 21, 2018 at 8:45 PM MRAB <python@mrabarnett.plus.com> wrote:
FTR, this has been discussed before:
[Python-ideas] `__iter__` for queues? https://mail.python.org/pipermail/python-ideas/2010-January/006711.html
Thank you!

For the sake of clarity, I want to outline a few differences between that discussion and my proposal:

1. Much of the discussion there seemed to implicitly limit itself to consideration of FIFO queues. This proposal works cleanly for child classes too, including any (API-compliant) user-written children.

2. Throughout that discussion, iteration is the A feature, and closing is occasionally mentioned as a possible prerequisite. In this proposal, the A feature is closing, which enables sensible iteration (as a B feature) but is useful even if iteration isn't used.

3. There's naturally a lot of quick spitballing of various mutually-incompatible ideas there, whereas this is one rounded self-consistent proposal. Most of what I've come up with has already been anticipated there but it's all mixed up textually.

4. This proposal sidesteps a lot of the doubts and difficulties by just not using sentinels at all. Being closed is part of the queue's state that can be queried at any time, and will affect put() calls immediately, without waiting for a sentinel to float up to the front.

(With recognition that your (MRAB's) message towards that thread's end already proposed the same approach.)
On 21Oct2018 21:19, Vladimir Filipović <hemflit@gmail.com> wrote:
On Sun, Oct 21, 2018 at 8:45 PM MRAB <python@mrabarnett.plus.com> wrote:
FTR, this has been discussed before: [Python-ideas] `__iter__` for queues? https://mail.python.org/pipermail/python-ideas/2010-January/006711.html
Thank you!
Hmm, yes. My post there is this one:
https://mail.python.org/pipermail/python-ideas/2010-January/006716.html

I want to point out that in my code the single consumer of a Queue is incredibly common, so common that I can't off the top of my head think of _any_ uses of Queue directly: I _always_ make an IterableQueue and simply have the consumer iterate over the iterable queue.

This is _exactly_ like Vladimir's proposal to my mind: my IterableQueue is iterable, and has a .close method just like his (it prevents further puts and indicates end of stream), mediated with a sentinel internally. Arbitrary number of putters, _usually_ only one getter, but of course there's no need for that.

So to my mind his proposal is very simple and sensible, and matches almost universally my own use of Queues.

Cheers, Cameron Simpson <cs@cskk.id.au>
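A minimal version of the IterableQueue pattern described here might look like the following. This is a sketch, not the actual class (a real one would also lock around the closed flag):

import queue

class IterableQueue(queue.Queue):
    # close() forbids further put()s and marks end-of-stream with an
    # internal sentinel; iteration stops once the sentinel is seen.

    _SENTINEL = object()

    def __init__(self, maxsize=0):
        super().__init__(maxsize)
        self._closed = False

    def put(self, item, block=True, timeout=None):
        if self._closed:
            raise ValueError("put() on a closed IterableQueue")
        super().put(item, block, timeout)

    def close(self):
        self._closed = True
        super().put(self._SENTINEL)       # bypasses the closed check above

    def __iter__(self):
        while True:
            item = self.get()
            if item is self._SENTINEL:
                super().put(self._SENTINEL)   # re-post for any other getters
                break
            yield item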
On 10/21/2018 2:42 PM, MRAB wrote:
On 2018-10-21 18:58, Vladimir Filipović wrote:
Hi!
I originally submitted this as a pull request. Raymond Hettinger suggested it should be given a shakeout in python-ideas first.
https://github.com/python/cpython/pull/10018 https://bugs.python.org/issue35034
The proposed close method would only half-close the queue: closed to puts, open to gets (but perhaps close completely when the last item is gotten).
FTR, this has been discussed before: [Python-ideas] `__iter__` for queues? https://mail.python.org/pipermail/python-ideas/2010-January/006711.html
Worth reading in relation to the new proposal. One person reported having an IterableQueue with close that is probably similar to the current proposal. -- Terry Jan Reedy
Hi Vladimir,

It's great to see people revisiting these old stdlib tools. Closure tracking is definitely a big point of awkwardness for Queues. In Trio we started with a straight copy of threading.Queue, and this turned out to be a major friction point for users. We just deprecated our version of Queue and replaced it with a new design.

Our new thing is probably more radical than you want to get in the stdlib (we ended up splitting the object into two pieces, a sender object and a receiver object), but you might find the discussions interesting:

Manual: https://trio.readthedocs.io/en/latest/reference-core.html#using-channels-to-...

A more minimal proposal to add closure tracking to trio.Queue: https://github.com/python-trio/trio/pull/573

Follow-up issue with design questions we're still thinking about (also links to earlier design discussions): https://github.com/python-trio/trio/issues/719

We only started shipping this last week, so we're still getting experience with it.

-n

On Sun, Oct 21, 2018, 10:59 Vladimir Filipović <hemflit@gmail.com> wrote:
[snip]
On Sun, 21 Oct 2018 19:58:05 +0200 Vladimir Filipović <hemflit@gmail.com> wrote:
To anticipate a couple more possible questions:
- What would this proposal do about multiple producers/consumers needing to jointly decide _when_ to close the queue?
Explicitly nothing.
The queue's state is either closed or not, and it doesn't care who closed it. It needs to interact correctly with multiple consumers and multiple producers, but once any one piece of code closes it, the correct interaction is acting like a closed queue for everybody.
Ah. This is the one statement that makes me favorable to this idea. When there is a single consumer, it's easy enough to send a sentinel. But when there are multiple consumers, suddenly you must send exactly the right number of sentinels (which means you also have to carefully keep track of their number, which isn't always easy). There's some delicate code doing exactly that in concurrent.futures.
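Concretely, the sentinel-counting version looks something like this (the work loop and item source are stand-ins) - get NUM_CONSUMERS wrong and a consumer either blocks forever or quits early:

import queue
import threading

SENTINEL = object()
NUM_CONSUMERS = 4
q = queue.Queue()

def consumer():
    while True:
        item = q.get()
        if item is SENTINEL:
            break                      # each consumer swallows exactly one sentinel
        print("processed", item)       # stand-in for real work

for _ in range(NUM_CONSUMERS):
    threading.Thread(target=consumer).start()

for item in range(20):                 # stand-in for the real producer
    q.put(item)
for _ in range(NUM_CONSUMERS):         # one sentinel per consumer
    q.put(SENTINEL)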
- Should multiprocessing.Queue do the same thing?
I think so, though I'm not proposing it.
It already has a close() method, whose meaning is very similar but not identical to (a subset of) the proposed threading.Queue.close's meaning (with resource-management effects not relevant to threading.Queue either way).
Not really similar, unfortunately. mp.Queue.close() isn't a logical thing, but releases the queue's internal resources. It doesn't signal consumers that the producers have finished with the queue.

Perhaps if you renamed close() to something else ("put_eof" or "put_end" perhaps?) that would allow porting it to mp.Queue?

Regards

Antoine.
On 2018-10-21 22:30, Antoine Pitrou wrote:
On Sun, 21 Oct 2018 19:58:05 +0200 Vladimir Filipović <hemflit@gmail.com> wrote:
To anticipate a couple more possible questions:
- What would this proposal do about multiple producers/consumers needing to jointly decide _when_ to close the queue?
Explicitly nothing.
The queue's state is either closed or not, and it doesn't care who closed it. It needs to interact correctly with multiple consumers and multiple producers, but once any one piece of code closes it, the correct interaction is acting like a closed queue for everybody.
Ah. This is the one statement that makes me favorable to this idea. When there is a single consumer, it's easy enough to send a sentinel. But when there are multiple consumers, suddenly you must send exactly the right number of sentinels (which means you also have to careful keep track of their number, which isn't always easy). There's some delicate code doing exactly that in concurrent.futures.
You don't need more than one sentinel. When a consumer sees the sentinel, it just needs to put it back for the other consumers. [snip]
On Sun, Oct 21, 2018, 16:48 MRAB <python@mrabarnett.plus.com> wrote:
On 2018-10-21 22:30, Antoine Pitrou wrote:
On Sun, 21 Oct 2018 19:58:05 +0200 Vladimir Filipović <hemflit@gmail.com> wrote:
To anticipate a couple more possible questions:
- What would this proposal do about multiple producers/consumers needing to jointly decide _when_ to close the queue?
Explicitly nothing.
The queue's state is either closed or not, and it doesn't care who closed it. It needs to interact correctly with multiple consumers and multiple producers, but once any one piece of code closes it, the correct interaction is acting like a closed queue for everybody.
Ah. This is the one statement that makes me favorable to this idea. When there is a single consumer, it's easy enough to send a sentinel. But when there are multiple consumers, suddenly you must send exactly the right number of sentinels (which means you also have to careful keep track of their number, which isn't always easy). There's some delicate code doing exactly that in concurrent.futures.
You don't need more than one sentinel. When a consumer sees the sentinel, it just needs to put it back for the other consumers.
I'm not sure if this is an issue the way Queue is used in practice, but in general you have to be careful with this kind of circular flow because if your queue communicates backpressure (which it should) then circular flows can deadlock. -n
On 21Oct2018 18:06, Nathaniel Smith <njs@pobox.com> wrote:
On Sun, Oct 21, 2018, 16:48 MRAB <python@mrabarnett.plus.com> wrote:
On 2018-10-21 22:30, Antoine Pitrou wrote:
Ah. This is the one statement that makes me favorable to this idea. When there is a single consumer, it's easy enough to send a sentinel. But when there are multiple consumers, suddenly you must send exactly the right number of sentinels (which means you also have to careful keep track of their number, which isn't always easy). There's some delicate code doing exactly that in concurrent.futures.
You don't need more than one sentinel. When a consumer sees the sentinel, it just needs to put it back for the other consumers.
Yes, this is exactly what my own IterableQueue does.
I'm not sure if this is an issue the way Queue is used in practice, but in general you have to be careful with this kind of circular flow because if your queue communicates backpressure (which it should) then circular flows can deadlock.
Haven't come across this myself. A closeable queue doesn't seem circular to me. The handling of the sentinel is internal to the IterableQueue, so external users never see it. Cheers, Cameron Simpson <cs@cskk.id.au>
On Sun, Oct 21, 2018 at 6:08 PM Nathaniel Smith <njs@pobox.com> wrote:

I'm not sure if this is an issue the way Queue is used in practice, but in general you have to be careful with this kind of circular flow because if your queue communicates backpressure (which it should) then circular flows can deadlock.

Nathaniel, would you be able to elaborate more on the issue of backpressure? I think a lot of people here are not really familiar with the concept and its importance, and it changes how you have to think about queues and the like.

--
--Guido van Rossum (python.org/~guido)
Nathaniel, thank you for the pointer to Trio. Its approach seems very robust. I'm relieved to see that a solution so fundamentally rebuilt has also settled on very similar semantics for its `.close_put()`.

I think your `.clone()` idiom is very clear when the communication objects are treated as distinct endpoints. Something with similar effects on closing (not necessarily similar in idiom) would probably be a neat enhancement to the standard Queue, though if I was making that, I'd do it in an external package.

------

Antoine, regarding multiprocessing.Queue:

The similarity of meaning behind closing that I was getting at is that mp.Q.close() means "I am done writing to this queue, and I don't care about the rest of you", whereas the proposed meaning of q.Q.close() is "Listen up, we are all done writing to this queue".

I don't know yet that this difference necessarily creates a true incompatibility. That the effects (in terms of eager OS-resource cleanup) are different shouldn't be a problem in itself - every implementation does the right thing for itself.

------

On Mon, Oct 22, 2018 at 2:03 AM Terry Reedy <tjreedy@udel.edu> wrote:
The proposed close method would only half-close the queue: closed to puts, open to gets (but perhaps close completely when the last item is gotten.
In other words: in this proposal, there is no such thing as "closed for retrieval". A closed queue means exactly that it's closed for insertion.

Retrieval becomes permanently impossible once the queue is closed and exhausted, and that's a condition that get() must treat correctly and usefully, but calling that condition "closed / completely closed / closed for retrieval" would muddle up the terminology. In the proposed implementation I've called it "exhausted", a name I've picked up god-knows-when and from god-knows-where, but it seemed reasonable.

------

Regarding sentinels in general: They are a limited-purpose solution, and this proposal should make them unnecessary in 99% of the cases.

Firstly, they only naturally apply to FIFO queues. You could hack your use of LIFO and priority queues to also rely on sentinels, but it's very kludgey in the general cases, not a natural fit, and not generalizable to user-created children of Queue (which Queue otherwise explicitly aspires to support).

Secondly, they only work when the producer is the one driving the flow and notifying the consumer that "no more is forthcoming". They don't work when the producer is the one who needs to be notified.

Thirdly, they're a potential cause of deadlocks when the same threads act as both producers and consumers. (E.g. in a parallelized breadth-first-search.) I'm sure this is the circular flow that Nathaniel was referring to, but I'll let him detail it more or correct me.

Fourthly, they don't make it easy to query the Queue about whether it's closed. This probably isn't a big deal admittedly.

Sure, when sentinels are adequate, they're adequate. This proposal aims to be more general-purpose than that.
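To illustrate the first point: with a PriorityQueue, a plain sentinel only works if it is guaranteed to sort after every real item, which already forces an entry convention on everyone touching the queue. A sketch of that kludge:

import queue

END_PRIORITY = float("inf")        # must compare greater than any real priority
END = (END_PRIORITY, None)         # "no more items" marker

pq = queue.PriorityQueue()
pq.put((2, "second"))
pq.put(END)                        # enqueued early, but sorts after everything
pq.put((1, "first"))

while True:
    priority, item = pq.get()
    if priority == END_PRIORITY:
        pq.put(END)                # re-post so any other consumer sees it too
        break                      # (two such markers in one queue would make
                                   # heapq compare None against None: TypeError)
    print(item)                    # -> first, second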
On Sun, Oct 21, 2018 at 8:31 PM, Guido van Rossum <guido@python.org> wrote:
On Sun, Oct 21, 2018 at 6:08 PM Nathaniel Smith <njs@pobox.com> wrote:
I'm not sure if this is an issue the way Queue is used in practice, but in general you have to be careful with this kind of circular flow because if your queue communicates backpressure (which it should) then circular flows can deadlock.
Nathaniel, would you be able to elaborate more on the issue of backpressure? I think a lot of people here are not really familiar with the concepts and its importance, and it changes how you have to think about queues and the like.
Sure.

Suppose you have some kind of producer connected to some kind of consumer. If the producer consistently runs faster than the consumer, what should happen? By default with queue.Queue, there's no limit on its internal buffer, so if the producer puts, say, 10 items per second, and the consumer only gets, say, 1 item per second, then the internal buffer grows by 9 items per second. Basically you have a memory leak, which will eventually crash your program. And well before that, your latency will become terrible. How can we avoid this?

I guess we could avoid this by carefully engineering our systems to make sure that producers always run slower than consumers, but that's difficult and fragile. Instead, what we usually want to do is to dynamically detect when a producer is outrunning a consumer, and apply *backpressure*. (It's called that b/c it involves the consumer "pushing back" against the producer.) The simplest way is to put a limit on how large our Queue's buffer can grow, and make put() block if it would exceed this limit. That way producers are automatically slowed down, because they have to wait for the consumer to drain the buffer before they can continue executing.

This simple approach also works well when you have several tasks arranged in a pipeline like A -> B -> C, where B gets objects from A, does some processing, and then puts new items on to C. If C is running slow, this will eventually apply backpressure to B, which will block in put(), and then since B is blocked and not calling get(), then A will eventually get backpressure too. In fact, this works fine for any acyclic network topology.

If you have a cycle though, like A -> B -> C -> A, then you at least potentially have the risk of deadlock, where every task is blocked in put(), and can't continue until the downstream task calls get(), but it never will because it's blocked in put() too. Sometimes it's OK and won't deadlock, but you need to think carefully about the details to figure that out. If a task gets and puts to the same queue, like someone suggested doing for the sentinel value upthread, then that's a cycle and you need to do some more analysis. (I guess if you have a single sentinel value, then queue.Queue is probably OK, since the minimal buffer size it supports is 1? So when the last thread get()s the sentinel, it knows that there's at least 1 free space in the buffer, and can put() it back without blocking. But if there's a risk of somehow getting multiple sentinel values, or if Queues ever gain support for zero-sized buffers, then this pattern could deadlock.)

There's a runnable example here: https://trio.readthedocs.io/en/latest/reference-core.html#buffering-in-chann...

And I also wrote about backpressure and asyncio here: https://vorpus.org/blog/some-thoughts-on-asynchronous-api-design-in-a-post-a...

-n

--
Nathaniel J. Smith -- https://vorpus.org
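The simplest bounded-buffer form of backpressure described above can be seen with today's queue.Queue alone; a small runnable sketch (the sizes and sleeps are arbitrary):

import queue
import threading
import time

q = queue.Queue(maxsize=10)        # the bound is what creates backpressure

def producer():
    for i in range(100):
        q.put(i)                   # blocks whenever the buffer is full, so the
                                   # producer is automatically slowed to the
                                   # consumer's pace
    q.put(None)                    # single sentinel: end of stream

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        time.sleep(0.01)           # pretend to be slower than the producer

threading.Thread(target=producer).start()
threading.Thread(target=consumer).start()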
On 2018-10-23 06:13, Nathaniel Smith wrote:
On Sun, Oct 21, 2018 at 8:31 PM, Guido van Rossum <guido@python.org> wrote:
On Sun, Oct 21, 2018 at 6:08 PM Nathaniel Smith <njs@pobox.com> wrote:
I'm not sure if this is an issue the way Queue is used in practice, but in general you have to be careful with this kind of circular flow because if your queue communicates backpressure (which it should) then circular flows can deadlock.
Nathaniel, would you be able to elaborate more on the issue of backpressure? I think a lot of people here are not really familiar with the concepts and its importance, and it changes how you have to think about queues and the like.
Sure.
Suppose you have some kind of producer connected to some kind of consumer. If the producer consistently runs faster than the consumer, what should happen? By default with queue.Queue, there's no limit on its internal buffer, so if the producer puts, say, 10 items per second, and the consumer only gets, say, 1 item per second, then the internal buffer grows by 9 items per second. Basically you have a memory leak, which will eventually crash your program. And well before that, your latency will become terrible. How can we avoid this?
[snip] The purpose of the sentinel is to tell the consumer(s) that there are no more items, that the producer has finished producing. The sentinel is the only item in the queue, there will be no more items after it, and backpressure is not an issue.
participants (7)

- Antoine Pitrou
- Cameron Simpson
- Guido van Rossum
- MRAB
- Nathaniel Smith
- Terry Reedy
- Vladimir Filipović