[Twisted-Python] linux kernel 2.6.11-rc broke twisted process pipes
Hello everyone, Please read the attached emails. I would like to understand exactly what broke twisted, so that we're sure the new 2.6.11 kernel will work reliably with twisted. This code especially will never trigger with 2.6.11 because we'll never set POLLIN|POLLOUT at the same time for writeable fd. def doRead(self): """The only way this pipe can become readable is at EOF, because the child has closed it. """ fd = self.fd r, w, x = select.select([fd], [fd], [], 0) if r and w: return CONNECTION_LOST Could you answer why anybody should call doRead on a writeable fd? Where is it written that at EOF a _write_ pipe channel becomes readable? Is it ok that "r and w" will never trigger at the same time anymore, right? Was just that a superflous assumptions? Note that with 2.6.8 and all previous linux kernels, "r and w" could be == True in normal conditions too if the pipe write buffer was empty (old kernel was setting pollin always for no good reason). What I suspect is that the above is a superflous check that was working by luck with older kernels (becasue we could set in and out at the same time even without a disconnect event), and that all you care about is to get the wakeup from the write or disconnect events. And in turn the new linux 2.6.11-rc should not work by luck anymore. Comments appreciated. Thanks.
On Feb 28, 2005, at 3:14 PM, Andrea Arcangeli wrote:
I would like to understand exactly what broke twisted, so that we're sure the new 2.6.11 kernel will work reliably with twisted.
This code especially will never trigger with 2.6.11 because we'll never set POLLIN|POLLOUT at the same time for writeable fd.
def doRead(self): """The only way this pipe can become readable is at EOF, because the child has closed it. """ fd = self.fd r, w, x = select.select([fd], [fd], [], 0) if r and w: return CONNECTION_LOST
Could you answer why anybody should call doRead on a writeable fd?
Where is it written that at EOF a _write_ pipe channel becomes readable?
Is it ok that "r and w" will never trigger at the same time anymore, right? Was just that a superflous assumptions?
Note that with 2.6.8 and all previous linux kernels, "r and w" could be == True in normal conditions too if the pipe write buffer was empty (old kernel was setting pollin always for no good reason).
What I suspect is that the above is a superflous check that was working by luck with older kernels (becasue we could set in and out at the same time even without a disconnect event), and that all you care about is to get the wakeup from the write or disconnect events. And in turn the new linux 2.6.11-rc should not work by luck anymore.
I agree this code is somewhat suboptimal. However, I do not agree that it works only by luck. In response to your main question of "why is it checking for readability at all", there is a good answer: to get the disconnect event without trying to write data. Select doesn't have a "err" fdset, so you have to select for either readability or writability. You can't select for writability when you have no data to write, or else you'll be continuously waking up. On all UNIX OSes that I know of, write pipes show up as readable in select when the other side has closed, for just this reason. I don't know if it's documented in any specs, but as far as I can tell, it's universally true. Breaking this would be a Bad Thing, for I suspect more apps than just Twisted. Of course, if it were that simple, doRead would just be implemented as "return CONNECTION_LOST". From my testing, that'd even work on BSDs. However, linux adds its own little wrinkle to the problem. On linux, the observed behavior seems to be that only one write can be outstanding at a time -- if there is data in the buffer, writable will be false and readable will be true. Otherwise, the inverse. This is quite silly...but it's what happens. As far as I can tell, at no time are both true, except when the pipe is disconnected. On BSD, a write pipe is simply never readable until the reader is closed, which is a lot more sensible. Also note, that bit of code is unnecessary if you're using pollreactor rather than selectreactor. That is, I think it should be fine with twisted if the pipe is just POLLHUP when it is disconnected rather than POLLHUP|POLLIN|POLLOUT. James
BTW, I'm sending as andrea@cpushare.com but this email is really meant to be from andrea@suse.de, the lists are set so that I can't post unless I'm subscribed and to avoid unnecessary email load, I'm avoiding to subscribe two of my accounts to the same list. On Mon, Feb 28, 2005 at 05:52:05PM -0500, James Y Knight wrote:
I agree this code is somewhat suboptimal. However, I do not agree that it works only by luck.
The thing is that readable and writeable (in select(2) terms) could be returned from linux 2.6.9 and all previous kernels immediatly without sleeping, even if there was no disconnect event. You only needed the write buffer empty for it to return POLLIN|POLLOUT without sleeping. That's what I mean with "by luck". It wasn't twisted mistake but a linux mistake apparently. That check of "r and w" definitely could be true even if the other side of the pipe didn't disconnect in past linux.
In response to your main question of "why is it checking for readability at all", there is a good answer: to get the disconnect event without trying to write data. Select doesn't have a "err" fdset,
Ok, btw even linux always returns the fd in both readable and writeable when the other side of the pipe disconnects. This because we raise a POLLERR and this is the mask to check when a fd is readable/writeable #define POLLIN_SET (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR) #define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR) So when POLLERR is raised, the fd is returned readable and writeable from select. Current kernel code is this: static unsigned int pipe_poll(struct file *filp, poll_table *wait) { unsigned int mask; struct inode *inode = filp->f_dentry->d_inode; struct pipe_inode_info *info = inode->i_pipe; int nrbufs; poll_wait(filp, PIPE_WAIT(*inode), wait); /* Reading only -- no need for acquiring the semaphore. */ nrbufs = info->nrbufs; mask = 0; if (filp->f_mode & FMODE_READ) { mask = (nrbufs > 0) ? POLLIN | POLLRDNORM : 0; if (!PIPE_WRITERS(*inode) && filp->f_version != PIPE_WCOUNTER(*inode)) mask |= POLLHUP; } if (filp->f_mode & FMODE_WRITE) { mask |= (nrbufs < PIPE_BUFFERS) ? POLLOUT | POLLWRNORM : 0; if (!PIPE_READERS(*inode)) mask |= POLLERR; } return mask; } With this new kernel code, a write pipe will _never_ return readable, unless it's disconnected, and it will match the the current twisted code as far as I can tell. As you can see we never set POLLOUT for a write-pipe (opened in WRONLY mode), and in turn the only way to end up in the POLLIN_SET is that POLLERR is set.
so you have to select for either readability or writability. You can't select for writability when you have no data to write, or else you'll be continuously waking up. On all UNIX OSes that I know of, write pipes show up as readable in select when the other side has closed, for just this reason. I don't know if it's documented in any specs, but as far as I can tell, it's universally true. Breaking this would be a Bad Thing, for I suspect more apps than just Twisted.
Linux never returned any information about the status of the other side of the pipe via the readable information, because readable was always returned true, since POLLIN was set unconditionally by the pipe_poll code. So we're not going to break anything in that sense, it was already broken and infact we're fixing it right now ;). POLLIN was providing absolutely zero information because it was always set unconditionally for both readers and writers. That was wrong and that's why I changed it and this seem to make the doRead code valid for the first time in linux.
Of course, if it were that simple, doRead would just be implemented as "return CONNECTION_LOST". From my testing, that'd even work on BSDs. However, linux adds its own little wrinkle to the problem. On linux, the observed behavior seems to be that only one write can be outstanding at a time -- if there is data in the buffer, writable will be false and readable will be true. Otherwise, the inverse. This is
It's never the inverse since pollin was always forced to be set, both for readers and writers, so a write fd was always returned as readable unconditionally. But you're perfectly right that if there is data in the buffer writeable was false and readable was true. (readable provides no info in linux)
quite silly...but it's what happens. As far as I can tell, at no time are both true, except when the pipe is disconnected. On BSD, a write pipe is simply never readable until the reader is closed, which is a lot more sensible. Also note, that bit of code is unnecessary if you're using pollreactor rather than selectreactor. That is, I think it should be fine with
This is great news. However note that the 2.6.11-rc5 official kernel broke my app with the pollreactor too. So something else must have been broken for the pollreactor too making the same assumption that doRead did, and it's working fine again with my pipe patch infact.
twisted if the pipe is just POLLHUP when it is disconnected rather than POLLHUP|POLLIN|POLLOUT.
It's POLLERR not POLLHUP. POLLHUP is set only when the "writer" side disconnected and you're listening to a reader fd. POLLERR is instead returned when the "reader" disconnected and you're listening to a writer fd. In select terms it means when the reader disconnects the fd will be both readable and writeable. And when the writer disconnects the fd will be reported only as readable. In poll terms it means with linux _only_ POLLERR will be set when the read disconnects, and POLLIN will never ever be set on a WRONLY pipe. So with the new fixes it seems linux does the right thing, and we at leats mimic the select behaviour that twisted expects, I hope we mimic the poll behaviour that twisted expects too. I found the openbsd implementation of pipe_poll here: http://fxr.watson.org/fxr/source/kern/sys_pipe.c?v=OPENBSD#L661 671 if (events & (POLLIN | POLLRDNORM)) { 672 if ((rpipe->pipe_buffer.cnt > 0) || 673 (rpipe->pipe_state & PIPE_EOF)) 674 revents |= events & (POLLIN | POLLRDNORM); 675 } The way I read it, even openbsd will report the fd as readable regardless if the channel is disconnected, if the buffer has something into it. Anyway linux won't do that anymore, a write fd will be now reported as readable only if the reader has disconnected, and so you're safe to assume the reader disconnected if selects returns readable && writeable. So I think we should be fine now and no changes are required in twisted at least for the select reactor side. Please double check the pollreactor side too, we'll never return POLLIN anymore for a WRONLY pipe fd. I'm going to update a semi-productive system running twisted servers using processes too, with the new pipe_poll code too to see what happens (that's the good thing of not being fully productive yet, so I can experiment a bit more ;).
On Tue, Mar 01, 2005 at 01:37:23AM +0100, Andrea Arcangeli wrote:
I'm going to update a semi-productive system running twisted servers using processes too, with the new pipe_poll code too to see what happens (that's the good thing of not being fully productive yet, so I can experiment a bit more ;).
System is up and running fine with this patch against 2.6.8 that should apply to most l-k out there, matching latest 2.6.11. This will bring linux in sync with the twisted expectations of "r && w" meaning 'reader disconnected' with select, it'll change the behaviour of poll not to return POLLIN set unconditionally, and more specifically it'll never return POLLIN for a WRONLY fd, and it'll never return POLLOUT for a RDONLY fd. --- x/fs/pipe.c.~1~ 2004-08-25 02:47:51.000000000 +0200 +++ x/fs/pipe.c 2005-03-01 02:10:50.000000000 +0100 @@ -300,14 +300,18 @@ pipe_poll(struct file *filp, poll_table poll_wait(filp, PIPE_WAIT(*inode), wait); - /* Reading only -- no need for acquiring the semaphore. */ - mask = POLLIN | POLLRDNORM; - if (PIPE_EMPTY(*inode)) - mask = POLLOUT | POLLWRNORM; - if (!PIPE_WRITERS(*inode) && filp->f_version != PIPE_WCOUNTER(*inode)) - mask |= POLLHUP; - if (!PIPE_READERS(*inode)) - mask |= POLLERR; + mask = 0; + if (filp->f_mode & FMODE_READ) { + mask |= PIPE_LEN(*inode) ? POLLIN | POLLRDNORM : 0; + if (!PIPE_WRITERS(*inode) && filp->f_version != PIPE_WCOUNTER(*inode)) + mask |= POLLHUP; + } + + if (filp->f_mode & FMODE_WRITE) { + mask |= PIPE_EMPTY(*inode) ? POLLOUT | POLLWRNORM : 0; + if (!PIPE_READERS(*inode)) + mask |= POLLERR; + } return mask; }
On Feb 28, 2005, at 7:37 PM, Andrea Arcangeli wrote:
It's never the inverse since pollin was always forced to be set, both for readers and writers, so a write fd was always returned as readable unconditionally.
I'm glad we all agree as to what the new behavior should be, but you're incorrect about the extent of the old brokenness. Simple tests prove that POLLIN was not forced to be true always: On a debian 2.6.9-1-k7 kernel: ==== import os,select r,w=os.pipe() p=select.poll() p.register(w, select.POLLIN|select.POLLOUT) print select.select([w],[w],[w],0) print p.poll(0) os.write(w, "asdf") print select.select([w],[w],[w],0) print p.poll(0) os.close(r) print select.select([w],[w],[w],0) print p.poll(0) ==== outputs: ([], [4], []) [(4, 4)] (aka POLLOUT) ([4], [], []) [(4, 1)] (aka POLLIN) ([4], [4], []) [(4, 9)] (aka POLLIN|POLLERR)
It's POLLERR not POLLHUP. POLLHUP is set only when the "writer" side disconnected and you're listening to a reader fd. POLLERR is instead returned when the "reader" disconnected and you're listening to a writer fd.
Right. Twisted treats all of POLLERR/HUP/NVAL the same, anyways.
671 if (events & (POLLIN | POLLRDNORM)) { 672 if ((rpipe->pipe_buffer.cnt > 0) || 673 (rpipe->pipe_state & PIPE_EOF)) 674 revents |= events & (POLLIN | POLLRDNORM); 675 } The way I read it, even openbsd will report the fd as readable regardless if the channel is disconnected, if the buffer has something into it.
Nope -- Notice that rpipe and wpipe are backwards for the read and write fds. "rpipe" on the write fd won't ever have any data in it. (Except of course that pipes in BSD are bidirectional!). In BSD, everything is symmetric, so it'd be very difficult for a stupid bug like reporting the same status for both ends to occur.
Are you reall sure it can't be called with pollreactor?
Sorry for asking again but I really want to be sure select won't screw things up with >1024 fds open ;)
Yes, it will get called, because linux returns (returned?) POLLERR|POLLIN when the other side is closed. pollreactor doesn't assume the connection was lost immediately when POLLERR/etc is set, if POLLIN is also set, because POLLIN|POLLHUP means there is more data available to be read from the kernel buffer that was sent before the connection closed. And, since normal transport implementations of doRead actually do try to read, if it was lying about POLLIN, that's okay: the read syscall will fail, and CONNECTION_LOST will be returned from doRead. James
On Feb 28, 2005, at 10:40 PM, James Y Knight wrote:
I'm glad we all agree as to what the new behavior should be, but you're incorrect about the extent of the old brokenness. Simple tests prove that POLLIN was not forced to be true always:
Which makes sense given that the following says "=" not "|=". - mask = POLLIN | POLLRDNORM; - if (PIPE_EMPTY(*inode)) - mask = POLLOUT | POLLWRNORM; James
On Mon, Feb 28, 2005 at 10:49:17PM -0500, James Y Knight wrote:
On Feb 28, 2005, at 10:40 PM, James Y Knight wrote:
I'm glad we all agree as to what the new behavior should be, but you're incorrect about the extent of the old brokenness. Simple tests prove that POLLIN was not forced to be true always:
Which makes sense given that the following says "=" not "|=". - mask = POLLIN | POLLRDNORM; - if (PIPE_EMPTY(*inode)) - mask = POLLOUT | POLLWRNORM;
Ok that explains why you didn't get "r and w" set at the same time, so it wasn't working by luck. It was only the 2.6.11-rc changes that broke it and that could return "r and w" set at the same time. And now they're fixed again.
On Mon, Feb 28, 2005 at 10:40:00PM -0500, James Y Knight wrote:
Right. Twisted treats all of POLLERR/HUP/NVAL the same, anyways.
Ok.
Nope -- Notice that rpipe and wpipe are backwards for the read and write fds. "rpipe" on the write fd won't ever have any data in it. (Except of course that pipes in BSD are bidirectional!). In BSD,
I didn't know it was bidirectional, I thought the peer pointer pointed to the other side of the pipe like in linux, and not the other direction of the same side.
Yes, it will get called, because linux returns (returned?) POLLERR|POLLIN when the other side is closed. pollreactor doesn't
It return_ed_ POLLERR|POLLIN, correct. Now it returns only POLLERR for a wronly fd and only POLLHUP for a rdonly fd which seems even more correct since it will prevent the suprious doRead call with pollreactor.
assume the connection was lost immediately when POLLERR/etc is set, if POLLIN is also set, because POLLIN|POLLHUP means there is more data available to be read from the kernel buffer that was sent before the connection closed. And, since normal transport implementations of
Yep, except that's a wronly fd so calling doRead can only detect a disconnect that is already detected thanks to the pollerr.
doRead actually do try to read, if it was lying about POLLIN, that's okay: the read syscall will fail, and CONNECTION_LOST will be returned from doRead.
Ok. So I'll take the exception path. The exception should be ValueError: filedescriptor out of range in select(). Anyway with the new patched kernel I'm running and with latest 2.6.11 this shouldn't happen anymore because POLLIN isn't set ;). Thanks a lot for all the help, this clarified many things.
On Mon, Feb 28, 2005 at 05:52:05PM -0500, James Y Knight wrote:
Also note, that bit of code is unnecessary if you're using pollreactor rather than selectreactor. That is, I think it should be fine with
Are you reall sure it can't be called with pollreactor? I see this in the pollreactor code: if event & POLLIN: why = selectable.doRead() inRead = True Sorry for asking again but I really want to be sure select won't screw things up with >1024 fds open ;) thanks!
participants (3)
-
Andrea Arcangeli
-
Andrea Arcangeli
-
James Y Knight