Re: [pypy-dev] Mysterious IndexError in service running with PyPy
Where is the code that changes the size of self.heap? How do we know that len(self.heap) is constant? My guess is that some thread changes this, but l is not recomputed.
On 18 Dec 2017 6:59 PM, "hubo" <hubo@jiedaibao.com> wrote:
I'm reporting this issue to this mailing list, though I don't know whether it is related to PyPy, because it is really strange and I cannot reproduce it reliably. I hope someone may know the reason or have some pointers.
I'm running some SDN services written in Python with PyPy 5.3.0. The relevant code is here:
https://github.com/hubo1016/vlcp/blob/8022e3a3c67cf4305af503d507640a730ca394...
The full code is also in the repo, but it may be too complex to describe. The relevant piece is quite simple:
    def _siftdown(self, pos):
        temp = self.heap[pos]
        l = len(self.heap)
        while pos * 2 + 1 < l:
            cindex = pos * 2 + 1
            pt = self.heap[cindex]
            if cindex + 1 < l and self.heap[cindex+1][0] < pt[0]:
                cindex = cindex + 1
                pt = self.heap[cindex]
            if pt[0] < temp[0]:
                self.heap[pos] = pt
                self.index[pt[1]] = pos
            else:
                break
            pos = cindex
        self.heap[pos] = temp
        self.index[temp[1]] = pos

It is a simple heap operation. The service uses a heap to process timers. When the service is not busy, it usually runs this piece of code several times per minute.
I have 32 servers running this service. They were quite stable for about three months, but one day one of the services crashed on line 100, reporting an IndexError: pt = self.heap[cindex]
As you can see, cindex = pos * 2 + 1, which is tested by the while precondition just two lines before. There are no multi-threading issues here either, because this piece of code always runs in the same thread. So in theory this should not be possible.
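A minimal sketch of the one way that precondition can still fail, assuming something shrinks the list after l has been computed. The function name siftdown_with_interference and the interfere callback are hypothetical stand-ins for whatever code might touch the heap mid-loop; they are not part of vlcp, and the index bookkeeping is left out for brevity.

```python
def siftdown_with_interference(heap, pos, interfere):
    """Same loop shape as _siftdown, but calls interfere() once per
    iteration to stand in for code that mutates the heap mid-loop."""
    temp = heap[pos]
    l = len(heap)                  # cached once, as in the original
    while pos * 2 + 1 < l:         # checked against the cached length
        interfere()                # hypothetical: something else runs here
        cindex = pos * 2 + 1
        pt = heap[cindex]          # IndexError if the heap shrank meanwhile
        if cindex + 1 < l and heap[cindex + 1][0] < pt[0]:
            cindex = cindex + 1
            pt = heap[cindex]
        if pt[0] < temp[0]:
            heap[pos] = pt
        else:
            break
        pos = cindex
    heap[pos] = temp


# Without interference the sift-down behaves normally.
quiet_heap = [(5, 'a'), (1, 'b'), (2, 'c')]
siftdown_with_interference(quiet_heap, 0, lambda: None)

# With a callback that removes entries, the cached l is stale.
noisy_heap = [(5, 'a'), (1, 'b'), (2, 'c')]

def shrink():
    del noisy_heap[1:]

try:
    siftdown_with_interference(noisy_heap, 0, shrink)
    outcome = "no error"
except IndexError:
    outcome = "IndexError"
```

The point of the sketch is only that the while test guards against the length seen at entry, not the current length, so the crash on that line implies the list was modified between the two.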
Only the following facts are known about this issue:
1. It reproduces, though quite rarely. I've seen this kind of crash 4 times, each on a different machine, so it should not be related to hardware issues. Since I've got 32 servers, it might take more than a year to reproduce on a single server.
2. It does not seem to be related to load. All of the crashes happened at night, when there are few requests. Only some cleanup tasks are running at fixed intervals.
The services are running PyPy 5.3.0. I've upgraded a few of them to 5.9, but it will take a long time to validate whether this still happens. It has not been validated on CPython either. I'm also trying to collect more debugging information for this issue, but it is very hard since it rarely reproduces.
It is not a serious issue. It can be worked around with an auto-restart, but I'm searching for the cause.
2017-12-18 ------------------------------ hubo
_______________________________________________ pypy-dev mailing list pypy-dev@python.org https://mail.python.org/mailman/listinfo/pypy-dev
It is a reasonable guess, but this method is guaranteed to be called only in the main thread. Thread pool usage is very limited in this program because it is coroutine-based.
2017-12-22 hubo
In reply to: William ML Leslie <william.leslie.ttg@gmail.com>, 2017-12-19 20:14
Update: This issue has been reproduced on PyPy 5.9. I will collect runtime information (local variables, etc.) for hints.
2018-01-05 hubo
In reply to: "hubo" <hubo@jiedaibao.com>, 2017-12-22 09:55
Update: I think I found the cause. Happily, it is not a PyPy bug (although it is PyPy-related). Some generators call cleanup code in a "finally" clause, but on PyPy they are not collected until a GC runs (because PyPy does not use reference counting). Apparently, when they are collected by the GC, the "finally" clause is executed in a separate thread.
2018-05-22 hubo
In reply to: "hubo" <hubo@jiedaibao.com>, 2017-12-22 09:55
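A minimal sketch of the generator behavior described in the 2018-05-22 update, assuming a generator whose cleanup lives in a "finally" clause. The names timer_task and log are hypothetical and not taken from vlcp. On CPython the cleanup runs as soon as the last reference is dropped; on PyPy it is deferred until the garbage collector finalizes the generator, so it can run at a point unrelated to where the reference went away. Closing the generator explicitly is one possible way to make the timing deterministic on both interpreters.

```python
import gc

log = []

def timer_task():
    # Hypothetical generator standing in for a coroutine with cleanup code.
    try:
        yield
    finally:
        log.append("cleanup")

# Case 1: the generator is abandoned while suspended inside the try block.
g = timer_task()
next(g)
del g          # CPython: refcount hits zero, so finally runs right here.
               # PyPy: nothing happens yet; finally waits for the GC.
gc.collect()   # force a collection so the pending cleanup gets run
after_gc = list(log)

# Case 2: close the generator explicitly so cleanup runs at a known point.
g = timer_task()
next(g)
g.close()      # raises GeneratorExit inside the generator, running finally now
after_close = list(log)
```

If the cleanup code touches shared state such as the timer heap, the second form keeps that mutation at a point the program controls instead of whenever a collection happens to occur.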
participants (2): hubo, William ML Leslie