[lxml-dev] Critical crashes on Windows under high load
Hi folks,
We have an incredibly frustrating, show-stopping problem using lxml (under Deliverance, in front of a repoze.zope2 pipeline serving up a Plone site) on Windows. Under high load, the Python process crashes. There is no traceback in the log, so I can't identify where it actually happens, but we get a Windows error dialogue saying python.exe (or pythonservice.exe if running as a Windows service) has crashed in etree.pyd (at some binary address, no line numbers or function references).
The Deliverance (0.3/trunk) rules use fairly complex xpath expressions. We're trying to simplify these, but there's nothing obviously wrong, and in any case it shouldn't crash.
We've tried to run both multi-threaded and single-threaded 'paster' processes: the problem happens with both. I did read somewhere that it's possible to build a single-threaded lxml egg (?), but I haven't found one.
We would be incredibly grateful for any help with (a) debugging and (b) resolving this. At present, we're having to fight a lot of nervousness regarding the production-worthiness of our Deliverance/lxml based solution, which is rather unfortunate. :-(
Cheers,
Martin
Martin Aspeli, 02.11.2009 03:58:
We have an incredibly frustrating, show-stopping problem using lxml
I assume you are using lxml 2.2.2?
[...] on Windows.
And now we have two problems...
I do not build the Windows binaries myself, so I have no idea if there are any debug symbols in there. Would certainly be nice to have them.
XPath shouldn't crash by itself, so I'd rather focus the debugging on the other things you are doing. Are you running the XPath queries against trees that are being modified concurrently? Did you check for memory problems? Could you try to come up with a stripped down set of operations that your code does using lxml? And which of them happen concurrently?
We've tried to run both multi-threaded and single-threaded 'paster' processes: the problem happens with both.
Does that mean that this happens even if you run everything single-threaded?
Certainly. Stefan
Stefan Behnel wrote:
Yes, though we also tried the latest in the 2.0.x line as a downgrade for a bit. Same problem.
Who does? Sidnei?
It's possible that Deliverance is doing something evil here, but I kind of doubt it. As far as I can tell, this is a Windows-specific problem, or at least no-one seems to have reported it on Unix.
Did you check for memory problems?
How would I do that?
Could you try to come up with a stripped down set of operations that your code does using lxml? And which of them happen concurrently?
I'm not sure. It'd be difficult. The crash dialogue doesn't tell me where in lxml the problem is (since there's no stack trace). Deliverance is doing a fair amount of work with lxml (evaluating xpath expressions, parsing the two input trees (theme + content), modifying the output tree). So far, we've not been able to pinpoint exactly where it happens, or if it's even deterministic.
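On the missing stack trace: a minimal sketch, assuming a Python newer than the one this thread was written against (3.3 or later, where the standard library includes `faulthandler`). It asks the interpreter to write the Python-level tracebacks of all threads to a log file when the process dies from a fatal error. The function name and log path are hypothetical, and the output shows only the Python stack, not the C frames inside etree.pyd.

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Dump Python tracebacks of all threads to log_path on a fatal error."""
    # The file has to stay open for the life of the process: faulthandler
    # writes to its file descriptor from inside the fault handler.
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at process start-up, before the WSGI pipeline is built, should be enough to narrow a crash down to the Python call that was active at the time.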
We put the paster processes under which the WSGI pipeline runs into single threaded mode (or at least, we set the threadpool size of each process to 1), so in theory, there shouldn't be any concurrency. I don't know if that's actually the case, though.
I guess the most constructive thing would be if I could find some better way of debugging this. People closer to the project (and server) where this is happening are working on a load test suite that can reproduce this reliably, though it's pretty much trial and error.
The problem is that as of right now, I don't know what I'd do next even if they did make it occur reliably. I don't understand how lxml is built, how Cython works, how to write C extensions, or how to do C development on Windows. It's a loooong time since I wrote C/C++ and that was on Linux. ;-)
Martin
Martin Aspeli, 02.11.2009 15:24:
Yes.
So I assume you ran similar load tests under Unix systems?
Did you check for memory problems?
How would I do that?
I mean, does the process's memory usage grow uncontrollably? If it's running out of memory, it's quite possible that it crashes. Not all memory errors can be handled safely.
Who said debugging would come for free?
Is that one tree per thread or are trees being handled by multiple threads? If threads don't share data, it can't be a threading issue (at least not from the POV of lxml).
It would be helpful if you could find out. In the worst case, you can inject a WSGI layer that simply acquires a lock while it forwards the request. Then you're sure it's single threaded.
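A minimal sketch of such a locking WSGI layer, using only the standard library. The class name is hypothetical and this is not part of Paste, Deliverance or lxml. It assumes that buffering each response in memory is acceptable while debugging, since the body is consumed inside the lock so that lazily generated chunks are also produced while it is held.

```python
import threading


class SerializingMiddleware:
    """WSGI middleware that lets only one request run at a time."""

    def __init__(self, app):
        self.app = app
        self.lock = threading.Lock()

    def __call__(self, environ, start_response):
        with self.lock:
            result = self.app(environ, start_response)
            try:
                # Build the full body before the lock is released.
                return [chunk for chunk in result]
            finally:
                # Honour the WSGI contract for iterables with close().
                close = getattr(result, "close", None)
                if close is not None:
                    close()
```

Wrapping the outermost application in the pipeline with this class means no two requests overlap inside it, whatever the server's thread pool is doing.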
Well, at least, if it can be reproduced, it can be tracked down and fixed.
Luckily, you don't have to. lxml is written in Cython, not in C. Stefan
Stefan Behnel wrote:
No, I wish we could. :( I'm basing this on the fact that (a) Unix deployments seem more common (b) no-one has reported this on Unix that I can see and (c) I've found at least one other person with Windows crashes. But who knows, I could be completely wrong. What I can say for certain is that the crashes do occur from time to time under relatively normal usage patterns.
We normally discover the error only after the process has crashed. There's no pre-warning. It looks like memory usage is relatively stable when the system is running normally. I'll try to take a closer look, though.
Heh, true. A *lot* of time has gone into this already. We're talking about a fairly big stack here, though. What I think we'll try, though, is to reproduce the problem with a load test suite and a static back end instead of having Plone in the mix. That should produce a relatively small WSGI pipeline and a manageable amount of code. If it still crashes, of course.
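A rough sketch of what such a stripped-down harness could look like, under stated assumptions: `static_backend` is a hypothetical stand-in for Plone, `hammer` is a hypothetical in-process load driver, and the theming layer under test would be wrapped around `static_backend` before it is passed in. It uses only the standard library and calls the WSGI app directly from worker threads rather than going through an HTTP server.

```python
from concurrent.futures import ThreadPoolExecutor
from wsgiref.util import setup_testing_defaults

STATIC_PAGE = b"<html><body><div id='content'>static</div></body></html>"


def static_backend(environ, start_response):
    """Stand-in for the real back end: always serves the same page."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [STATIC_PAGE]


def hammer(app, requests=1000, workers=8):
    """Call a WSGI app in-process from several threads; return status counts."""
    def one_request(_):
        environ = {}
        setup_testing_defaults(environ)
        seen = []

        def start_response(status, headers, exc_info=None):
            seen.append(status)

        body = app(environ, start_response)
        try:
            for _chunk in body:
                pass  # drain the response as a server would
        finally:
            close = getattr(body, "close", None)
            if close is not None:
                close()
        return seen[0]

    counts = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for status in pool.map(one_request, range(requests)):
            counts[status] = counts.get(status, 0) + 1
    return counts
```

Setting `workers=1` gives a strictly single-threaded run, which makes it possible to compare the two modes with the same code.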
One per thread almost certainly. They're read on each request as far as I can tell. I'd have to defer to the Deliverance developers, though.
Does anyone know? We're using Paste#httpserver and set threadpool_count = 1. I assume that means single threaded?
Yeah. That's basically what we're working towards now. But it's not straightforward, at least not in a way that we can give to other people to look at.
But libxml2 and libxslt are. I suppose it's conceivable the problem is there, or in the way they're statically linked perhaps? Not that I understand Cython either. ;-)
Thanks for your help!
Martin
--
Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book
Stefan Behnel wrote:
Unfortunately not. We tried to simplify the xpath expressions, but it still crashed (perhaps a bit less often). Our "solution" was to ditch Deliverance in favour of collective.xdv, which still uses lxml, but uses the XDV XSLT-based transformation process. So now, we're only using lxml to execute two XSLT files (the first one generates the second).
Martin
participants (2)
- Martin Aspeli
- Stefan Behnel