Hi all,
First off, I must admit limited abilities with debugging segfaults, so apologies for that.
I run a configuration management server, bcfg2, which is Python-based and heavily utilises python-lxml. The server runs Debian wheezy (until recently squeeze, which exhibited the problem too).
It appears on the face of it that it's segfaulting during a Python Genshi template which reads in some XML.
Please see the gdb analysis attached. Any advice, pointers or next steps on how to determine the cause of the segfault would be greatly appreciated.
Thanks in advance.
Matt
--
Matthew Baker :: Unix/Security Team Lead Infrastructure, Systems and Operations @University of Bristol Team email: it-sysops@bristol.ac.uk Tel: +44(0)117 3317467 Add: Uni of Bristol, Computer Centre, Tyndal Ave, Bristol. BS8 1UD
Hi,
thanks for the report and the details.
Matt Baker, 18.12.2013 18:23:
First off, I must admit limited abilities with debugging segfaults, so apologies for that.
I run a configuration management server bcfg2 which is python based heavily utilising python-lxml. The server runs debian wheezy (recently squeeze which exhibited the problem too).
It appears on the face of it that it's segfaulting during a python genshi template which reads in some XML.
Please see the gdb analysis attached. Any advice or pointers or next steps in how to determine the cause of the segfault would be greatly appreciated.
Wheezy has lxml 2.3.2, that's pretty old. Could you try the latest release?
Stefan
Hi again.
Sorry, the last reply to this thread didn't reach me, so I didn't see it until I searched the archives. Apologies for the late response.
I've upgraded to a backported copy of the version in Debian experimental (http://packages.debian.org/experimental/python-lxml), 3.3.0~beta2-1, which is one patch release behind the current release (beta3).
I still get a segfault. Please see the full gdb output attached. A quick look through the output seems to suggest it's happened (this time) due to a call to element.getparent()?
Thanks in advance.
Matt
On 18 December 2013 17:23, Matt Baker <Matt.Baker@bristol.ac.uk> wrote:
Hi all,
First off, I must admit limited abilities with debugging segfaults, so apologies for that.
I run a configuration management server bcfg2 which is python based heavily utilising python-lxml. The server runs debian wheezy (recently squeeze which exhibited the problem too).
It appears on the face of it that it's segfaulting during a python genshi template which reads in some XML.
Please see the gdb analysis attached. Any advice or pointers or next steps in how to determine the cause of the segfault would be greatly appreciated.
Thanks in advance.
Matt
Matt Baker, 09.01.2014 18:53:
On 18 December 2013 17:23, Matt Baker wrote:
First off, I must admit limited abilities with debugging segfaults, so apologies for that.
I run a configuration management server bcfg2 which is python based heavily utilising python-lxml. The server runs debian wheezy (recently squeeze which exhibited the problem too).
It appears on the face of it that it's segfaulting during a python genshi template which reads in some XML.
Please see the gdb analysis attached. Any advice or pointers or next steps in how to determine the cause of the segfault would be greatly appreciated.
I've upgraded to a backported copy of the version in Debian experimental http://packages.debian.org/experimental/python-lxml
3.3.0~beta2-1 which is one patch release behind the current release (beta3).
I still get a segfault. Please see the full gdb output attached. A quick look through the output seems to suggest it's happened (this time) due to a call to element.getparent()?
Difficult to say what might go wrong here, but the C code line where it crashes should be this one:
__pyx_t_1 = (((PyObject *)__pyx_v_doc->_parser->_class_lookup) != Py_None);
You didn't copy the actual error message of gdb, but seeing the stack trace and that the pointer value of "__pyx_v_doc" doesn't look unreasonable, my guess is that "__pyx_v_doc->_parser" isn't what it's supposed to be, i.e. the link from the document to the parser that parsed it isn't properly set up. Your gdb seems to have Python support, so it would be great if you could dig around in those Python objects a bit.
Could you figure out how this document was parsed, using what API functionality? Is it completely parsed at the point of the crash or is it using something like iterparse() and hasn't finished yet? The Python-level stack trace might give a hint here.
Do you have any idea at what position in the document it currently is, i.e. what Element object it is looking at, and what its parents are that the code is trying to look up? Could you provide a relevant snippet of that document? Is it always the same document where it crashes? Is there anything special about it?
Stefan
Hi Stefan,
Many thanks for your reply. Comments below.
On 9 January 2014 19:39, Stefan Behnel <stefan_ml@behnel.de> wrote:
I still get a segfault. Please see the full gdb output attached. A quick look through the output seems to suggest it's happened (this time) due to a call to element.getparent()?
Difficult to say what might go wrong here, but the C code line where it crashes should be this one:
__pyx_t_1 = (((PyObject *)__pyx_v_doc->_parser->_class_lookup) != Py_None);
You didn't copy the actual error message of gdb, but seeing the stack trace and that the pointer value of "__pyx_v_doc" doesn't look unreasonable, my guess is that "__pyx_v_doc->_parser" isn't what it's supposed to be, i.e. the link from the document to the parser that parsed it isn't properly set up. Your gdb seems to have Python support, so it would be great if you could dig around in those Python objects a bit.
Could you figure out how this document was parsed, using what API functionality? Is it completely parsed at the point of the crash or is it using something like iterparse() and hasn't finished yet? The Python-level stack trace might give a hint here.
Loading XML files is an inherent property of bcfg2. It seems to stem originally from this call:
self.xdata = lxml.etree.XML(self.data, base_url=self.name, parser=Bcfg2.Server.XMLParser)
where self.data is the file's contents: self.data = open(self.name).read()
where base_url is the file name
and where Bcfg2.Server.XMLParser = lxml.etree.XMLParser(remove_blank_text=True)
https://github.com/Bcfg2/bcfg2/blob/maint/src/lib/Bcfg2/Server/Plugin/helper...
The class XMLFileBacked has a method Index() which is called when a file change is detected by a file monitor (inotify in our case).
So it looks like it's read in in its entirety.
Do you have any idea at what position in the document it currently is, i.e. what Element object it is looking at, and what its parents are that the code is trying to look up? Could you provide a relevant snippet of that document? Is it always the same document where it crashes? Is there anything special about it?
I do:
<Project posixuser="ebl" posixgroup="ebl" name="ebl" description="eBiolabs">
  <Contact role="service-manager" name="sh0923"/>
  <Contact role="service-manager" name="jn13044"/>
  <Instance status="production" name="prod">
    <Application type="apache" affinity="w-php-p16.soms.bris.ac.uk">
      <VirtualHost default="true" ipaddr="137.222.7.203" name="ebl.soms.bris.ac.uk" port="80"/>
      <VirtualHost ipaddr="137.222.7.203" name="ebl-prod.soms.bris.ac.uk" port="80"/>
      <VirtualHost ssl="true" ipaddr="137.222.7.203" name="ebl.soms.bris.ac.uk" port="443"/>
    </Application>
  </Instance>
  <Instance status="development" name="dev">
---->    <Application type="apache" affinity="w-php-d8.soms.bris.ac.uk">
      <VirtualHost default="true" ipaddr="137.222.7.198" name="ebl-dev.soms.bris.ac.uk" port="80"/>
    </Application>
  </Instance>
  <Contact role="service-manager" name="ip13705"/>
</Project>
It doesn't appear to crash at the same place. I had another segfault today; I'll attach the gdb output. There are plenty of other element structures like this one in this file. It's not particularly special from what I can see.
Thanks for the help.
Matt
Hi,
thanks for the insights.
Matt Baker, 10.01.2014 14:22:
On 9 January 2014 19:39, Stefan Behnel wrote:
I still get a segfault. Please see the full gdb output attached. A quick look through the output seems to suggest it's happened (this time) due to a call to element.getparent()?
Difficult to say what might go wrong here, but the C code line where it crashes should be this one:
__pyx_t_1 = (((PyObject *)__pyx_v_doc->_parser->_class_lookup) != Py_None);
You didn't copy the actual error message of gdb, but seeing the stack trace and that the pointer value of "__pyx_v_doc" doesn't look unreasonable, my guess is that "__pyx_v_doc->_parser" isn't what it's supposed to be, i.e. the link from the document to the parser that parsed it isn't properly set up. Your gdb seems to have Python support, so it would be great if you could dig around in those Python objects a bit.
Could you figure out how this document was parsed, using what API functionality? Is it completely parsed at the point of the crash or is it using something like iterparse() and hasn't finished yet? The Python-level stack trace might give a hint here.
Loading XML files is an inherent property of bcfg2. It seems to stem originally from this call
self.xdata = lxml.etree.XML(self.data, base_url=self.name, parser=Bcfg2.Server.XMLParser)
where self.data is a file open: self.data = open(self.name).read()
where base_url is the file name
and where Bcfg2.Server.XMLParser = lxml.etree.XMLParser(remove_blank_text=True)
https://github.com/Bcfg2/bcfg2/blob/maint/src/lib/Bcfg2/Server/Plugin/helper...
The Class XMLFileBacked has a method Index() which is called when a file change is detected by a file monitor (inotify in our case).
So it looks like it's read in in entirety.
Ok, what I think this does is as follows, correct me if I'm wrong.
It parses XML documents in different threads and caches them. Then it processes the cached documents, potentially in other threads. Is that how it works?
It also seems to reuse a globally defined parser, which means that the parsing itself is serialised across threads (a parser instance isn't re-entrant).
This is a bit of a bad design because it means that lxml needs to do a lot of work to adapt documents across threads. It would be much more efficient (and likely also safer) if the documents were cached locally in each thread. Or even cached globally as strings and then parsed separately by each thread. Parsing is actually fairly cheap.
I'm saying "likely also safer" because doing these adaptations across threads may lead to concurrent modifications in different threads. There isn't currently a good locking mechanism in place that would prevent them, so triggering these thread adaptations concurrently is pretty loudly asking for trouble.
If the above is more or less how things work, then my advice would be to change bcfg2 to either
a) not cache the trees at all, or
b) cache them in serialised form (i.e. as byte strings) and let each thread freshly parse them, or
c) use independent thread-local caches for them, or
d) deep copy them when a thread requests them from the cache. Deep copying is also very cheap, even cheaper than parsing.
Depending on how large these trees are and how often they are requested (strings are much more memory efficient than trees), I'd try b) or d) first.
Does this help?
Stefan
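For readers following the thread, options b) and d) above could be sketched roughly as follows. This is only an illustration under stated assumptions, not bcfg2's code: the class names SerializedCache and CopyingCache and their store()/get_tree() methods are hypothetical, the default parser is used instead of bcfg2's remove_blank_text parser, and the fallback import exists only so the sketch also runs where lxml is not installed.

```python
import copy
import threading

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fallback only so the sketch runs without lxml installed.
    import xml.etree.ElementTree as etree


class SerializedCache:
    """Option b): cache byte strings; every caller parses its own tree.

    Hypothetical helper, not part of bcfg2 or lxml.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def store(self, name, xml_bytes):
        with self._lock:
            self._data[name] = xml_bytes

    def get_tree(self, name):
        with self._lock:
            xml_bytes = self._data[name]
        # Parsed in the calling thread, so no tree is shared across threads.
        return etree.fromstring(xml_bytes)


class CopyingCache:
    """Option d): keep one master tree and hand out deep copies only.

    Hypothetical helper, not part of bcfg2 or lxml.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._trees = {}

    def store(self, name, xml_bytes):
        root = etree.fromstring(xml_bytes)
        with self._lock:
            self._trees[name] = root

    def get_tree(self, name):
        # Copy while holding the lock, so only one thread touches the
        # master tree at any time.
        with self._lock:
            return copy.deepcopy(self._trees[name])
```

In both variants a worker thread only ever modifies a tree it obtained from get_tree(), so changes made by one thread are not visible to the others and the cached data stays untouched.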
Hi Stefan,
many thanks for the comments. I've pointed the main bcfg2 devs at the thread and they had some opinions on how some of it could be implemented, but I think it's going to require some heavy changes.
I'll come back to you if I have any further questions.
Thanks for your help.
Matt
On 10 January 2014 17:55, Stefan Behnel <stefan_ml@behnel.de> wrote:
Hi,
thanks for the insights.
Matt Baker, 10.01.2014 14:22:
On 9 January 2014 19:39, Stefan Behnel wrote:
I still get a segfault. Please see the full gdb output attached. A quick look through the output seems to suggest it's happened (this time) due to a call to element.getparent()?
Difficult to say what might go wrong here, but the C code line where it crashes should be this one:
__pyx_t_1 = (((PyObject *)__pyx_v_doc->_parser->_class_lookup) != Py_None);
You didn't copy the actual error message of gdb, but seeing the stack trace and that the pointer value of "__pyx_v_doc" doesn't look unreasonable, my guess is that "__pyx_v_doc->_parser" isn't what it's supposed to be, i.e. the link from the document to the parser that parsed it isn't properly set up. Your gdb seems to have Python support, so it would be great if you could dig around in those Python objects a bit.
Could you figure out how this document was parsed, using what API functionality? Is it completely parsed at the point of the crash or is it using something like iterparse() and hasn't finished yet? The Python-level stack trace might give a hint here.
Loading XML files is an inherent property of bcfg2. It seems to stem originally from this call
self.xdata = lxml.etree.XML(self.data, base_url=self.name, parser=Bcfg2.Server.XMLParser)
where self.data is a file open: self.data = open(self.name).read()
where base_url is the file name
and where Bcfg2.Server.XMLParser = lxml.etree.XMLParser(remove_blank_text=True)
https://github.com/Bcfg2/bcfg2/blob/maint/src/lib/Bcfg2/Server/Plugin/helper...
The Class XMLFileBacked has a method Index() which is called when a file change is detected by a file monitor (inotify in our case).
So it looks like it's read in in entirety.
Ok, what I think this does is as follows, correct me if I'm wrong.
It parses XML documents in different threads and caches them. Then it processes the cached documents, potentially in other threads. Is that how it works?
It also seems to reuse a globally defined parser, which means that the parsing itself is serialised across threads (a parser instance isn't re-entrant).
This is a bit of a bad design because it means that lxml needs to do a lot of work to adapt documents across threads. It would be much more efficient (and likely also safer) if the documents were cached locally in each thread. Or even cached globally as strings and then parsed separately by each thread. Parsing is actually fairly cheap.
I'm saying "likely also safer" because doing these adaptations across threads may lead to concurrent modifications in different threads. There isn't currently a good locking mechanism in place that would prevent them, so triggering these thread adaptations concurrently is pretty loudly asking for trouble.
If the above is more or less how things work, then my advice would be to change bcfg2 to either a) not cache the trees at all, or b) cache them in serialised form (i.e. as byte strings) and let each thread freshly parse them, or c) use independent thread-local caches for them, or d) deep copy them when a thread requests them from the cache. Deep copying is also very cheap, even cheaper than parsing.
Depending on how large these trees are and how often they are requested (strings are much more memory efficient than trees), I'd try b) or d) first.
Does this help?
Stefan
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
participants (2)
- Matt Baker
- Stefan Behnel