[lxml-dev] Segmentation fault in lxml.html after pickling
This script crashes with a segmentation fault :/

Using Python-2.5.2, libxslt-1.1.9, libxml2-2.6.32, lxml-2.1beta2, linux-i686

#!/usr/bin/python
# coding=utf-8
import cPickle
import lxml, lxml.html

html = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Test Page</title>
</head>
<body>
Test Page
</body>
</html>'''

tree = lxml.html.fromstring(html)

cf = open('test.pcl', 'w')
cPickle.dump(tree, cf, -1)
cf.close()

cf = open('test.pcl', 'r')
pickled_tree = cPickle.load(cf)
cf.close()

print 'This works fine...'
lxml.html.tostring(tree)
print 'This crashes...'
lxml.html.tostring(pickled_tree)
Marcel Hellkamp wrote:
This script crashes with a segmentation fault :/
import cPickle
[...]
tree = lxml.html.fromstring(html)

cf = open('test.pcl', 'w')
cPickle.dump(tree, cf, -1)
cf.close()

cf = open('test.pcl', 'r')
pickled_tree = cPickle.load(cf)
cf.close()
Yes, you can't pickle Elements in lxml.etree. This feature is currently only available in lxml.objectify, where Elements behave a lot more like Python objects. I think it makes a little less sense in lxml.etree where you'd have to keep some more state about the Element classes used inside the tree.

I'm not sure how valuable this is in lxml.html. Could you describe your use case a little?

Stefan
Stefan Behnel wrote: [snip]
I think it makes a little less sense in lxml.etree where you'd have to keep some more state about the Element classes used inside the tree. I'm not sure how valuable this is in lxml.html.
I'd love it if I could somehow store lxml trees in the ZODB, and that'd need pickle support. Whether it could be made to be efficient I don't know - you'd not want the whole tree to be pickled as a whole in case of large trees, but some form of partitioning scheme into separate pickles. You're right that custom-element binding would be nice in this case, and that means the pickle can't simply be the XML content unless it's somehow annotated first.

Anyway, this is a rather out there use case. I am just intrigued to learn that objectify elements can be pickled.

Regards,
Martijn
Martijn Faassen wrote:
I'd love it if I could somehow store lxml trees in the ZODB, and that'd need pickle support. Whether it could be made to be efficient I don't know - you'd not want the whole tree to be pickled as a whole in case of large trees, but some form of partitioning scheme into separate pickles. You're right that custom-element binding would be nice in this case, and that means the pickle can't simply be the XML content unless it's somehow annotated first.
Anyway, this is a rather out there use case. I am just intrigued to learn that objectify elements can be pickled.
It's just easier to do in objectify, as it has a pretty comprehensive setup for Element class mapping. If you want to be sure to get back exactly the same Element tree after pickling, you can just annotate() an objectify tree before pickling it.

Doing the same thing in lxml.etree would require storing some information about the current Element lookup, which may be a lot of information, e.g. for the namespace class setup. That's a parser-local setup, so we can't just use the setup of the default parser either but need a concrete context for the unpickling.

lxml.html might be considered having such a context in a similar way lxml.objectify has it, as it comes with its own classes and lookup scheme.

Stefan
Stefan Behnel wrote:
Martijn Faassen wrote:
I'd love it if I could somehow store lxml trees in the ZODB, and that'd need pickle support. Whether it could be made to be efficient I don't know - you'd not want the whole tree to be pickled as a whole in case of large trees, but some form of partitioning scheme into separate pickles. You're right that custom-element binding would be nice in this case, and that means the pickle can't simply be the XML content unless it's somehow annotated first.
Anyway, this is a rather out there use case. I am just intrigued to learn that objectify elements can be pickled.
It's just easier to do in objectify, as it has a pretty comprehensive setup for Element class mapping. If you want to be sure to get back exactly the same Element tree after pickling, you can just annotate() an objectify tree before pickling it.
Doing the same thing in lxml.etree would require storing some information about the current Element lookup, which may be a lot of information, e.g. for the namespace class setup. That's a parser-local setup, so we can't just use the setup of the default parser either but need a concrete context for the unpickling.
lxml.html might be considered having such a context in a similar way lxml.objectify has it, as it comes with its own classes and lookup scheme.
Just what would end up being pickled, do you think? The entire document?

A first thought is that the document gets pickled, and then the element is an offset in that document. Like, erm...

class HtmlMixin:

    def __getstate__(self):
        return (self.getroottree(), self._indexes_to_self())

    def _indexes_to_self(self):
        result = []
        el = self
        # Compare against None: the truth value of an Element depends on its children.
        while el.getparent() is not None:
            result.insert(0, el.getparent().index(el))
            el = el.getparent()
        return result

    def __setstate__(self, state):
        # Dammit... this doesn't actually work:
        doc, indexes_to_self = state
        el = doc.getroot()
        for index in indexes_to_self:
            el = el[index]
        return el

There is no return value for __setstate__, and no way to indicate a constructor method for creating instances. That's dumb. I don't like pickle.

For documents, if the pickle hooks worked reasonably I'd just store the serialization of the document (as a string) plus all the special attributes (doctype, url, etc). Given that the hooks don't work reasonably I'm not sure how to do it; maybe people with enough ZODB experience to have hit this problem would have an idea?

From what I can tell there's no reason to store the document as anything but a string: serializing and re-parsing the string is faster than any other means of storing a document (it all ends up as strings eventually anyway).

-- 
Ian Bicking : ianb@colorstudy.com : http://blog.ianbicking.org
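For what it's worth, pickle's __reduce__ hook does let an object name the callable that rebuilds it, which sidesteps the missing return value of __setstate__. Below is a minimal sketch of the document-plus-offset idea under some loud assumptions: it targets Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra dependencies, and ElementHandle, index_path and rebuild are hypothetical names for illustration, not lxml API.

```python
import pickle
import xml.etree.ElementTree as ET  # stdlib stand-in; not lxml


def index_path(root, target):
    """Return the child indexes leading from root down to target."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    path = []
    el = target
    while el is not root:
        parent = parents[el]
        path.insert(0, list(parent).index(el))
        el = parent
    return path


def rebuild(xml_bytes, path):
    """Module-level constructor: re-parse the document, then walk the indexes."""
    root = ET.fromstring(xml_bytes)
    element = root
    for index in path:
        element = element[index]
    return ElementHandle(root, element)


class ElementHandle:
    """Hypothetical wrapper pairing a document root with one element in it."""

    def __init__(self, root, element):
        self.root = root
        self.element = element

    def __reduce__(self):
        # Tell pickle which callable to invoke, and with which arguments.
        return (rebuild, (ET.tostring(self.root), index_path(self.root, self.element)))


root = ET.fromstring(
    "<html><head><title>Test Page</title></head>"
    "<body><p>one</p><p>two</p></body></html>"
)
handle = ElementHandle(root, root[1][1])
copy = pickle.loads(pickle.dumps(handle, -1))
print(copy.element.tag, copy.element.text)
```

The pickle holds only the serialized document and a list of indexes, so the restored handle points into a freshly parsed tree rather than sharing anything with the original.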
Ian Bicking wrote:
A first thought is that the document gets pickled, and then the element is an offset in that document.
That's a brilliant idea, but why so complicated? :)

pickle:
    doc = self.getroottree()
    return (tostring(doc), doc.getpath(self))

unpickle:
    data, path = pickle_value
    doc = fromstring(data)       # re-parse the serialised document
    return doc.xpath(path)[0]    # xpath() returns a list of matches

would do the trick. Maybe we should serialise as XML instead of HTML, so that we don't run into any "relaxed parser" problems (I remember a not so old libxml2 HTML serialiser bug with <embed> roundtrips, for example).
There is no return value for __setstate__, and no way to indicate a constructor method for creating instances. That's dumb. I don't like pickle.
:) You don't have to use __[sg]etstate__(). You can define an external function to do it for you, just like objectify does (search src/lxml/lxml.objectify.pyx for "pickle"). The stupid thing is that this function has to be registered /and/ public. It's not enough to register it and delete it afterwards...

Still, the problem remains that we need to assure we keep the element lookup context, so this is still not a general solution for lxml.etree. But it should be suitable for lxml.html.

Stefan
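The external-function route can be sketched with the standard library's copy_reg module (named copyreg in Python 3), where a reducer registered for a type is used in place of any hooks on the type itself. This is only an illustration under stated assumptions, not the code in lxml.objectify: it targets Python 3, uses xml.etree.ElementTree as a stand-in for lxml, and Fragment, fragment_from_xml and reduce_fragment are hypothetical names.

```python
import copyreg
import pickle
import xml.etree.ElementTree as ET  # stdlib stand-in; not lxml


class Fragment:
    """Hypothetical class standing in for a type we cannot add hooks to."""

    def __init__(self, element):
        self.element = element


def fragment_from_xml(xml_bytes):
    # Reconstructor: pickle stores it by name, so it must stay importable.
    return Fragment(ET.fromstring(xml_bytes))


def reduce_fragment(fragment):
    # Reducer: returns (callable, args) describing how to rebuild the object.
    return (fragment_from_xml, (ET.tostring(fragment.element),))


# Register the reducer; pickle consults this table before the class's own hooks.
copyreg.pickle(Fragment, reduce_fragment)

original = Fragment(ET.fromstring("<root><item>1</item></root>"))
data = pickle.dumps(original, -1)
restored = pickle.loads(data)
print(restored.element.find("item").text)
```

Because the reconstructor is pickled by reference, it has to remain reachable under its module-level name when the data is loaded, which fits the remark that registering a function and deleting it afterwards is not enough.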
participants (4)
- Ian Bicking
- Marcel Hellkamp
- Martijn Faassen
- Stefan Behnel