[lxml-dev] 'docinfo' property on ElementTree
Hi all,

I updated the trunk to give ElementTree objects access to the document information provided by the parser: DOCTYPE, XML version and original encoding. Paul Everitt had some use cases related to the HTML parser, but I think it's generally a good idea to make this kind of information available.

The new API works as follows:
>>> import lxml.etree
>>> from StringIO import StringIO

>>> pub_id = "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>> sys_url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
>>> doctype_string = '<!DOCTYPE html PUBLIC "%s" "%s">' % (pub_id, sys_url)
>>> xml_header = '<?xml version="1.0" encoding="ascii"?>'
>>> xhtml = xml_header + doctype_string + '<html><body></body></html>'

>>> et = lxml.etree.parse(StringIO(xhtml))
>>> docinfo = et.docinfo
>>> print docinfo.public_id
-//W3C//DTD XHTML 1.0 Transitional//EN
>>> print docinfo.system_url
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
>>> docinfo.doctype == doctype_string
True

>>> print docinfo.xml_version
1.0
>>> print docinfo.encoding
ascii
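For readers on Python 3, the same session can be sketched roughly as below. This is only an illustration, not part of the original change: it assumes a current lxml release, feeds the parser bytes through io.BytesIO instead of the Python 2 StringIO module, and the helper name read_docinfo is made up for this example.

```python
from io import BytesIO

from lxml import etree

PUB_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN"
SYS_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
DOCTYPE = '<!DOCTYPE html PUBLIC "%s" "%s">' % (PUB_ID, SYS_URL)
XHTML = (
    '<?xml version="1.0" encoding="ascii"?>'
    + DOCTYPE
    + "<html><body></body></html>"
)


def read_docinfo(xml_bytes):
    """Parse xml_bytes and collect the docinfo attributes in a dict.

    read_docinfo is a hypothetical helper for this sketch, not an lxml API.
    """
    tree = etree.parse(BytesIO(xml_bytes))
    info = tree.docinfo
    return {
        "public_id": info.public_id,
        "system_url": info.system_url,
        "doctype": info.doctype,
        "xml_version": info.xml_version,
        "encoding": info.encoding,
    }


summary = read_docinfo(XHTML.encode("ascii"))
for key, value in summary.items():
    print("%s: %s" % (key, value))
```

Passing bytes lets the parser read the encoding from the XML declaration itself. The default parser does not fetch the DTD, so no network access should be needed.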
This is backed by a DocInfo object that you can also instantiate on an ElementTree (or Element) by hand. The docinfo property just does it for you.

Any of the attributes above may be None if the information is not available.

Stefan
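As a rough illustration of the two points above, the sketch below parses a document that has no DOCTYPE and builds DocInfo objects by hand. It assumes Python 3 and a current lxml release in which the class is reachable as lxml.etree.DocInfo; the variable names are only for this example.

```python
from io import BytesIO

from lxml import etree

# A document without a DOCTYPE, so there is no public ID or system URL.
tree = etree.parse(BytesIO(b"<root><child/></root>"))

# Obtained through the property on the ElementTree ...
via_property = tree.docinfo

# ... or instantiated by hand on the tree or on one of its elements
# (assumes DocInfo is exposed as lxml.etree.DocInfo).
by_hand_tree = etree.DocInfo(tree)
by_hand_elem = etree.DocInfo(tree.getroot())

print(via_property.public_id)
print(via_property.system_url)
print(by_hand_tree.public_id)
print(by_hand_elem.xml_version == via_property.xml_version)
```

Code that reads these attributes should therefore check for None before using them, for example before formatting a DOCTYPE line for output.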
Stefan Behnel