[lxml-dev] lxml2.2 doctype missing
I noticed that the xhtml converted from the parse tree has doctype missing. I am using lxml 2.2. Is this bug still not fixed in lxml 2.2 ?
--
Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998
Mary Lei wrote:
I noticed that the xhtml converted from the parse tree has doctype missing. I am using lxml 2.2.
Is this bug still not fixed in lxml 2.2 ?
In order to convince others that this is a bug, you might want to provide some more information. Could you present a short code snippet that shows what you do and the (unexpected) result you get?
Stefan
here is an example:

#!/bin/sh
# next line restarts python \
"exec" "python" "-O" "$0" "$@"

import urllib
import urllib2
import urlparse
import os
import sys, getopt, difflib
import re
import string

version = sys.version_info
if version < (2,6):
    print "Need python version 2.6 or better, %s.%s too old!" % version
else:
    print "python version: ", version

from lxml.html import parse,submit_form,fromstring,tostring
import lxml.html
from lxml import etree
from StringIO import StringIO

url = "http://nsted.ipac.caltech.edu"
try:
    rc = urllib2.urlopen(url)
    contents = rc.read()
    rc.close()
except urllib2.HTTPError,e:
    print "Error: Page not found",e
    sys.exit(1)
except urllib2.URLError,e:
    print "Error: Connection refused ",e
    sys.exit(1)

print "contents-------------\n"+contents[0:300]
root = fromstring(contents)
fd = open ("tempfile", "w")
fd.write(contents)
fd.close()
root = parse("tempfile").getroot()
htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True)
print "htmlstr--------------\n"+htmlstr[0:300]
htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True,\
    include_meta_content_type=False,method='xml')
print "htmlstr1-------------\n"+htmlstr[0:300]
try:
    print root.docinfo.doctype
except AttributeError,e:
    print e
tree = etree.parse(StringIO("""<!DOCTYPE TS><TS></TS>"""))
print "doctype",tree.docinfo.doctype

Output:

python version: (2, 6, 2, 'final', 0)
contents------------- has original doctype
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Welcome to NStED</tit
htmlstr-------------- from lxml tostring, no doctype
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Welcome to NStED</title>
<script type="text/javascript" src="/js/util.js"></script><link rel="stylesheet" type="text/css" media="all" href="/css/style.css">
<link rel="stylesheet" type="text/css" media="all" href="/cs
htmlstr1------------- from lxml tostring, no doctype, convert as xml
<?xml version='1.0' encoding='iso-8859-1'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" xml:lang="en">
<head>
<title>Welcome to NStED</title>
<script type="text/javascript" src="/js/util.js"></script>
<link rel="styleshee
'HtmlElement' object has no attribute 'docinfo'
doctype <!DOCTYPE TS> <---- this one is ok

So it was in contents from urlopen but missing in lxml fromstring and then tostring. Am I missing something ?
-- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998
Mary Lei wrote:
here is an example: [...]
root = fromstring(contents)
fd = open ("tempfile", "w")
fd.write(contents)
fd.close()
root = parse("tempfile").getroot()
htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True)
## htmlstr-------------- from lxml tostring, no doctype
htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True,\
    include_meta_content_type=False,method='xml')
## htmlstr1------------- from lxml tostring, no doctype, convert as xml
tree = etree.parse(StringIO("""<!DOCTYPE TS><TS></TS>"""))
print "doctype",tree.docinfo.doctype
## doctype <!DOCTYPE TS> <---- this one is ok
So it was in contents from urlopen but missing in lxml fromstring and then tostring. Am I missing something ?
Yes. When you tell lxml to serialise an element, you get the element and nothing but that. If you want doctype declarations, DTDs, processing instructions and the like (i.e. stuff that doesn't belong to the element itself), you must wrap the element in an ElementTree and serialise that.
Stefan
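
To illustrate the advice above, here is a minimal sketch in current Python 3 syntax (not the Python 2 of the original script). The inline SOURCE document and the variable names are made up for the example, and it assumes a reasonably recent lxml; behaviour in lxml 2.2 may differ in details.

```python
from lxml import etree
import lxml.html

# Hypothetical stand-in for the page fetched in the original script.
SOURCE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    b"<html><head><title>Welcome</title></head>"
    b"<body><p>hello</p></body></html>"
)

# fromstring() hands back the root element, not the document.
root = lxml.html.fromstring(SOURCE)

# Serialising the element gives the element only: no doctype.
element_only = lxml.html.tostring(root)

# Wrapping the element in an ElementTree makes document-level
# information reachable again.
tree = etree.ElementTree(root)
whole_document = lxml.html.tostring(tree)

# docinfo lives on the tree, not on the element.
doctype = tree.docinfo.doctype

print(element_only[:60])
print(whole_document[:60])
print(doctype)
```

If the element came from a parsed document, root.getroottree() should give an equivalent tree without constructing a new wrapper.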
participants (2): Mary Lei, Stefan Behnel