Mailman 3 [lxml-dev] lxml2.2 doctype missing - lxml - The Python XML Toolkit - python.org


[lxml-dev] lxml2.2 doctype missing


Mary Lei

Aug. 5, 2009

4:48 p.m.

I noticed that the xhtml converted from the parse tree has doctype missing. I am using lxml 2.2.

Is this bug still not fixed in lxml 2.2?

-- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998


Stefan Behnel

August 2009

5:31 p.m.

Mary Lei wrote:

I noticed that the xhtml converted from the parse tree has doctype missing. I am using lxml 2.2.

Is this bug still not fixed in lxml 2.2 ?

In order to convince others that this is a bug, you might want to provide some more information. Could you present a short code snippet that shows what you do and the (unexpected) result you get?

Stefan


Mary Lei

6:30 p.m.

here is an example:

#!/bin/sh
# next line restarts python \
"exec" "python" "-O" "$0" "$@"

import urllib
import urllib2
import urlparse
import os
import sys, getopt, difflib
import re
import string

version = sys.version_info
if version < (2,6):
    print "Need python version 2.6 or better, %s.%s too old!" % version
else:
    print "python version: ", version

from lxml.html import parse,submit_form,fromstring,tostring
import lxml.html
from lxml import etree
from StringIO import StringIO

url = "http://nsted.ipac.caltech.edu"
try:
    rc = urllib2.urlopen(url)
    contents = rc.read()
    rc.close()
except urllib2.HTTPError,e:
    print "Error: Page not found",e
    sys.exit(1)
except urllib2.URLError,e:
    print "Error: Connection refused ",e
    sys.exit(1)

print "contents-------------\n"+contents[0:300]

root = fromstring(contents)
fd = open ("tempfile", "w")
fd.write(contents)
fd.close()

root = parse("tempfile").getroot()
htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True)
print "htmlstr--------------\n"+htmlstr[0:300]

htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True,\
    include_meta_content_type=False,method='xml')
print "htmlstr1-------------\n"+htmlstr[0:300]

try:
    print root.docinfo.doctype
except AttributeError,e:
    print e

tree = etree.parse(StringIO("""<!DOCTYPE TS><TS></TS>"""))
print "doctype",tree.docinfo.doctype

Output:

python version: (2, 6, 2, 'final', 0)

contents------------- has original doctype
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Welcome to NStED</tit

htmlstr-------------- from lxml tostring, no doctype
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Welcome to NStED</title>
<script type="text/javascript" src="/js/util.js"></script><link rel="stylesheet" type="text/css" media="all" href="/css/style.css">
<link rel="stylesheet" type="text/css" media="all" href="/cs

htmlstr1------------- from lxml tostring, no doctype, convert as xml
<?xml version='1.0' encoding='iso-8859-1'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" xml:lang="en">
<head>
<title>Welcome to NStED</title>
<script type="text/javascript" src="/js/util.js"></script>
<link rel="styleshee

'HtmlElement' object has no attribute 'docinfo'

doctype <!DOCTYPE TS> <---- this one is ok

So it was in contents from urlopen but missing in lxml fromstring and then tostring. Am I missing something?

Stefan Behnel wrote:

Mary Lei wrote:

...
I noticed that the xhtml converted from the parse tree has doctype missing. I am using lxml 2.2.

Is this bug still not fixed in lxml 2.2 ?

In order to convince others that this is a bug, you might want to provide some more information. Could you present a short code snippet that shows what you do and the (unexpected) result you get?

Stefan

-- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998


Stefan Behnel

6:56 p.m.

Mary Lei wrote:

here is an example:
[...]
root = fromstring(contents)
fd = open ("tempfile", "w")
fd.write(contents)
fd.close()

root = parse("tempfile").getroot()
htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True)
## htmlstr-------------- from lxml tostring, no doctype

htmlstr = lxml.html.tostring(root,\
    encoding="iso-8859-1",pretty_print=True,\
    include_meta_content_type=False,method='xml')
## htmlstr1------------- from lxml tostring, no doctype, convert as xml

tree = etree.parse(StringIO("""<!DOCTYPE TS><TS></TS>"""))
print "doctype",tree.docinfo.doctype
## doctype <!DOCTYPE TS> <---- this one is ok

So it was in contents from urlopen but missing in lxml fromstring and then tostring. Am I missing something ?

Yes. When you tell lxml to serialise an element, you get the element and nothing but that. If you want doctype declarations, DTDs, processing instructions and the like (i.e. stuff that doesn't belong to the element itself), you must wrap the element in an ElementTree and serialise that.

Stefan
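For readers landing on this thread, here is a minimal sketch of that advice. It assumes a current lxml on Python 3 rather than the lxml 2.2 / Python 2.6 setup from the report, and the `contents` string is a made-up stand-in for the fetched page. It obtains the tree via `root.getroottree()`, which is one way to get at the ElementTree an element was parsed into; the variable names are illustrative only.

```python
import lxml.html

# Hypothetical stand-in for the page contents fetched in the thread.
contents = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html><head><title>Welcome</title></head>'
    '<body><p>Hello</p></body></html>'
)

# fromstring() returns the root element, not the document.
root = lxml.html.fromstring(contents)

# Serialising the element yields the element and nothing else.
element_only = lxml.html.tostring(root, encoding="unicode")

# The doctype is document-level information, so it is reached through
# the tree the element belongs to, not through the element itself.
tree = root.getroottree()
doctype = tree.docinfo.doctype

# Serialising the tree includes the doctype declaration.
with_doctype = lxml.html.tostring(tree, encoding="unicode")

print(doctype)
print(with_doctype)
```

This also accounts for the `'HtmlElement' object has no attribute 'docinfo'` message in the report: `docinfo` is an attribute of the tree, so it has to be read from `tree` rather than from `root`.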

