[lxml-dev] Premature end of data in tag - but it looks well formed
Hi gents,

Firstly, thanks for lxml. It's by far the nicest tool for someone who needs to do xpath in python without being an XML god.

I'm a first time user of lxml attempting to etree.parse a document. My code (below) works fine on some sample text, but libxml complains about the real data with:

    etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5

The data is below. Line 5 seems OK to me, but I'm new to XML coding so maybe I'm missing something.

__________________________________
1
2
3 <?xml version="1.0" encoding="iso-8859-1"?>
4 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
5 <html xmlns="http://www.w3.org/1999/xhtml">
__________________________________

Any ideas? The full code is below.

Cheers,
Mike

    #!/usr/bin/env python
    import urllib, sys, lxml, StringIO
    from lxml import etree
    from StringIO import StringIO

    ## Use http://www.someproxy.com:3128 for http proxying
    proxies = {'http': 'http://xpvm:3128'}
    url = 'http://peoplesearch.in.telstra.com.au:8094/peoplesearch/userdetail.aspx?Base...'
    filehandle = urllib.urlopen(url, proxies=proxies)
    print filehandle

    ## Real html
    html = filehandle.read()
    ## Test html
    #html = "<foo><bar><baz>underpants</baz></bar></foo>"

    print "--------------------------------"
    print html
    print '=========================='

    f = StringIO(html)
    tree = etree.parse(f)

    ## Real xpath
    r = tree.xpath('/html/body/div[4]/form/div[3]/div/div/div/div/table/tbody/tr[6]/td')
    ## Test xpath
    #r = tree.xpath('/foo/bar/baz')

    print 'length:'
    print len(r)
    print 'tag:'
    print r[0].tag
    print 'contents:'
    print r[0].text

________________________________________________
Mike MacCana
Technical Specialist
Australia Linux and Virtualisation Services
IBM Global Services
Level 14, 60 City Rd Southgate Vic 3000
Phone: +61-3-8656-2138 Fax: +61-3-8656-2423
Email: mmaccana@au1.ibm.com
Hi,

Mike MacCana wrote:
Hi gents,
Are you sure you don't want advice from any girls?
I'm a first time user of lxml attempting to etree.parse a document. My code (below) works fine on some sample text, but libxml complains about the real data with:
etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5
The data is below. Line 5 seems OK to me, but I'm new to XML coding so maybe I'm missing something.
The problem is not in line 5 (where the html tag starts) but in line 196, where it apparently ends. Try validating it at the W3C validator if you don't believe lxml. ;)

Stefan
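Stefan's point is easy to reproduce with a tiny example (shown here with Python's standard-library parser rather than lxml, purely so the snippet is self-contained): when a root tag is never closed, the parser only discovers the problem when it runs out of data, so the error is reported at the end of the input, not at the line where the tag was opened.

```python
import xml.etree.ElementTree as ET

# The <html> tag opens on line 1 but is never closed, so the
# parser cannot complain until it hits the end of the input.
broken = "<html>\n<body>\n<p>truncated"
try:
    ET.fromstring(broken)
except ET.ParseError as e:
    # The reported position is at the end of the data (line 3),
    # not at line 1 where <html> started.
    print(e)
```

The same logic explains the original message: "line 196" is where the data ran out, "line 5" is merely where the still-open html tag began.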
Ladies and gentlemen,

On Tue, 2008-07-01 at 07:24 +0200, Stefan Behnel wrote:
Hi,
Mike MacCana wrote:
Hi gents,
Are you sure you don't want advice from any girls?
I'm a first time user of lxml attempting to etree.parse a document. My code (below) works fine on some sample text, but libxml complains about the real data with:
etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5
The data is below. Line 5 seems OK to me, but I'm new to XML coding so maybe I'm missing something.
The problem is not in line 5 (where the html tag starts) but in line 196, where it apparently ends. Try validating it at the W3C validator if you don't believe lxml. ;)
Thanks Stefan. I solved the crap HTML problem as follows. Hopefully the following will be useful to anyone beginning XPath with lxml.

    #!/usr/bin/env python
    import urllib, sys, lxml, StringIO, lxml.html, os
    from lxml import etree
    from StringIO import StringIO
    from lxml.html.clean import Cleaner

    ## Point this at your XP VM used to get to Telstra
    proxies = {'http': 'http://xpvm:3128'}
    url = 'http://domain.com/page'

    ## Function to strip non-ascii characters
    ## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
    ## for list
    def onlyascii(char):
        if ord(char) < 32 or ord(char) > 176:
            return ''
        else:
            return char

    ## Open the URL and read its contents
    filehandle = urllib.urlopen(url, proxies=proxies)
    html = filehandle.read()
    asciihtml = filter(onlyascii, html)

    ## Customer's HTML content is REALLY bad. Clean it.
    ## See http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html
    ## and 'pydoc lxml.html.clean.Cleaner'
    ## Clean HTML and strip a bunch of tags that are broken and that we don't care about.
    badtags = ['img', 'a', 'div', 'span', 'h2', 'h1', 'style', 'title', 'ul', 'li', 'col']
    cleaner = Cleaner(page_structure=False, links=False, remove_tags=badtags)

    ## We can now access our cleaned content as 'cleanedcontent'
    cleanedcontent = cleaner.clean_html(asciihtml)

    ## Save clean content to disk for debugging purposes
    if os.path.exists('debug.html'):
        os.remove('debug.html')
    outputfile = open('debug.html', 'w')
    outputfile.write(cleanedcontent)
    outputfile.close()

    ## Go parse our content
    cleanedcontentstringio = StringIO(cleanedcontent)
    parser = etree.XMLParser(recover=True)
    tree = etree.parse(cleanedcontentstringio, parser)

    ## XPath locations of what we're interested in (element zero is all we care about).
    ## .text is the text within the tags; strip off any whitespace.
    ## You can find XPath locations by loading up 'debug.html' in Firefox with the Firebug extension.
    name = tree.xpath('/html/body/table/tbody/tr/td')[0].text.strip()
    email = tree.xpath('/html/body/table/tbody/tr[7]/td')[0].text.strip().lower()
    print name + "," + email

Cheers,
Mike
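As a side note on the recover=True parser used above: a recovering lxml parser will build a tree even from truncated input, closing any still-open tags at the end of the data. A minimal sketch (tag names invented for illustration):

```python
from lxml import etree

# Truncated document: <root> and <item> are never closed.
broken = "<root><item>text"
tree = etree.fromstring(broken, etree.XMLParser(recover=True))

# The recovering parser closes the open tags for us,
# so the tree is usable despite the broken input.
print(etree.tostring(tree))
```

This is what lets the script above survive markup that a strict XML parse would reject outright.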
Hi,

Mike MacCana wrote:
I solved the crap HTML problem as follows. Hopefully the following will be useful to anyone beginning XPath with lxml.
Just adding a few comments as I see fit.
    ## Function to strip non-ascii characters
    ## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
    ## for list
    def onlyascii(char):
        if ord(char) < 32 or ord(char) > 176:
            return ''
        else:
            return char
Note that this will not work as expected with multi-byte encodings such as UTF-8.
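A quick illustration of why (written in modern byte-oriented Python for the sake of a runnable example): filtering a UTF-8 byte stream one unit at a time throws away the continuation bytes of any non-ASCII character, silently corrupting the text. Decoding with a known encoding is the safer route.

```python
raw = "Caf\u00e9 menu".encode("utf-8")   # b'Caf\xc3\xa9 menu'

# Byte-wise ASCII filter, analogous to onlyascii() above:
filtered = bytes(b for b in raw if 32 <= b <= 126)
print(filtered)   # b'Caf menu' -- the accented character silently vanishes

# Safer: decode with the document's declared encoding instead.
text = raw.decode("utf-8", errors="replace")
print(text)       # Café menu
```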
    ## We can now access our cleaned content as 'cleanedcontent'
    cleanedcontent = cleaner.clean_html(asciihtml)
This will (obviously) parse the HTML into a tree internally, so it's more efficient to pass a parsed tree directly.
    ## Go parse our content
    cleanedcontentstringio = StringIO(cleanedcontent)
    parser = etree.XMLParser(recover=True)
    tree = etree.parse(cleanedcontentstringio, parser)
I wonder why you use an XML parser here. The HTML parser will likely work better, as it knows about self-closing HTML tags.

Stefan
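For reference, Stefan's suggestion looks roughly like this (sample markup invented for illustration): lxml's HTMLParser knows that tags like br are self-closing and that an unclosed p is legal HTML, so the ASCII-stripping and tag-removal steps become largely unnecessary.

```python
from lxml import etree

# Tag soup that would choke the strict XML parser:
soup = "<html><body><p>First<p>Second<br></body></html>"
tree = etree.fromstring(soup, etree.HTMLParser())

# Both paragraphs are recovered as proper elements.
print([p.text for p in tree.findall(".//p")])
```

The resulting tree supports the same .xpath() calls as the XML one, so the rest of the script would not need to change.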
participants (2)
- Mike MacCana
- Stefan Behnel