Hello Everyone,
This is my first post to this list, so please excuse me if anything
does not meet the standards.
I use lxml.html.fromstring() to parse HTML.
The parser tries to do its best to make something reasonable,
even when the input is broken. This usually works fine and the parser
does not "invent" elements, i.e. the resulting tree does not contain
elements that were never present in the input.
But inventing elements is exactly what I observe when the input looks
like this:
<html><body> .... </body></html
Note the missing '>' at the end of the input!
Whether "inventing" elements is a bug in case of invalid input
is debatable, but what if the number of elements is nearly doubled?
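As an aside, input of this kind can be spotted before it is handed to the parser. The following is only a rough sketch in plain Python: the helper name ends_with_unterminated_tag is made up for illustration, and the check is a heuristic that assumes the last '<' in the page really starts a tag (a literal '<' in trailing text would fool it).

```python
def ends_with_unterminated_tag(markup):
    """Return True if the last '<' in markup is never closed by a '>'.

    Heuristic only: it assumes the last '<' starts a tag.
    """
    last_open = markup.rfind("<")
    last_close = markup.rfind(">")
    return last_open != -1 and last_open > last_close


print(ends_with_unterminated_tag("<html><body></body></html"))   # True
print(ends_with_unterminated_tag("<html><body></body></html>"))  # False
```

A crawler could use such a check to log or skip suspicious pages; it does nothing about the parser behaviour itself.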
Please consider the following script, which illustrates the effect.
It creates a sequence of <img> elements inside the <body> element
and, after parsing, checks the number of links reported by
iterlinks():
=======================================================================
import sys
import lxml.html
import lxml.etree

# One parser instance is shared by all fromstring() calls below.
parser = lxml.html.HTMLParser()

for imageCount in range(1, 20):
    # Produce a simple HTML document with some <img> elements.
    content = '<html>\n<body>\n%s</body>\n</html>' % (
        '\n'.join(['<img src="verysmall-icon-%d.png" align="right">' % i
                   for i in range(imageCount)])
    )
    # Parse this and assert the number of links found.
    # (This always works.)
    html = lxml.html.fromstring(content, parser=parser)
    imagesFound = len([x for x in html.iterlinks()])
    assert imagesFound == imageCount
    # Now remove the last '>' of the closing '</html>' tag.
    # After some tries, the parser "reuses" some of its
    # parsed tree fragments and appends them to the tree.
    # These fragments may even come from completely different
    # parsed documents.
    content = content[:-1]
    html = lxml.html.fromstring(content, parser=parser)
    imagesFound = len([x for x in html.iterlinks()])
    if imageCount != imagesFound:
        print('Input:\n%s\n%s\n%s' % ('-'*40, content, '-'*40))
        print('FAILURE: found %d img elements when only %d were present'
              % (imagesFound, imageCount))
        break

# Report the versions involved (print() with one argument works on
# both Python 2 and Python 3).
versionFmt = "%-25s %s"
print('')
print(versionFmt % ('Python', sys.version_info))
for vers in (
        'LXML_VERSION',
        'LIBXML_VERSION',
        'LIBXML_COMPILED_VERSION',
        'LIBXSLT_VERSION',
        'LIBXSLT_COMPILED_VERSION',
        ):
    print(versionFmt % (vers, getattr(lxml.etree, vers)))
=======================================================================
On my machine (Ubuntu 12.04) the output is:
=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>
</html
----------------------------------------
FAILURE: found 20 img elements when only 11 were present
Python sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
LXML_VERSION (3, 3, 5, 0)
LIBXML_VERSION (2, 7, 8)
LIBXML_COMPILED_VERSION (2, 7, 8)
LIBXSLT_VERSION (1, 1, 26)
LIBXSLT_COMPILED_VERSION (1, 1, 26)
=======================================================================
On a different machine (Solaris 10 ;-) the output is:
=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>
</html
----------------------------------------
FAILURE: found 20 img elements when only 11 were present
Python sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
LXML_VERSION (2, 3, 5, 0)
LIBXML_VERSION (2, 9, 0)
LIBXML_COMPILED_VERSION (2, 6, 23)
LIBXSLT_VERSION (1, 1, 28)
LIBXSLT_COMPILED_VERSION (1, 1, 24)
=======================================================================
I discovered this behaviour while crawling a web site.
The crawler is multi-threaded, and some of the links reported by
iterlinks() returned 404 when the crawler tried to fetch them.
The cause was that iterlinks() was running on a tree built from
a web page with a missing '>' at the end: the parser had produced
a tree containing a lot of fragments from other parsed pages...
You can imagine what happens then.
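Until the cause is understood, a crawler could guard against such foreign links by cross-checking them against an independent scan of the raw page. The sketch below is only an illustration of that idea, not a fix for the parser. It is written for Python 3 and uses the standard library's html.parser; the names ImgSrcCollector, expected_img_sources and drop_foreign_links are made up for this example, and it only looks at <img src="..."> attributes.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> start tag in the input."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def expected_img_sources(markup):
    """Scan the raw markup independently of lxml and return all img src values."""
    collector = ImgSrcCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


def drop_foreign_links(reported, markup):
    """Keep only those reported URLs that the independent scan also found."""
    known = set(expected_img_sources(markup))
    return [url for url in reported if url in known]
```

With lxml, the URL is the third item of each tuple yielded by iterlinks(), so the list passed as reported could be built as [link for (element, attribute, link, pos) in html.iterlinks()]. Any URL that does not occur in the page being crawled is then dropped before the crawler tries to fetch it.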
Yours,
Elmar.
--
LEO GmbH | Elmar Bartel |
Mühlweg 2b | Phone: +49 (0)8104-90950141 | No signature here.
D-82054 Sauerlach | Fax: +49 (0)8104-90950290 |
Germany | Email: elmar(a)leo.org |
Registration court: Amtsgericht München, HRB161107
Managing directors: Hans Riethmayer, Elmar Bartel