[lxml-dev] ElementSoup doesn't work as in doc/elementsoup.txt
Hello.

I'm learning ElementSoup, but it doesn't work the way it's supposed to.
I tried the sample code in doc/elementsoup.txt, but it failed with an error.

---------------------------------------------------------------------
tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'
from lxml.html.ElementSoup import parse
from StringIO import StringIO
root = parse(StringIO(tag_soup))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 19, in parse
    root = _convert_tree(tree, makeelement)
  File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 40, in _convert_tree
    attrib=dict(beautiful_soup_tree.attrs))
  File "parser.pxi", line 702, in etree._BaseParser.makeelement
  File "apihelpers.pxi", line 102, in etree._makeElement
  File "apihelpers.pxi", line 798, in etree._tagValidOrRaise
ValueError: Invalid tag name u'[document]'
---------------------------------------------------------------------

I'm using
Python 2.5
lxml-2.0alpha3
BeautifulSoup 3.0.4

Any clues?
Hi,

js wrote:
> I'm learning ElementSoup, but it doesn't work the way it's supposed to.
> I tried the sample code in doc/elementsoup.txt, but it failed with an error.
>
> ---------------------------------------------------------------------
> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'
> from lxml.html.ElementSoup import parse
> from StringIO import StringIO
> root = parse(StringIO(tag_soup))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 19, in parse
>     root = _convert_tree(tree, makeelement)
>   File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 40, in _convert_tree
>     attrib=dict(beautiful_soup_tree.attrs))
>   File "parser.pxi", line 702, in etree._BaseParser.makeelement
>   File "apihelpers.pxi", line 102, in etree._makeElement
>   File "apihelpers.pxi", line 798, in etree._tagValidOrRaise
> ValueError: Invalid tag name u'[document]'
> ---------------------------------------------------------------------
That's because of the tag name validation. Evidently, "[document]" (which is
returned by BeautifulSoup) isn't a valid tag name. Sadly, the doctest above
was not yet included in the test suite.

However, the behaviour will change in alpha 4. lxml will then reject tag names
only if they contain spaces or XML special characters. See this recent thread,
which also has a patch:

http://comments.gmane.org/gmane.comp.python.lxml.devel/3003?set_lines=100000

Sorry for the inconvenience, but don't forget that this is alpha software.
Things might not always work as expected or might change unexpectedly
(although we try to keep these changes as rare as possible).

Stefan
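
For illustration only, the relaxed rule described in the reply (reject a tag name only if it contains spaces or XML special characters) could be sketched in plain Python as below. This is not lxml's actual code: the function name `tag_name_is_acceptable` is hypothetical, and which characters count as "XML special" is an assumption (here `<`, `>`, `&` and the two quote characters, plus any whitespace).

```python
import re

# Assumption: "XML special characters" is taken to mean < > & " ' here,
# and "spaces" is taken to mean any whitespace. The set that lxml itself
# checks may differ.
_REJECTED_CHARS = re.compile(r'[\s<>&"\']')


def tag_name_is_acceptable(name):
    """Return True if the name contains no whitespace or XML special characters."""
    return _REJECTED_CHARS.search(name) is None


# Under this sketch, BeautifulSoup's root name "[document]" is accepted,
# while names containing a space or an XML special character are not.
for candidate in ("[document]", "body", "bad tag", "a<b"):
    print(candidate, tag_name_is_acceptable(candidate))
```

Under such a rule, the "[document]" name from the traceback above would no longer trigger the ValueError, which matches the change announced for alpha 4.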
participants (2)
- js
- Stefan Behnel