[lxml-dev] new ElementSoup module in lxml.html
Hi,

I rewrote Fredrik's ElementSoup.py module for lxml.html so that you can now have lxml read in tag soup with BeautifulSoup and convert it into an lxml.html tree of Elements. While libxml2 can also parse broken HTML, it is not made to parse sick soup of tags, so if you need to work with web pages that sort of look like they might have been HTML once, the lxml.html.ElementSoup module can help you get there.

http://codespeak.net/svn/lxml/branch/html/doc/elementsoup.txt
http://codespeak.net/svn/lxml/branch/html/src/lxml/html/ElementSoup.py

Have fun,
Stefan
Hi Stefan,

I hadn't tried to use the lxml.html module before, but it doesn't seem to be in trunk (only in branch). So I guess this means it can only be installed from source? (eggs are only made from the trunk?)

In which case, does your elementsoup.py really need lxml.html? I noticed elementsoup.py only uses "makeelement" from lxml.html.html_parser. Can I get away with using anything from the trunk instead?

cheers,
-Roger
Hmm, there isn't currently a trunk release either, but it /should/ also work with 1.3.2. Just take ElementSoup.py and pass your own "makeelement" function to parse(). Try etree.Element for starters.

Stefan
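To illustrate the "pass your own makeelement" idea, here is a minimal sketch of a tolerant parser that builds its tree through a caller-supplied element factory. It is not the ElementSoup.py code: it uses only the standard library (html.parser and xml.etree.ElementTree) as a stand-in so that it runs without lxml or BeautifulSoup, and the names parse_soup, _SoupBuilder and VOID_TAGS are hypothetical. The only assumption about the factory is that it can be called as makeelement(tag, attrib), which etree.Element supports.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never have content, so they are not pushed on the open-tag stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _SoupBuilder(HTMLParser):
    """Hypothetical helper: builds a tree using a caller-supplied factory."""

    def __init__(self, makeelement):
        super().__init__()
        self._make = makeelement
        self.root = makeelement("html", {})   # synthetic root element
        self._stack = [self.root]             # currently open elements
        self._last = None                     # element whose tail gets text

    def handle_starttag(self, tag, attrs):
        elem = self._make(tag, {k: v or "" for k, v in attrs})
        self._stack[-1].append(elem)
        if tag in VOID_TAGS:
            self._last = elem                 # following text is its tail
        else:
            self._stack.append(elem)
            self._last = None                 # following text is its .text

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                self._last = self._stack[i]
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._last is not None:
            self._last.tail = (self._last.tail or "") + data
        else:
            top = self._stack[-1]
            top.text = (top.text or "") + data


def parse_soup(markup, makeelement=None):
    """Parse sloppy markup into a tree; elements come from makeelement."""
    if makeelement is None:
        makeelement = ET.Element              # default factory
    builder = _SoupBuilder(makeelement)
    builder.feed(markup)
    builder.close()
    return builder.root


# Unclosed and mismatched tags still yield a tree.
tree = parse_soup("<p>Hello <b>world<p>again</i>", makeelement=ET.Element)
```

With lxml installed, the same pattern should apply by passing lxml's etree.Element as the factory, which is what the reply above suggests trying with ElementSoup.py's parse().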
participants (2): Roger Patterson, Stefan Behnel