[lxml-dev] whitespace in lxml.html vs. lxml.html.soupparser
We're using CSSSelector to pull out document fragments. I noticed that the fragments from lxml.html.soupparser parses don't have extra whitespace (which is desirable), but fragments from lxml.html parses have trailing whitespace cruft after the closing tag.

For example, with soupparser:

"""<div class="post"><a name="8720086857907265707"/> <p/><div/>Josh Bancroft over at <a href="http://www.tinyscreenfuls.com/">TinyScreenfuls</a> puts together a great <a href="http://www.tinyscreenfuls.com/2008/01/site-statistics-i-care-about-as-a-blogger/">roundup of stats</a> that matter to bloggers with Google Analytics screen shots and meaningful context. The comments are helpful too.<br/><br/>Highly recommended.<br/><br/>Technorati Tags: <a href="http://technorati.com/tag/stats" rel="tag">Stats</a>,<br/><a href="http://technorati.com/tag/bloggers" rel="tag">Bloggers</a>,<br/><a href="http://technorati.com/tag/blogging" rel="tag">Blogging</a><div/> </div>"""

Without soupparser:

"""<div class="post"><a name="8720086857907265707"/> <p/><div/>Josh Bancroft over at <a href="http://www.tinyscreenfuls.com/">TinyScreenfuls</a> puts together a great <a href="http://www.tinyscreenfuls.com/2008/01/site-statistics-i-care-about-as-a-blogger/">roundup of stats</a> that matter to bloggers with Google Analytics screen shots and meaningful context. The comments are helpful too.<br/><br/>Highly recommended.<br/><br/>Technorati Tags: <a href="http://technorati.com/tag/stats" rel="tag">Stats</a>,<br/><a href="http://technorati.com/tag/bloggers" rel="tag">Bloggers</a>,<br/><a href="http://technorati.com/tag/blogging" rel="tag">Blogging</a><div/> </div> """

Is there a way to get the same output without soupparser as with it? I'd hate to resort to post-processing the parses unnecessarily with regexps or such.

thanks,
-Ian
Ian Kallen