Mutating an HTML file with BeautifulSoup
Peter Otten
__peter__ at web.de
Mon Aug 22 02:30:00 EDT 2022
On 22/08/2022 05:30, Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 10:04, Buck Evan <buck.2019 at gmail.com> wrote:
>>
>> I've had much success doing round trips through the lxml.html parser.
>>
>> https://lxml.de/lxmlhtml.html
>>
>> I ditched bs for lxml long ago and never regretted it.
>>
>> If you find that you have a bunch of invalid html that lxml inadvertently "fixes", I would recommend adding a stutter-step to your project: perform a noop roundtrip thru lxml on all files. I'd then analyze any diff by progressively excluding changes via `grep -vP`.
>> Unless I'm mistaken, all such changes should fall into no more than a dozen groups.
>>
>
> Will this round-trip mutate every single file and reorder the tag
> attributes? Because I really don't want to manually eyeball all those
> changes.
Most certainly not. Reordering is a bs4 feature that is governed by a
formatter. You can easily prevent attributes from being reordered:
>>> import bs4
>>> soup = bs4.BeautifulSoup("""<div beta="1" alpha="2"/>""")
>>> soup
<html><body><div alpha="2" beta="1"></div></body></html>
>>> class Formatter(bs4.formatter.HTMLFormatter):
...     def attributes(self, tag):
...         return [] if tag.attrs is None else list(tag.attrs.items())
...
>>> soup.decode(formatter=Formatter())
'<html><body><div beta="1" alpha="2"></div></body></html>'
Blank space is probably removed by the underlying html parser.
It might be possible to make bs4 instantiate the lxml.html.HTMLParser
with remove_blank_text=False, but I didn't try hard enough ;)
That said, for my humble html scraping needs I have ditched bs4 in favor
of lxml and its xpath capabilities.