Mutating an HTML file with BeautifulSoup

David bouncingcats at gmail.com
Fri Aug 19 20:02:30 EDT 2022


On Sat, 20 Aug 2022 at 04:31, Chris Angelico <rosuav at gmail.com> wrote:

> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?

> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.

> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?

On Sat, 20 Aug 2022 at 07:02, Chris Angelico <rosuav at gmail.com> wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:

> > I recall that in bs4 it parses into an object tree and loses the detail
> > of the input.  I recently ported from very old bs to bs4 and hit the
> > same issue.  So no it will not output the same as went in.

> So I'm left with a few options:

> 1) Give up on validation, give up on verification, and just run this
>    thing on the production site with my fingers crossed

> 2) Instead of doing an intelligent reconstruction, just str.replace() one
>    URL with another within the file

> 3) Split the file into lines, find the Nth line (elem.sourceline) and
>    str.replace that line only

> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start of
>    the tag, manually find the end, and replace one tag with the
>    reconstructed form.

> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...

Hi, I don't know if you will like this option, but I don't see it on the
list yet so ...

I'm assuming that the phrase "with minimal diffs so they're easy to
validate" means that the diffs will be eyeballed by a human.

Have you considered two passes through BS? Do the first pass with no
modification, so that the intermediate result picks up the BS default
"spurious" changes.

Then do the second pass, with the desired changes, on that intermediate
result. A diff between the intermediate and final results will then show
the human only the desired changes.
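
A minimal sketch of the two-pass idea, in case it helps. It assumes the
desired change is swapping one href URL for another (as in your option 2)
and that the stdlib "html.parser" backend is acceptable. The function names
and the example URLs are made up for illustration and are not part of bs4.

```python
import difflib

from bs4 import BeautifulSoup

PARSER = "html.parser"  # assumption: any bs4 parser works, if used for both passes


def normalize(html: str) -> str:
    """Pass 1: parse and re-serialize without touching the tree."""
    return str(BeautifulSoup(html, PARSER))


def apply_change(html: str, old_url: str, new_url: str) -> str:
    """Pass 2: parse, rewrite matching <a href=...> values, re-serialize."""
    soup = BeautifulSoup(html, PARSER)
    for link in soup.find_all("a", href=old_url):
        link["href"] = new_url
    return str(soup)


def two_pass_diff(original: str, old_url: str, new_url: str):
    """Return (baseline, changed, diff_text).

    The diff compares the pass-1 output with the pass-2 output, so the
    parser's own canonicalizations do not show up in it.
    """
    baseline = normalize(original)
    changed = apply_change(baseline, old_url, new_url)
    diff = difflib.unified_diff(
        baseline.splitlines(keepends=True),
        changed.splitlines(keepends=True),
        fromfile="normalized",
        tofile="changed",
    )
    return baseline, changed, "".join(diff)


if __name__ == "__main__":
    # Hypothetical input; the URLs are placeholders.
    sample = '<p>\n<a href="http://old.example/">x</a>\n<br>\n</p>\n'
    _, _, diff_text = two_pass_diff(
        sample, "http://old.example/", "http://new.example/"
    )
    print(diff_text)
```

The pass-1 output could be committed on its own first, so that the
whitespace and attribute churn lands in one mechanical commit and the
second commit carries only the URL change.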

