Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Sat Aug 20 20:43:50 EDT 2022


On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
<python-list at python.org> wrote:
>
> On 2022-08-20, Chris Angelico <rosuav at gmail.com> wrote:
> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> >> 2QdxY4RzWzUUiLuE at potatochowder.com writes:
> >> >textual representations.  That way, the following two elements are the
> >> >same (and similar with a collection of sub-elements in a different order
> >> >in another document):
> >>
> >>   The /elements/ differ. They have the /same/ infoset.
> >
> > That's the bit that's hard to prove.
> >
> >>   The OP could edit the files with regexps to create a new version.
> >
> > To you and Jon, who also suggested this: how would that be beneficial?
> > With Beautiful Soup, I have the line number and position within the
> > line where the tag starts; what does a regex give me that I don't have
> > that way?
>
> You mean you could use BeautifulSoup to read the file and identify the
> bits you want to change by line number and offset, and then you could
> use that data to try and update the file, hoping like hell that your
> definition of "line" and "offset" are identical to BeautifulSoup's
> and that you don't mess up later changes when you do earlier ones (you
> could do them in reverse order of line and offset I suppose) and
> probably resorting to regexps anyway in order to find the part of the
> tag you want to change ...
>
> ... or you could avoid all that faff and just do re.sub()?

Stefan answered in part, but I'll add that it is far, FAR easier to do
the analysis with BS4 than with regular expressions. I'm not sure what
"hoping like hell" is supposed to mean here, since the line and offset
have been 100% accurate in my experience; the only part I'm unsure
about is where the _end_ of the tag is (and maybe there's a way I can
use BS4 again to get that?).
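
On the open question of where the tag ends, one possible approach (a
sketch, not something confirmed in this thread) is to drop down to the
standard library's html.parser.HTMLParser, which BS4's "html.parser"
builder is built on. Its getpos() reports the line and offset where the
current tag starts, and get_starttag_text() returns the raw text of that
start tag, so the end is simply the start plus the length of that text.
The names TagSpanFinder and rewrite_start_tags, the sample markup and the
rel="nofollow" edit are all made up for illustration. The sketch assumes
the whole document is fed in a single call, and it counts only "\n" as a
line break because that is what HTMLParser does.

```python
from html.parser import HTMLParser


class TagSpanFinder(HTMLParser):
    """Collect (start, end) character offsets of start tags named `target`.

    Hypothetical helper for illustration; assumes feed() is called once
    with the complete document.
    """

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.spans = []
        self._line_starts = [0]

    def feed(self, data):
        # HTMLParser treats only "\n" as a line break, so build the
        # line-start table the same way (str.splitlines() would disagree
        # on "\r", form feeds and similar characters).
        self._line_starts = [0]
        for i, ch in enumerate(data):
            if ch == "\n":
                self._line_starts.append(i + 1)
        super().feed(data)

    def handle_starttag(self, tag, attrs):
        if tag != self.target:
            return
        line, col = self.getpos()          # line is 1-based, col is 0-based
        start = self._line_starts[line - 1] + col
        raw = self.get_starttag_text()     # exact source text of the tag
        self.spans.append((start, start + len(raw)))


def rewrite_start_tags(html, target, edit):
    """Replace each `target` start tag with edit(raw_tag_text)."""
    finder = TagSpanFinder(target)
    finder.feed(html)
    finder.close()
    # Work from the end of the file backwards so that earlier offsets
    # are not shifted by later replacements.
    for start, end in sorted(finder.spans, reverse=True):
        html = html[:start] + edit(html[start:end]) + html[end:]
    return html


sample = (
    '<p>one</p>\n'
    '<p>see <a href="old.html"\n'
    '   class="x">this</a> and <a href="b.html">that</a></p>\n'
)

# Illustrative edit: add an attribute just before the closing ">".
updated = rewrite_start_tags(sample, "a", lambda raw: raw[:-1] + ' rel="nofollow">')
print(updated)
```

Applying the replacements in reverse order is the trick Jon mentions
above; everything outside the matched start tags is copied through
character for character, including the line break inside the first tag.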

ChrisA

