Mutating an HTML file with BeautifulSoup

Sun Aug 21 07:24:16 EDT 2022

On 2022-08-21, Chris Angelico <rosuav at gmail.com> wrote:
> On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> <python-list at python.org> wrote:
>> On 2022-08-20, Chris Angelico <rosuav at gmail.com> wrote:
>> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>> >> 2QdxY4RzWzUUiLuE at potatochowder.com writes:
>> >> >textual representations.  That way, the following two elements are the
>> >> >same (and similar with a collection of sub-elements in a different order
>> >> >in another document):
>> >>
>> >>   The /elements/ differ. They have the /same/ infoset.
>> >
>> > That's the bit that's hard to prove.
>> >
>> >>   The OP could edit the files with regexps to create a new version.
>> >
>> > To you and Jon, who also suggested this: how would that be beneficial?
>> > With Beautiful Soup, I have the line number and position within the
>> > line where the tag starts; what does a regex give me that I don't have
>> > that way?
>>
>> You mean you could use BeautifulSoup to read the file and identify the
>> bits you want to change by line number and offset, and then you could
>> use that data to try and update the file, hoping like hell that your
>> definitions of "line" and "offset" are identical to BeautifulSoup's
>> and that you don't mess up later changes when you do earlier ones (you
>> could do them in reverse order of line and offset I suppose) and
>> probably resorting to regexps anyway in order to find the part of the
>> tag you want to change ...
>>
>> ... or you could avoid all that faff and just do re.sub()?
>
> Stefan answered in part, but I'll add that it is far FAR easier to do
> the analysis with BS4 than regular expressions. I'm not sure what
> "hoping like hell" is supposed to mean here, since the line and offset
> have been 100% accurate in my experience;

Given the string:

    b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"

what are the line number and offset of the question mark - and does
BeautifulSoup agree with your answer? Does the answer to that second
question change depending on which parser you tell BeautifulSoup to use?

(If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
I am happy with the program throwing an exception" then feel free to
remove that substring from the question.)
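
To make the ambiguity concrete, here is a rough sketch using only the
standard library (it deliberately does not call BeautifulSoup, so it
says nothing about what any particular parser reports). It takes the
string above with the surrogate bytes removed and locates the question
mark under a few plausible definitions of "line". The variable and
function names are just for illustration.

```python
import re

# The sample string with the \xed\xa0\x80\xed\xbc\x9f substring removed,
# so that a strict UTF-8 decode succeeds.
data = b"\n \r\r\n\v\n\r\xcc\x80e\xc3\xa8?"
text = data.decode("utf-8")


def position(lines, needle):
    """Return (1-based line number, 0-based offset) of needle in the last line."""
    return len(lines), lines[-1].index(needle)


# Definition 1: only "\n" ends a line; offsets counted in code points.
lf_only = position(text.split("\n"), "?")

# Definition 2: same line breaks, but offsets counted in bytes.
lf_only_bytes = position(data.split(b"\n"), b"?")

# Definition 3: "\r\n", "\r" and "\n" each end a line (universal newlines).
universal = position(re.split(r"\r\n|\r|\n", text), "?")

# Definition 4: str.splitlines(), which also treats "\v" as a line break.
unicode_aware = position(text.splitlines(), "?")

print(lf_only)        # (4, 4)
print(lf_only_bytes)  # (4, 6)
print(universal)      # (6, 3)
print(unicode_aware)  # (7, 3)
```

That is four different answers from four reasonable-looking definitions,
before even asking whether the combining character U+0300 should count
as an offset position of its own.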

> the only part I'm unsure about is where the _end_ of the tag is (and
> maybe there's a way I can use BS4 again to get that??).

There doesn't seem to be. More to the point, there doesn't seem to be
a way to find out where the *attributes* are, so, as I said, you'll most
likely end up using regexps anyway.
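
For what it's worth, a minimal sketch of the re.sub() approach might
look like the following. The sample HTML, the pattern and the upgrade()
helper are all made up for illustration; the pattern assumes quoted
attribute values and no ">" inside the tag before the href, so treat it
as a starting point and not something robust against arbitrary markup.

```python
import re

# Hypothetical input: odd spacing, mixed case and mixed quoting styles
# that a parse-and-reserialise round trip would be likely to normalise.
html = (
    '<p>See <a class="x"  href="http://example.com/a">this</a>\n'
    " and <A HREF='http://example.com/b'>that</A>.</p>"
)

# Capture everything up to the opening quote of the href value, the
# quote character itself, and the value, so that only the value changes.
pattern = re.compile(
    r"""(<a\b[^>]*?\bhref\s*=\s*)(["'])(.*?)\2""",
    re.IGNORECASE | re.DOTALL,
)


def upgrade(match):
    """Rewrite the URL scheme, leaving every other character untouched."""
    prefix, quote, url = match.groups()
    return prefix + quote + url.replace("http://", "https://", 1) + quote


result = pattern.sub(upgrade, html)
print(result)
```

Everything outside the matched attribute value comes through
byte-for-byte, which is the point: no line numbers or offsets to keep
in sync, and no reformatting of the rest of the file.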