Mutating an HTML file with BeautifulSoup

Sat Aug 20 19:46:24 EDT 2022

On 20/08/2022 12.38, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 10:19, dn <PythonList at danceswithmice.info> wrote:
>> On 20/08/2022 09.01, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:
>>>>> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:
>>>>>
>>>>> What's the best way to precisely reconstruct an HTML file after
>>>>> parsing it with BeautifulSoup?
...

>>> well. Thanks for trying, anyhow.
>>>
>>> So I'm left with a few options:
>>>
>>> 1) Give up on validation, give up on verification, and just run this
>>> thing on the production site with my fingers crossed
>>> 2) Instead of doing an intelligent reconstruction, just str.replace()
>>> one URL with another within the file
>>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
>>> str.replace that line only
>>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
>>> of the tag, manually find the end, and replace one tag with the
>>> reconstructed form.
>>>
>>> I'm inclined to the first option, honestly. The others just seem like
>>> hard work, and I became a programmer so I could be lazy...
>> +1 - but I've noticed that sometimes I have to work quite hard to be
>> this lazy!
> 
> Yeah, that's very true...
> 
>> Am assuming that http -> https is not the only 'change' (if it were,
>> you'd just do that without BS). How many such changes are planned/need
>> checking? Care to list them?

This project has many of the same 'smells' as a database-harmonisation
effort. Particularly one where 'the previous guy' used to use field-X
for certain data, but his replacement decided that field-Y 'sounded
better' (or some such user-logic). Arrrggghhhh!

If you like head-aches, and users coming to you with ifs-buts-and-maybes
AFTER you've 'done stuff', this is your sort of project!

> Assumption is correct. The changes are more of the form "find all the
> problems, add to the list of fixes, try to minimize the ones that need
> to be done manually". So far, what I have is:

Having taken the trouble to identify this list of improvements and given
the determination to verify each, consider working through one item at a
time, rather than in a single pass. This will enable individual logging
of changes, a manual check of each alteration, and the ability to
choose/tailor the best tool for that specific task.
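
As a rough sketch of the one-item-at-a-time idea: the names here
(apply_one_fix, the "linkfix" logger) are my own inventions, and the
regex is a crude stand-in for whatever actually locates the links, not
a suggestion to parse HTML this way.

```python
import logging
import re

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("linkfix")  # hypothetical logger name


def apply_one_fix(name, fix, html):
    """Apply ONE fix function to every href value, logging each change.

    `fix` takes a URL string and returns the (possibly unchanged) URL.
    Returns the new text plus a list of (old, new) pairs for review.
    """
    changes = []

    def _sub(match):
        old = match.group(2)
        new = fix(old)
        if new != old:
            changes.append((old, new))
            log.info("%s: %s -> %s", name, old, new)
        return match.group(1) + new + match.group(3)

    # Illustration only: matches double-quoted href attributes and
    # nothing else. Real pages will need a real parser.
    result = re.sub(r'(href=")([^"]*)(")', _sub, html)
    return result, changes
```

Run it once per fix, eyeball the log (or the returned list), then move
on to the next item.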

In fact, depending upon frequency, consider making the changes manually
(and with improved confidence in the result).

The presence of (or allusion to) the word "some" in these list-items is
'the killer'. Automation doesn't like 'some' (cf "all") unless the
criteria can be clearly and unambiguously defined. Ouch!

(I don't think you need to be told any of this, but hey: dreams are free!)

> 1) A bunch of http -> https, but not all of them - only domains where
> I've confirmed that it's valid

The search-criterion is the list of valid domains, rather than the
"http/https" which is likely the first focus.

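To illustrate searching on the domain rather than the scheme, a minimal
sketch using only the standard library. HTTPS_OK and upgrade_to_https
are hypothetical names, and the single entry in the set is purely
illustrative; the real list is whatever has been confirmed by hand.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: a hand-maintained set of hosts confirmed to work over HTTPS.
HTTPS_OK = {"www.gsarchive.net"}


def upgrade_to_https(url, allowed=HTTPS_OK):
    """Switch http -> https, but only for hosts on the allow-list."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in allowed:
        return urlunsplit(parts._replace(scheme="https"))
    return url  # anything else is left exactly as found
```

A host that is not on the list (eg the gasdisc one, which can't go on
HTTPS) simply passes through untouched.
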
> 2) Some absolute to relative conversions:
> https://www.gsarchive.net/whowaswho/index.htm should be referred to as
> /whowaswho/index.htm instead

Similarly, if you have a list of these.
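
Something like the following might do for the absolute-to-relative
case. It is only a sketch: OWN_HOSTS and make_site_relative are names I
made up, and treating both http and https links to the site's own host
as candidates is my assumption, not something stated above.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the host name(s) under which the site itself is served.
OWN_HOSTS = {"www.gsarchive.net"}


def make_site_relative(url, own_hosts=OWN_HOSTS):
    """Drop scheme and host from links that point back at our own site."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and parts.hostname in own_hosts:
        # Keep path, query and fragment; an empty path becomes "/".
        return urlunsplit(("", "", parts.path or "/", parts.query, parts.fragment))
    return url
```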

> 3) A few outdated URLs for which we know the replacement, eg
> http://www.cris.com/~oakapple/gasdisc/<anything> to
> http://www.gasdisc.oakapplepress.com/<anything> (this one can't go on
> HTTPS, which is one reason I can't shortcut that)

Again.
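
For instance, a plain prefix table would cover the example quoted
above. PREFIX_MAP and replace_known_prefix are hypothetical names; the
one entry is taken from the quoted example and the rest of the table
would be filled in by hand.

```python
# Old prefix -> replacement prefix. Only the entry from the thread is
# shown; further entries are for whoever knows the replacements.
PREFIX_MAP = {
    "http://www.cris.com/~oakapple/gasdisc/": "http://www.gasdisc.oakapplepress.com/",
}


def replace_known_prefix(url, prefix_map=PREFIX_MAP):
    """Swap a known outdated prefix for its replacement, keeping the tail."""
    for old, new in prefix_map.items():
        if url.startswith(old):
            return new + url[len(old):]
    return url
```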

> 4) Some internal broken links where the path is wrong - anything that
> resolves to /books/<anything> but can't be found might be better
> rewritten as /html/perf_grps/websites/<anything> if the file can be
> found there

Again.
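
Here the criterion includes a look at the file-system, so a sketch
might be as follows. It assumes the link has already been resolved to a
site-absolute path, and that the site's files live under some local
directory (docroot); relocate_if_missing is a hypothetical name.

```python
from pathlib import Path


def relocate_if_missing(path, docroot,
                        old_prefix="/books/",
                        new_prefix="/html/perf_grps/websites/"):
    """Rewrite /books/<x> to /html/perf_grps/websites/<x>, but only when
    the original is missing AND the alternative exists under docroot."""
    if not path.startswith(old_prefix):
        return path
    root = Path(docroot)
    if (root / path.lstrip("/")).exists():
        return path  # link is fine as it stands
    candidate = new_prefix + path[len(old_prefix):]
    if (root / candidate.lstrip("/")).exists():
        return candidate
    return path  # still broken: leave it for the manual list
```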

> 5) Any external link that yields a permanent redirect should, to save
> clientside requests, get replaced by the destination. We have some
> Creative Commons badges that have moved to new URLs.

Do you have these as a list, or are you intending the automated-method
to auto-magically follow the link to determine any need for action?
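
If it is the latter, the checking could look roughly like this. It is a
sketch under assumptions: head and permanent_target are my own names,
only 301 and 308 are treated as 'permanent', and a server that
mishandles HEAD requests would need different treatment.

```python
import http.client
from urllib.parse import urljoin, urlsplit

PERMANENT = {301, 308}  # "Moved Permanently" and "Permanent Redirect"


def head(url, timeout=10):
    """One HEAD request, no redirect-following: (status, Location header)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def permanent_target(url, fetch=head, max_hops=5):
    """Follow permanent redirects only; return where the link ends up."""
    current = url
    for _ in range(max_hops):
        status, location = fetch(current)
        if status not in PERMANENT or not location:
            return current
        current = urljoin(current, location)  # Location may be relative
    return url  # too many hops: leave the link alone for manual review
```

The fetch parameter is there so that the logic can be tried against a
canned table of responses before it is let loose on the network.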

> And there'll be other fixes to be done too. So it's a bit complicated,
> and no simple solution is really sufficient. At the very very least, I
> *need* to properly parse with BS4; the only question is whether I
> reconstruct from the parse tree, or go back to the raw file and try to
> edit it there.

At least the diffs would give you something to work-from, but it's a bit
like git-diffs claiming a 'change' when the only difference is that my
IDE strips blanks from the ends of code-lines, or some-such silliness.
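
For the trailing-blanks kind of noise specifically, a diff can be told
to ignore it. A minimal sketch with the standard library's difflib
(meaningful_diff is a hypothetical name, and this handles trailing
whitespace only, not attribute re-ordering or re-quoting by the parser):

```python
import difflib


def meaningful_diff(before, after):
    """Unified diff of two texts, ignoring trailing whitespace on lines."""
    a = [line.rstrip() for line in before.splitlines()]
    b = [line.rstrip() for line in after.splitlines()]
    return list(difflib.unified_diff(a, b, "before", "after", lineterm=""))
```

An empty result means the two versions differ, at most, in trailing
blanks.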

Which brings me to ask: why "*need* to properly parse with BS4"?

What about selective use of tools, previously-mentioned in this thread?

Is Selenium worthy of consideration?

I'm assuming you've already been using a link-checker utility to locate
the links which need to be changed. It can be used in QA-mode
after-the-fact too.

> For the record, I have very long-term plans to migrate parts of the
> site to Markdown, which would make a lot of things easier. But for
> now, I need to fix the existing problems in the existing HTML files,
> without doing gigantic wholesale layout changes.

...and there's another option. If the Markdown conversion is done first,
it will rule out any use of diffs completely. However, it will
introduce a veritable cornucopia of opportunity for this and 'other
stuff' to go-wrong, bringing us back to a page-by-page check or
broad-checks only, and an appeal to readers to report problems.

The (PM-oriented) observation is that if you are baulking at the amount
of work 'now', you'll be equally dismayed by the consequences of a
subsequent 'Markdown project'!

Perhaps, therefore, some counter-intuitive logic applies, eg combining
the two/biting two bullets/recognising that many of the risks and
likelihoods of error overlap (rather than add/multiply).

'Bit rot' is so common in today's world, do readers treat such
pages/sites particularly differently?

Somewhat conversely, even in our 'release-often, break-early' world, do
users often exert themselves to provide constructive feedback, eg 'link
broken'?

-- 
Regards,
=dn