Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Sun Aug 21 00:25:23 EDT 2022


On Sun, 21 Aug 2022 at 13:41, dn <PythonList at danceswithmice.info> wrote:
>
> On 21/08/2022 13.00, Chris Angelico wrote:
> > Well, I don't like headaches, but I do appreciate what the G&S Archive
> > has given me over the years, so I'm taking this on as a means of
> > giving back to the community.
>
> This point will be picked-up in the conclusion. NB in the same way that
> you want to 'give back', so also do others - even if in minor ways or
> 'when-relevant'!

Very true.

> >> In fact, depending upon frequency, making the changes manually (and with
> >> improved confidence in the result).
> >
> > Unfortunately the frequency is very high.
>
> Screechingly so? Like you're singing Three Little Maids?

You don't want to hear me singing that.... although I do recall once
singing Lady Ella's part at a Qwert, to gales of laughter.

> > Yeah. I do a first pass to enumerate all domains that are ever linked
> > to with http:// URLs, and then I have a script that goes through and
> > checks to see if they redirect me to the same URL on the other
> > protocol, or other ways of checking. So yes, the list of valid domains
> > is part of the program's effective input.
>
> Wow! Having got that far, you have achieved data-validity. Is there a
> need to perform a before-after check or diff?

Yes, to ensure that nothing has changed that I *didn't* plan. The
planned changes aren't the problem here, I can verify those elsewhere.
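
A minimal sketch of such a before/after check, assuming the only planned change is an http:// to https:// scheme swap and that both versions are available as strings. The function names are hypothetical, and this compares raw lines with the stdlib difflib rather than anything BS4-specific:

```python
import difflib


def _ignore_scheme(line: str) -> str:
    # Treat http:// and https:// as the same thing for comparison purposes.
    return line.replace("https://", "http://")


def unplanned_changes(before: str, after: str) -> list[str]:
    """Return diff-style lines for changes NOT explained by a scheme swap."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = old_lines[i1:i2], new_lines[j1:j2]
        if len(old) == len(new):
            # Same number of lines: keep only the pairs that still differ
            # once the scheme is ignored.
            pairs = [(o, n) for o, n in zip(old, new)
                     if _ignore_scheme(o) != _ignore_scheme(n)]
            old = [o for o, _ in pairs]
            new = [n for _, n in pairs]
        report.extend(f"- {line}" for line in old)
        report.extend(f"+ {line}" for line in new)
    return report
```

An empty result means every difference was a scheme change; anything else is a line worth reading by eye.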

> Perhaps start making the one-for-one replacements without further
> anxiety. As long as there's no silly-mistake, eg failing to remove an
> opening or closing angle-bracket; isn't that about all the checking needed?
> (for this category of updates)

Maybe, but probably not.
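
For reference, the one-for-one replacement under discussion might be sketched like this, working on the raw file text. The regular expression and the function names are illustrative assumptions, and the set of verified domains is taken as given (it comes from the separate redirect check). Note that this simple pattern matches http:// URLs anywhere in the text, not only inside href attributes:

```python
import re
from urllib.parse import urlsplit

# Rough pattern for an http:// URL: stops at whitespace, quotes, or angle brackets.
HTTP_URL = re.compile(r'http://[^\s"\'<>]+')


def linked_http_domains(html: str) -> set[str]:
    """First pass: collect every domain that is linked to over plain http."""
    hosts = (urlsplit(url).hostname for url in HTTP_URL.findall(html))
    return {host for host in hosts if host}


def upgrade_links(html: str, verified: set[str]) -> str:
    """Rewrite http:// to https:// only for domains known to support it."""
    def swap(match: re.Match) -> str:
        url = match.group(0)
        if urlsplit(url).hostname in verified:
            return "https://" + url[len("http://"):]
        return url  # leave unverified domains untouched
    return HTTP_URL.sub(swap, html)
```

Because only the scheme is rewritten, everything else in the file (whitespace, attribute order, tag style) stays byte-for-byte as it was.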

> BTW in talk of "line-number", you will have realised the need to re-run
> the identification of such after each of these steps - in case the 'new
> stuff' relating to earlier steps (assuming above became also a temporal
> sequence) is shorter/longer than the current HTML.

Yep, that's not usually a problem.
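
One common way to sidestep shifting positions when editing raw text, sketched here as an assumption rather than a description of the actual script: collect all edits as character offsets first, then apply them from the end of the file backwards, so earlier offsets stay valid. The helper name is hypothetical and the edits are assumed not to overlap:

```python
def apply_edits(text: str, edits: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, replacement) edits given as offsets into `text`.

    Edits are applied last-to-first, so a replacement that is longer or
    shorter than the original never disturbs the offsets of edits that
    come earlier in the file.
    """
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text
```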

> >>> And there'll be other fixes to be done too. So it's a bit complicated,
> >>> and no simple solution is really sufficient. At the very very least, I
> >>> *need* to properly parse with BS4; the only question is whether I
> >>> reconstruct from the parse tree, or go back to the raw file and try to
> >>> edit it there.
> >>
> >> At least the diffs would give you something to work-from, but it's a bit
> >> like git-diffs claiming a 'change' when the only difference is that my
> >> IDE strips blanks from the ends of code-lines, or some-such silliness.
> >
> > Right; and the reconstructed version has a LOT of those unnecessary
> > changes. I'm seeing a lot of changes to whitespace. The only problem
> > is whether I can be confident that none of those changes could ever
> > matter.
>
> "White-space" has lesser-meaning in HTML - this is NOT Python! In HTML
> if I write "HTML  file" (with two spaces), the browser will shorten the
> display to a single space (hence some uses of &nbsp; - non-breaking
> space). Similarly, if I attempt to use "\n" to start a new line of text...

Yes, whitespace has less meaning... except when it doesn't.

https://developer.mozilla.org/en-US/docs/Web/CSS/white-space

Text can become preformatted by the styling, and there could be
nothing whatsoever in the HTML page that shows this. I think most of
the HTML files in this site have been created by a WYSIWYG editor,
partly because of clues like a single bold space in a non-bold
sequence of text, and the styles aren't consistent everywhere. Given
that poetry comes up a lot on this site, I wouldn't put it past the
editor to have set a whitespace rule on something.

But I'm probably going to just ignore that and hope that any such
errors are less significant than the current set of broken links.

> Is there a danger of 'chasing your own tail', ie seeking a solution to a
> problem which really doesn't matter (particularly if we add the phrase:
> at the user-level)?

Unfortunately not. I now know of three categories of change that, in
theory, shouldn't affect anything: whitespace, order of attributes
("<a id=... href=...>" becoming "<a href=... id=...>"), and
self-closing tags. Whitespace probably won't matter, until it does.
Order of attributes is absolutely fine.... unless one of them is
miswritten and now we've lost a lot of information about how it ought
to have been written. And self-closing tags are probably
insignificant, but I don't know how browsers handle things like
"<div><p>...<p/></div>" - and I wouldn't know whether the original
intention was for the second one to be a self-closing empty paragraph,
or a miswritten closing tag.

It's easy to say that these changes have no effect on well-formed
HTML. It's less easy to know what browsers will do with ill-formed
HTML.
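
To separate those three "shouldn't matter" categories from everything else in a diff, one option is to compare a normalized event stream instead of the raw text. The following is a coarse sketch using the stdlib html.parser (swapped in for BS4 so the example is self-contained); the class and function names are made up. It deliberately treats whitespace runs, attribute order, and "<br>" versus "<br/>" as equal, so it says nothing about what a browser does with ill-formed markup:

```python
from html.parser import HTMLParser


class _Events(HTMLParser):
    """Record tags and text in a form that ignores cosmetic differences."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes so that their order in the source is irrelevant.
        normalized = tuple(sorted((name, value or "") for name, value in attrs))
        self.events.append(("start", tag, normalized))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Record <br/> exactly like <br>: a start tag with no end event.
        self.handle_starttag(tag, attrs)

    def handle_data(self, data):
        # Collapse whitespace runs; drop text that is whitespace only.
        text = " ".join(data.split())
        if text:
            self.events.append(("text", text))


def structure(html: str) -> list:
    parser = _Events()
    parser.feed(html)
    parser.close()  # flush any trailing text
    return parser.events


def same_structure(a: str, b: str) -> bool:
    return structure(a) == structure(b)
```

Under this comparison "<p>x<p/>" and "<p>x</p>" still count as different, which is the ambiguous case described above: the tool can flag it, but cannot say which one the author meant.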

> Agree with "properly parse". Question was an apparent dedication to BS4
> when there are other tools. Just checking you aren't wearing that type
> of 'blinders'.
> (didn't think so, but...)

No, but there's also always the option of some tool that I've never
heard of! The single most obvious *to me* might not be the best
overall.


> >> Is Selenium worthy of consideration?
> >
> > Yes..... but I don't know how much it would buy me. It certainly has
> > no options for editing back the original HTML, so all it would do is
> > the parsing side of things (which is already working fine).
>
> In which case, no gain.
> (I probably use it more than BS - but because it is useful to 'test'
> web-pages, GUI behavior, etc)

Yeah, that's what it shines at.

> A better 'diff' would not look at the HTML, but compare the web-page's
> before and after 'appearances'!
> (after all, the link-checker has already figured-out the 'behind the
> scenes' part)

In theory, but what constitutes a reasonable change to the appearance?

> > Yeah, and the fundamental problem with the MD conversion is time -
> > it's a big manual process. I want to be able to do that progressively
> > over time, but get the basic stuff sorted out much sooner. Ideally, it
> > should be possible to fix all the autofixable links this week and get
> > that sorted out, but converting pages to Markdown will happen slowly
> > over the next few years.
>
> Not something I've ever needed to consider. Are there no tools for this?

Yes and no. Part of the reason for doing it manually is that you need
to decide what parts go in the individual pages (the Markdown files)
and what belongs in the layout file (a single HTML file).
Deduplication requires a measure of intelligence, especially when it's
not 100% identical on different pages, so you have to figure out what
parameterization is worth doing.

> Warning: more 'normal people' know something of HTML, than do of
> Markdown. In fact, whereas many people from outside of IT attend our
> HTML5 courses (implicit disclaimer!), I'll suggest that if a person
> knows Markdown (s)he is at least 85% likely to be in IT.

Maybe, but it's also far FAR easier to learn Markdown. Here's a web
site that is managed by a digital artist, most definitely not an IT
person: https://devicatoutlet.com/ I helped her get it all set up, and
she edits the Markdown files herself. (And commits and pushes to
GitHub so it can be hosted on GH Pages, which disproves any myths that
git is impossible for non-programmers to use.)

> Another ([in]famous dn off-the-wall) question: have you considered
> 'crowd-sourcing' the project? There are bound to be members who
> particularly favor one operetta or song. If the project were moved to a
> wiki or software like WordPress, would individual members be prepared to
> copy-paste from 'the old' to 'the new', checking links and copy-pasting
> the URL, etc, as they go?

I'd rather move into Markdown than either MediaWiki or WordPress,
partly because PHP sucks, and partly because it's nearly impossible to
maintain URLs when you do a migration like that. (In the case of the
Markdown files, I do have a special step in the build system to allow
me to force the output file name, but I almost never need it - only
for a few special cases where it's ".htm" instead of ".html" or
something.) Breaking URLs is a major problem.
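
A hypothetical sketch of what such a forced-output-name step could look like; the real build system is not shown in this thread, so the function name and the shape of the override table are assumptions:

```python
from pathlib import PurePosixPath


def output_path(source: str, overrides: dict[str, str]) -> str:
    """Map a Markdown source path to the HTML path it should be built as.

    By default "foo/bar.md" becomes "foo/bar.html"; an entry in `overrides`
    forces a specific name so that an existing URL keeps working.
    """
    if source in overrides:
        return overrides[source]
    return str(PurePosixPath(source).with_suffix(".html"))
```

For example, an override of {"mikado/notes.md": "mikado/notes.htm"} would keep an old ".htm" URL alive while every other page gets the default ".html" name.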

But once the Markdown system is in place, I absolutely would welcome
people contributing changes in that way. That would end up being part
of the long-term job though.

> NB I recall (somewhere) a claim (about "distraction" and "consumption"
> cf "creativity") that if all Americans were to stop watching
> ("consuming") TV for a single weekend, it would release sufficient time
> to "create" Wikipedia in its entirety.

Yes, because we absolutely NEED a third of a billion lazy people
trying to write content while wanting to just flake :)

> This is the opposite of the mantra I (over-)frequently recite to
> trainees (particularly those who have just learned JS and think they can
> now 'take-over the world'...) "just because we can do it, doesn't make
> it a good idea"!

But taking over the world is so tempting.......

> Similarly, is it possible that you are attempting to be "the very model
> of a modern Major-General", whilst the service required may be more on
> the line about being "very good at [la, la, la; something or other - he
> can't remember the words)] and calculus" (IT not invented back in the
> days of the Pirates of Penzance)?
>
> Looking at the site, (unkindly-speaking) it is reminiscent of
> AOL/GeoCities days. Which is not necessarily 'bad', but is indicative of
> a membership who care less about appearances and more about 'the
> business' of the group.

I think it's more a result of design-by-committee, or rather,
design-by-whoever-feels-like-doing-this (a Charlieocracy of sorts).
For now, I'm maintaining most of that, but hey, if someone looks at
the Markdown version and wants to make improvements, they'd only have
to change the layout file!

> In short (referring back to the 'list of options', above, top), there's
> no need to "give up" (and I can't see you allowing yourself to do-so
> anyway) but perhaps grant yourself permission to accept a result
> (slightly) less than 100%!

I'm absolutely not giving up, and yes, I will accept a result below
100%. What I would like to ensure, if possible, is that I do not
create additional problems while fixing these.

> (and yes, maybe that is the "lazy" coming-out, but there's also a sense
> of YAGNI when investing significant effort into infrequently used
> resources/viewed web-pages - maybe apply the 80:20 'rule'?)
>
>
> Meantime, I'm off to find my boxed-set and play some CDs of 'silly
> songs' while I work...

Enjoy!

ChrisA

