Mutating an HTML file with BeautifulSoup

Peter J. Holzer hjp-python at hjp.at
Sun Aug 21 14:19:07 EDT 2022


On 2022-08-20 21:51:41 -0000, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> > Jon Ribbens <jon+usenet at unequivocal.eu> writes:
> >>... or you could avoid all that faff and just do re.sub()?

> > source = '<a name="b" href="http" accesskey="c"></a>'
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

Depending on the content of the site, this might replace some stuff
which is not a link.
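
To make that concrete, here is a small sketch that applies the two quoted re.sub() calls to some made-up input. The function name naive_upgrade and the sample lines are my own assumptions for illustration, not something from the thread.

```python
import re


def naive_upgrade(source):
    """Apply the two substitutions quoted above, unchanged."""
    result = re.sub(r'href\s*=\s*"http"', r'href="https"', source)
    result = re.sub(r"href\s*=\s*'http'", r"href='https'", result)
    return result


# Hypothetical input: one real link, one <link> element, one piece of prose.
sample = (
    '<a name="b" href="http" accesskey="c"></a>\n'
    '<link rel="stylesheet" href="http">\n'
    '<p>Write href="http" to link to it.</p>\n'
)

# All three lines are rewritten, although only the first is an <a> link.
print(naive_upgrade(sample))
```

The pattern has no notion of which element (if any) it is inside, so the <link> element and the running text are changed as well.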


> You could go a bit harder with the regexp of course, e.g.:
> 
>   result = re.sub(
>       r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",

This will fail on:
    <a alt="42 > 23" href="the.answer.html">
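
A quick way to check this, using only the search pattern quoted above (the replacement argument was trimmed from the quote, so it is left out here). Filling the OLD placeholder with the.answer.html is my assumption.

```python
import re

# Assumption: the OLD placeholder from the quoted pattern stands for this URL.
OLD = "the.answer.html"

PATTERN = re.compile(
    r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*""" + re.escape(OLD) + r"""\s*\2"""
)

plain = '<a alt="42" href="the.answer.html">'
tricky = '<a alt="42 > 23" href="the.answer.html">'

# The pattern finds the href in the plain tag ...
print(PATTERN.search(plain) is not None)
# ... but [^>]* cannot step over the ">" inside the alt value,
# so the href in the second tag is never reached.
print(PATTERN.search(tricky) is not None)
```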

The problem can be solved with regular expressions (and given the
constraints I think I would prefer that to using Beautiful Soup), but
getting the regexps right is not trivial, at least in the general case.
It may become a lot easier if you know that certain conventions were
followed (e.g. that ">" was always written as "&gt;") or it may become
even harder when the files contain errors.
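
As a rough sketch of what "getting it right" might start to look like: first match a whole <a ...> start tag while skipping over quoted attribute values, then rewrite the href only inside that tag. The names A_TAG, HREF and replace_href are made up for this example. It assumes reasonably well-formed HTML with quoted attribute values, and it can still be fooled, e.g. by the text href="..." inside another attribute's value, or by tags inside comments or scripts.

```python
import re

# One <a ...> start tag. Quoted attribute values are consumed as a unit,
# so a ">" inside them does not end the tag.
A_TAG = re.compile(r"""<a\s(?:[^>"']|"[^"]*"|'[^']*')*>""", re.IGNORECASE)

# A quoted href attribute: prefix, quote character, value, same quote.
HREF = re.compile(r"""(\shref\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)


def replace_href(html, old, new):
    """Replace href values equal to `old` with `new`, in <a> tags only."""

    def fix_attr(m):
        # Leave hrefs pointing elsewhere untouched.
        if m.group(3).strip() != old:
            return m.group(0)
        return m.group(1) + m.group(2) + new + m.group(2)

    def fix_tag(tag):
        # Everything in the tag except the matching href value is kept
        # as it was, so the attribute order stays intact.
        return HREF.sub(fix_attr, tag.group(0))

    return A_TAG.sub(fix_tag, html)


print(replace_href('<a alt="42 > 23" href="the.answer.html">',
                   "the.answer.html", "new.html"))
```

Unquoted attribute values and broken markup are not handled at all, which is where the "even harder" part begins.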

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"