Beautiful Soup - close tags more promptly?

Mon Oct 24 13:17:55 EDT 2022

On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> <python-list at python.org> wrote:
> > On 2022-10-24, Chris Angelico <rosuav at gmail.com> wrote:
> > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > >> Yes, I got that. What I wanted to say was that this is indeed a bug in
> > >> html.parser and not an error (or sloppiness, as you called it) in the
> > >> input or ambiguity in the HTML standard.
> > >
> > > I described the HTML as "sloppy" for a number of reasons, but I was of
> > > the understanding that it's generally recommended to have the closing
> > > tags. Not that it matters much.
> >
> > Some elements don't need close tags, or even open tags. Unless you're
> > using XHTML you don't need them and indeed for the case of void tags
> > (e.g. <br>, <img>) you must not include the close tags.
> 
> Yep, I'm aware of void tags, but I'm talking about the container tags
> - in this case, <li> and <p> - which, in a lot of older HTML pages,
> are treated as "separator" tags. Consider this content:
> 
> <HTML>
> Hello, world!
> <P>
> Paragraph 2
> <P>
> Hey look, a third paragraph!
> </HTML>
> 
> Stick a doctype onto that and it should be valid HTML5, but as it is,
> it's the exact sort of thing that was quite common in the 90s.
> 
> The <p> tag is not a void tag, but according to the spec, it's legal
> to omit the </p> if the element is followed directly by another <p>
> element (or any of a specific set of others), or if there is no
> further content.

Right. The parser knows the structure of an HTML document, which tags
are optional and which elements can be inside of which other elements.
For SGML-based HTML versions (2.0 to 4.01) this is formally described by
the DTD.

So when parsing your file, an HTML parser would work like this:

    <HTML> - Yup, I expect an HTML element here:
        HTML
    Hello, world! - #PCDATA? Not allowed as a child of HTML. There must
        be a HEAD and a BODY, both of which have optional start tags.
        HEAD can't contain #PCDATA either, so we must be inside of BODY
        and HEAD was empty:
        HTML
          ├─ HEAD
          └─ BODY
               └─ #PCDATA: Hello, world!
    <P> - Allowed in BODY, so just add that:
        HTML
          ├─ HEAD
          └─ BODY
               ├─ #PCDATA: Hello, world!
               └─ P
    Paragraph 2 - #PCDATA is allowed in P, so add it as a child:
        HTML
          ├─ HEAD
          └─ BODY
               ├─ #PCDATA: Hello, world!
               └─ P
                   └─ #PCDATA: Paragraph 2
    <P> - Not allowed inside of P, so that implicitly closes the
        previous P element and we go up one level:
        HTML
          ├─ HEAD
          └─ BODY
               ├─ #PCDATA: Hello, world!
               ├─ P
               │   └─ #PCDATA: Paragraph 2
               └─ P
    Hey look, a third paragraph! - Same as above:
        HTML
          ├─ HEAD
          └─ BODY
               ├─ #PCDATA: Hello, world!
               ├─ P
               │   └─ #PCDATA: Paragraph 2
               └─ P
                   └─ #PCDATA: Hey look, a third paragraph!
    </HTML> - The end tags of P and BODY are optional, so the end of
        HTML closes them implicitly, and we have our final parse tree
        (unchanged from the last step):
        HTML
          ├─ HEAD
          └─ BODY
               ├─ #PCDATA: Hello, world!
               ├─ P
               │   └─ #PCDATA: Paragraph 2
               └─ P
                   └─ #PCDATA: Hey look, a third paragraph!

For a human, the <p> tags might feel like separators here. But
syntactically they aren't - they start a new element. Note especially
that "Hello, world!" is not part of a P element but a direct child of
BODY (which may or may not be intended by the author).
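
To make the "implicitly closes the previous P" step concrete, here is a
rough sketch on top of the standard library's html.parser tokenizer. It
is not how Beautiful Soup or any browser does it, and it is far from a
complete HTML parser: the CLOSED_BY table, the Node class and the
ImpliedEndParser name are all made up for illustration, and it does not
insert the implied HEAD and BODY elements from the walk-through above.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified rule table (an assumption for this
# sketch): an open element named by the key is implicitly closed when a
# start tag from the value set arrives.
CLOSED_BY = {"p": {"p"}, "li": {"li"}}


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []  # child Node objects or plain text strings


class ImpliedEndParser(HTMLParser):
    """Builds a small tree and applies the implied-end-tag rule above."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # A <p> inside an open <p> is not allowed, so the open one is
        # closed implicitly and we go up one level first.
        if tag in CLOSED_BY.get(self.current.name, ()):
            self.current = self.current.parent
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # An explicit end tag also closes every element still open
        # inside the one it names; unmatched end tags are ignored.
        node = self.current
        while node is not None and node.name != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def parse(source):
    parser = ImpliedEndParser()
    parser.feed(source)
    parser.close()
    return parser.root


def outline(node, depth=0):
    """Return the tree as a list of indented lines."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        if isinstance(child, str):
            lines.append("  " * (depth + 1) + repr(child))
        else:
            lines.extend(outline(child, depth + 1))
    return lines


SAMPLE = """<HTML>
Hello, world!
<P>
Paragraph 2
<P>
Hey look, a third paragraph!
</HTML>"""

if __name__ == "__main__":
    print("\n".join(outline(parse(SAMPLE))))
```

Running it on the example prints a tree where "Hello, world!" is a
direct child of html and the two p elements are siblings, not nested.
(html.parser reports tag names in lower case.)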

> 
> > Adding in the omitted <head>, </head>, <body>, </body>, and </html>
> > would make no difference and there's no particular reason to recommend
> > doing so as far as I'm aware.
> 
> And yet most people do it. Why?

There may be several reasons:

* Historically, some browsers differed in which end tags were actually
  optional. Since (AFAIK) no mainstream browser ever implemented a real
  SGML parser (they were always "tag soup" parsers with lots of ad-hoc
  rules) this sometimes even changed within the same browser depending
  on context (e.g. a simple table might work but nested tables wouldn't).
  So people started to use end-tags defensively.
* XHTML was popular for some time and it doesn't have any optional tags.
  So people got into the habit of always using end tags and writing
  empty tags as <XXX />.
* Aesthetics: Always writing the end tags is more consistent and may
  look more balanced.
* Cargo-cult: People saw other people do that and copied the habit
  without thinking about it.

> Are you saying that it's better to omit them all?

If you want to conserve keystrokes :-)

I think it doesn't matter. Both are valid.

> More importantly: Would you omit all the </p> closing tags you can, or
> would you include them?

I usually write them. I also indent the contents of an element, so I
would write your example as:

<!DOCTYPE html>
<html>
  <body>
    Hello, world!
    <p>
      Paragraph 2
    </p>
    <p>
      Hey look, a third paragraph!
    </p>
  </body>
</html>

(As you can see, I would also include the body tags to make that element
explicit. I would normally also add a bit of boilerplate (especially a
head with a charset and viewport definition), but I omit it here since
it would change the parse tree.)

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"