Beautiful Soup - close tags more promptly?
Peter J. Holzer
hjp-python at hjp.at
Mon Oct 24 13:17:55 EDT 2022
On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> <python-list at python.org> wrote:
> > On 2022-10-24, Chris Angelico <rosuav at gmail.com> wrote:
> > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > >> Yes, I got that. What I wanted to say was that this is indeed a bug in
> > >> html.parser and not an error (or sloppyness, as you called it) in the
> > >> input or ambiguity in the HTML standard.
> > >
> > > I described the HTML as "sloppy" for a number of reasons, but I was of
> > > the understanding that it's generally recommended to have the closing
> > > tags. Not that it matters much.
> >
> > Some elements don't need close tags, or even open tags. Unless you're
> > using XHTML you don't need them and indeed for the case of void tags
> > (e.g. <br>, <img>) you must not include the close tags.
>
> Yep, I'm aware of void tags, but I'm talking about the container tags
> - in this case, <li> and <p> - which, in a lot of older HTML pages,
> are treated as "separator" tags. Consider this content:
>
> <HTML>
> Hello, world!
> <P>
> Paragraph 2
> <P>
> Hey look, a third paragraph!
> </HTML>
>
> Stick a doctype onto that and it should be valid HTML5, but as it is,
> it's the exact sort of thing that was quite common in the 90s.
>
> The <p> tag is not a void tag, but according to the spec, it's legal
> to omit the </p> if the element is followed directly by another <p>
> element (or any of a specific set of others), or if there is no
> further content.
Right. The parser knows the structure of an HTML document, which tags
are optional and which elements can be inside of which other elements.
For SGML-based HTML versions (2.0 to 4.01) this is formally described by
the DTD.
So when parsing your file, an HTML parser would work like this:
<HTML> - Yup, I expect an HTML element here:
    HTML
Hello, world! - #PCDATA? Not allowed as a child of HTML. There must
be a HEAD and a BODY, both of which have optional start tags.
HEAD can't contain #PCDATA either, so we must be inside of BODY
and HEAD was empty:
    HTML
    ├─ HEAD
    └─ BODY
       └─ Hello, world!
<P> - Allowed in BODY, so just add that:
    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       └─ P
Paragraph 2 - #PCDATA is allowed in P, so add it as a child:
    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       └─ P
          └─ #PCDATA: Paragraph 2
<P> - Not allowed inside of P, so that implicitly closes the
previous P element and we go up one level:
    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P
Hey look, a third paragraph! - Same as above:
    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P
          └─ #PCDATA: Hey look, a third paragraph!
</HTML> - The end tags of P and BODY are optional, so the end of
HTML closes them implicitly, and we have our final parse tree
(unchanged from the last step):
    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P
          └─ #PCDATA: Hey look, a third paragraph!
For a human, the <p> tags might feel like separators here. But
syntactically they aren't - they start a new element. Note especially
that "Hello, world!" is not part of a P element but a direct child of
BODY (which may or may not be intended by the author).
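The walk above can be sketched in Python. This is only an illustration of
the steps, not how html.parser or Beautiful Soup work internally. Node,
TreeBuilder and build_tree are names made up for this example, and the only
rules it knows are the ones used above: content implies an empty HEAD and an
open BODY, a P start tag closes a directly open P, and an end tag closes
everything up to the element it names. It assumes input of the kind shown
(nothing after </HTML>, no explicit HEAD) and ignores whitespace-only text.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical parse-tree node: an element, or '#PCDATA' with text."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def dump(self, depth=0):
        """Return the subtree as a list of indented lines."""
        label = self.name if self.text is None else f"{self.name}: {self.text}"
        lines = ["  " * depth + label]
        for child in self.children:
            lines.extend(child.dump(depth + 1))
        return lines


class TreeBuilder(HTMLParser):
    """Toy tree builder that knows only the rules used in the walk above."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def _open(self, name):
        node = Node(name)
        if self.stack:
            self.stack[-1].add(node)
        else:
            self.root = node
        self.stack.append(node)

    def _ensure_body(self):
        # HTML, HEAD and BODY have optional start tags: content that cannot
        # be a child of HTML or HEAD implies an empty HEAD and an open BODY.
        if not self.stack:
            self._open("HTML")
        if len(self.stack) == 1:
            self.stack[-1].add(Node("HEAD"))
            self._open("BODY")

    def handle_starttag(self, tag, attrs):
        tag = tag.upper()  # HTMLParser reports tag names in lower case
        if tag == "HTML":
            if not self.stack:
                self._open("HTML")
            return
        self._ensure_body()
        if tag == "BODY":
            return  # BODY is open now, whether the tag was written or not
        if tag == "P" and self.stack[-1].name == "P":
            self.stack.pop()  # P is not allowed inside P: close the open one
        self._open(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self._ensure_body()
        self.stack[-1].add(Node("#PCDATA", text))

    def handle_endtag(self, tag):
        tag = tag.upper()
        if tag not in [node.name for node in self.stack]:
            return  # stray end tag: ignore it
        # Closing an element also closes any open elements inside it.
        while self.stack.pop().name != tag:
            pass


def build_tree(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


SOURCE = """<HTML>
Hello, world!
<P>
Paragraph 2
<P>
Hey look, a third paragraph!
</HTML>
"""

print("\n".join(build_tree(SOURCE).dump()))
```

Running it on Chris's example prints the same structure as the final tree
above, with "Hello, world!" as a direct child of BODY and the two P elements
as siblings.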
>
> > Adding in the omitted <head>, </head>, <body>, </body>, and </html>
> > would make no difference and there's no particular reason to recommend
> > doing so as far as I'm aware.
>
> And yet most people do it. Why?
There may be several reasons:
* Historically, some browsers differed in which end tags were actually
  optional. Since (AFAIK) no mainstream browser ever implemented a real
  SGML parser (they were always "tag soup" parsers with lots of ad-hoc
  rules), this sometimes even changed within the same browser depending
  on context (e.g. a simple table might work but nested tables wouldn't).
  So people started to use end tags defensively.
* XHTML was popular for some time, and it doesn't have any optional tags.
  So people got into the habit of always using end tags and writing
  empty tags as <XXX />.
* Aesthetics: Always writing the end tags is more consistent and may
  look more balanced.
* Cargo-cult: People saw other people do that and copied the habit
  without thinking about it.
> Are you saying that it's better to omit them all?
If you want to conserve keystrokes :-)
I think it doesn't matter. Both are valid.
> More importantly: Would you omit all the </p> closing tags you can, or
> would you include them?
I usually write them. I also indent the contents of an element, so I
would write your example as:
    <!DOCTYPE html>
    <html>
        <body>
            Hello, world!
            <p>
                Paragraph 2
            </p>
            <p>
                Hey look, a third paragraph!
            </p>
        </body>
    </html>
(As you can see, I would also include the body tags to make that element
explicit. I would normally also add a bit of boilerplate (especially a
head with a charset and viewport definition), but I omit it here since
it would change the parse tree.)
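The difference between the two spellings is easy to see at the tokenizer
level. The following is a small sketch using html.parser from the standard
library; EventLogger and tag_events are names made up for this example.
HTMLParser only reports the tags that literally appear in the input, so
anything built on top of it has to infer the omitted end tags itself.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record start and end tags as they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.events.append(f"</{tag}>")


def tag_events(source):
    logger = EventLogger()
    logger.feed(source)
    logger.close()
    return logger.events


# Chris's example, with the optional tags omitted.
TERSE = """<HTML>
Hello, world!
<P>
Paragraph 2
<P>
Hey look, a third paragraph!
</HTML>
"""

# The same content with the body and p tags written out.
EXPLICIT = """<!DOCTYPE html>
<html>
<body>
Hello, world!
<p>
Paragraph 2
</p>
<p>
Hey look, a third paragraph!
</p>
</body>
</html>
"""

print(tag_events(TERSE))
print(tag_events(EXPLICIT))
```

For the terse version only the four tags that are present are reported
(tag names come back in lower case); for the explicit version every start
tag is matched by a reported end tag. The parse tree is supposed to be the
same in both cases, so the missing end tags have to be supplied by whatever
consumes these events.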
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp at hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"