HTMLParser and HTML5
Hello all,

I wanted to ask a few questions and start a discussion about HTML5 support within the HTMLParser class(es). Over on issue 670664, an inconsistency between the way browsers and HTMLParser parse script and style tags was discovered. Currently, HTMLParser adheres strictly to the HTML4 standard, which says that these tags should exit CDATA mode when the start of *any* closing tag is found. No browsers, to my knowledge, have ever supported this (at least in the 21st century). Instead, all browsers implement the behavior described in the HTML5 spec, which states that script tags should exit their "raw text mode" when the full closing tag for that element is encountered.

The repercussions of adhering to the HTML4 standard in HTMLParser are somewhat serious: a good number of documents will encounter exceptions for "broken" markup (which isn't actually broken). Libraries like Beautiful Soup (which depend on HTMLParser) are also affected, requiring the use of hacks just to get the document to parse at all.

Rather than bore you all with another paragraph about how HTML4 is terrible, feel free to look at the issue (http://bugs.python.org/issue670664), which quite thoroughly outlines the pros and cons of this particular change. Any feedback/input on the proposed changes is welcome.

So here are my questions:

- What plans, if any, are there to support HTML5 parsing behaviors, since the HTML5 spec effectively describes current web browser behavior?
- What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?

Given the semi-backward-compatible nature of HTML5's syntax, this seems like a rather unique problem that could use some more discussion.

Thanks
Matt Basta
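To make the difference between the two rules in the message above concrete, here is a minimal sketch. It is not HTMLParser's actual code: the names `HTML4_END`, `html5_end_pattern` and `raw_text_end` are made up for illustration, and the regular expressions are only an approximation of the two specs' wording (HTML4: stop at the first "</" followed by a letter; HTML5: stop only at the element's own closing tag).

```python
import re

# HTML4-style rule (approximation): raw text ends at the first "</"
# that is followed by a letter, whatever tag name comes next.
HTML4_END = re.compile(r"</[a-zA-Z]")


def html5_end_pattern(tag):
    """HTML5-style rule (approximation): raw text ends only at the
    closing tag of the element that opened it, e.g. "</script>"."""
    return re.compile(r"</" + re.escape(tag) + r"(?=[\t\n\r\f />])", re.IGNORECASE)


def raw_text_end(source, start, pattern):
    """Return the index where the raw text that begins at `start` ends."""
    match = pattern.search(source, start)
    return match.start() if match else len(source)


DOC = '<script>var s = "<b>bold</b>";</script>'
BODY_START = len("<script>")

# Under the HTML4 rule the script body is cut short at "</b>".
html4_body = DOC[BODY_START:raw_text_end(DOC, BODY_START, HTML4_END)]

# Under the HTML5 rule the whole script body survives.
html5_body = DOC[BODY_START:raw_text_end(DOC, BODY_START, html5_end_pattern("script"))]

print(repr(html4_body))
print(repr(html5_body))
```

With the HTML4 rule, everything after the truncated body is handed back to the normal tag parser, which is how otherwise ordinary script content can end up reported as broken markup.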
On Thu, Jul 28, 2011 at 11:25, Matt <mattbasta@gmail.com> wrote:
Hello all,
I wanted to ask a few questions and start a discussion about HTML5 support within the HTMLParser class(es). Over on issue 670664, an inconsistency with the way browsers and the HTMLParser parse script and style tags was discovered. Currently, HTMLParser adheres strictly to the HTML4 standard, which says that these tags should exit CDATA mode when the start of *any* closing tag is found. No browsers, to my knowledge, have ever supported this (at least in the 21st century). Instead, all browsers implement the behavior described in the HTML5 spec, which states that script tags should exit their "raw text mode" when the full closing tag for that element is encountered.
The repercussions of adhering to the HTML4 standard in HTMLParser are somewhat serious: a good number of documents will encounter exceptions for "broken" markup (which isn't actually broken). Libraries like Beautiful Soup (which depend on HTMLParser) are also affected, requiring the use of hacks just to get the document to parse at all.
Rather than bore you all with another paragraph about how HTML4 is terrible, feel free to look at the issue (http://bugs.python.org/issue670664), which quite thoroughly outlines the pros and cons of this particular change. Any feedback/input on the proposed changes is welcome.
So here are my questions:
- What plans, if any, are there to support HTML5 parsing behaviors, since the HTML5 spec effectively describes current web browser behavior?
There are no specific plans that have been publicly brought up (to my knowledge).
- What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
There aren't any beyond "it would be nice".
Given the semi-backward-compatible nature of HTML5's syntax, this seems like a rather unique problem that could use some more discussion.
It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code). IOW there are no policies specifically about this topic beyond the general desire to stay up-to-date with stable specs.
Brett Cannon, 28.07.2011 23:49:
On Thu, Jul 28, 2011 at 11:25, Matt wrote:
- What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
There aren't any beyond "it would be nice". [...] It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest. I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it. Stefan
On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Brett Cannon, 28.07.2011 23:49:
On Thu, Jul 28, 2011 at 11:25, Matt wrote:
- What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
There aren't any beyond "it would be nice". [...] It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest.
I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.
I disagree. Having proper html parsing out of the box is part of the "batteries included" thing. And it is not a matter of "having html 5" - as stated on this thread, fixing it for html5 will fix it for html that exists in the "real world". Python _has_ to work with quick 30-50 line scripts deliverable everywhere, not just have proper 3rd party libraries that can work as part of a huge project using buildout. js -><-
Stefan
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/jsbueno%40python.org.br
Joao S. O. Bueno, 29.07.2011 13:22:
On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
Brett Cannon, 28.07.2011 23:49:
On Thu, Jul 28, 2011 at 11:25, Matt wrote:
- What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
There aren't any beyond "it would be nice". [...] It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest.
I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.
I disagree. Having proper html parsing out of the box is part of the "batteries included" thing.
Well, you can easily prove me wrong by implementing this. Stefan
On Jul 29, 2011, at 7:46 AM, Stefan Behnel wrote:
Joao S. O. Bueno, 29.07.2011 13:22:
On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
Brett Cannon, 28.07.2011 23:49:
On Thu, Jul 28, 2011 at 11:25, Matt wrote:
- What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
There aren't any beyond "it would be nice". [...] It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest.
I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.
I disagree. Having proper html parsing out of the box is part of the "batteries included" thing.
Well, you can easily prove me wrong by implementing this.
Stefan
Please don't implement this just to prove Stefan wrong :). The thing to do, if you want html parsing in the stdlib, is to _incorporate_ html5lib, which is already a perfectly good, thoroughly tested HTML parser, and simply deprecate HTMLParser and friends. Implementing a new parser would serve no purpose I can see. -glyph
On Fri, Jul 29, 2011 at 11:03 AM, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
On Jul 29, 2011, at 7:46 AM, Stefan Behnel wrote:
Joao S. O. Bueno, 29.07.2011 13:22:
On Fri, Jul 29, 2011 at 1:37 AM, Stefan Behnel wrote:
Brett Cannon, 28.07.2011 23:49:
On Thu, Jul 28, 2011 at 11:25, Matt wrote:
- What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?
There aren't any beyond "it would be nice". [...] It's more of an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or introduce new code to live beside the HTML4 parsing code).
Which, given that html5lib readily exists, would likely be a lot more work than anyone who is interested in HTML5 handling would want to invest.
I don't think we need a new HTML5 parsing implementation only to have it in the stdlib. That's the old sunny Java way of doing it.
I disagree. Having proper html parsing out of the box is part of the "batteries included" thing.
Well, you can easily prove me wrong by implementing this.
As far as the issue described in my initial message goes, there is a patch and tests for the patch.
Please don't implement this just to prove Stefan wrong :).
The thing to do, if you want html parsing in the stdlib, is to _incorporate_ html5lib, which is already a perfectly good, thoroughly tested HTML parser, and simply deprecate HTMLParser and friends. Implementing a new parser would serve no purpose I can see.
I don't see any real reason to drop a decent piece of code (HTMLParser, that is) in favor of a third party library when only relatively minor updates are needed to bring it up to speed with the latest spec. As far as structure goes, HTML4 and HTML5 are practically identical. The differences between the two that are applicable to HTMLParser involve the way the specs deal with special element types and broken syntax. For what it's worth, the rules HTML4 does define are (in many cases) ignored in favor of more modern, Postel's Law-agreeable rules. HTML5 simply standardized what browsers actually do.

Deprecating HTMLParser in favor of a newer/better/faster HTML library is a bad thing for everybody that's already using HTMLParser, whether directly or indirectly. html5lib does not have an interface compatible with HTMLParser, so code would largely need to be rewritten from scratch to gain the benefits of HTML5's support for broken code. Developers using HTMLParser would be permanently stuck using a library that throws exceptions for perfectly valid HTML.

Keep in mind that these are solved problems: all of the thinking on how to handle broken code has been done for us by the folks at the WHATWG. It's simply a matter of updating our existing code with these new rules.

While I agree that there are merits to dropping support for the old code, it does not solve the existing problems that folks are having right now (namely incorrect parser output or exceptions). It would be more ideal to perhaps patch the obvious issues stemming from HTML4 support for now, leaving anything that goes beyond parity with browsers for a later time or implementing as an opt-in feature (i.e.: enabled by a parameter).

Matt
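For readers unfamiliar with the interface being defended here, the following is a small sketch of the callback style that existing HTMLParser users depend on. It assumes a current Python 3, where the class lives in `html.parser` and script content is treated as raw text up to the matching closing tag; `ScriptCollector` and `collect_script_text` are illustrative names, not part of the standard library.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text inside <script> elements via HTMLParser callbacks."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chunks = []
        self.tags_seen = []

    def handle_starttag(self, tag, attrs):
        # Called once per start tag the parser recognises as markup.
        self.tags_seen.append(tag)
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content may arrive in one chunk or several, depending on
        # the Python version, so the chunks are joined afterwards.
        if self.in_script:
            self.script_chunks.append(data)


def collect_script_text(markup):
    """Return (joined script text, list of start tags seen)."""
    parser = ScriptCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.script_chunks), parser.tags_seen


text, tags = collect_script_text('<script>var s = "<b>bold</b>";</script><p>hi</p>')
print(repr(text))
print(tags)
```

Code written in this subclass-and-override style is what would have to be rewritten if the module were replaced by a library with a different API, which is the compatibility concern raised above.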
On Jul 29, 2011, at 3:00 PM, Matt wrote:
I don't see any real reason to drop a decent piece of code (HTMLParser, that is) in favor of a third party library when only relatively minor updates are needed to bring it up to speed with the latest spec.
I am not really one to throw stones here, as Twisted contains a lenient pseudo-XML parser which I still maintain - one which decidedly does not agree with html5's requirements for dealing with invalid data, but just a bunch of ad-hoc guesses of my own. My impression of HTML5 is that HTMLParser would require significant modifications and possibly a drastic re-architecture in order to really do HTML5 "right"; especially the parts that the html5lib authors claim make HTML5 streaming-unfriendly, i.e. subtree reordering when encountering certain types of invalid data. But if I'm wrong about that, and there are just a few spec updates and bugfixes that need to be applied, by all means, ignore my comment. -glyph
On Fri, Jul 29, 2011 at 13:16, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
On Jul 29, 2011, at 3:00 PM, Matt wrote:
I don't see any real reason to drop a decent piece of code (HTMLParser, that is) in favor of a third party library when only relatively minor updates are needed to bring it up to speed with the latest spec.
I am not really one to throw stones here, as Twisted contains a lenient pseudo-XML parser which I still maintain - one which decidedly does *not* agree with html5's requirements for dealing with invalid data, but just a bunch of ad-hoc guesses of my own.
My impression of HTML5 is that HTMLParser would require significant modifications and possibly a drastic re-architecture in order to really do HTML5 "right"; especially the parts that the html5lib authors claim make HTML5 streaming-unfriendly, i.e. subtree reordering when encountering certain types of invalid data.
We could also have the code live side-by-side for a while (or indefinitely if that was really desired) by bringing html5lib in as either a separate module or having the relevant classes live in htmllib under different names. But all of this is just hypothetical until someone decides to do the legwork to actually make a proposal and get the coding done. -Brett
But if I'm wrong about that, and there are just a few spec updates and bugfixes that need to be applied, by all means, ignore my comment.
-glyph
On Fri, 29 Jul 2011 13:34:13 -0700 Brett Cannon <brett@python.org> wrote:
On Fri, Jul 29, 2011 at 13:16, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
On Jul 29, 2011, at 3:00 PM, Matt wrote:
I don't see any real reason to drop a decent piece of code (HTMLParser, that is) in favor of a third party library when only relatively minor updates are needed to bring it up to speed with the latest spec.
I am not really one to throw stones here, as Twisted contains a lenient pseudo-XML parser which I still maintain - one which decidedly does *not* agree with html5's requirements for dealing with invalid data, but just a bunch of ad-hoc guesses of my own.
My impression of HTML5 is that HTMLParser would require significant modifications and possibly a drastic re-architecture in order to really do HTML5 "right"; especially the parts that the html5lib authors claim make HTML5 streaming-unfriendly, i.e. subtree reordering when encountering certain types of invalid data.
We could also have the code live side-by-side for a while (or indefinitely if that was really desired) by bringing html5lib in as either a separate module or having the relevant classes live in htmllib under different names.
Unless html5lib is better in some fundamental ways which are difficult to fix in htmllib, I'm not sure there's any point in adding it to the stdlib. We don't really do users a service if we keep adding alternative APIs for common functionality. Regards Antoine.
On 07/29/2011 07:22 AM, Joao S. O. Bueno wrote:
I disagree. Having proper html parsing out of the box is part of the "batteries included" thing. And it is not a matter of "having html 5" - as stated on this thread, fixing it for html5 will fix it for html that exists in the "real world".
Python _has_ to work with quick 30-50 line scripts deliverable everywhere, not just have proper 3rd party libraries that can work as part of a huge project using buildout.
Assuming it were merged today, that parser would only be available on Python 3.3 and later: how is that "everywhere"? Having scripts that work against html5lib (which *doesn't* need buildout to install, or even setuptools) makes them portable to any version of Python supported by the library (Python 2.3+, AFAICT).

Tres.
--
Tres Seaver +1 540-429-0999 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
On Fri, Jul 29, 2011 at 11:31, Tres Seaver <tseaver@palladion.com> wrote:
On 07/29/2011 07:22 AM, Joao S. O. Bueno wrote:
I disagree. Having proper html parsing out of the box is part of the "batteries included" thing. And it is not a matter of "having html 5" - as stated on this thread, fixing it for html5 will fix it for html that exists in the "real world".
Python _has_ to work with quick 30-50 line scripts deliverable everywhere, not just have proper 3rd party libraries that can work as part of a huge project using buildout.
Assuming it were merged today, that parser would only be available on Python 3.3 and later: how is that "everywhere"?
Well, "everywhere, eventually". This gets down to the usual philosophical debate of what should (not) be in the stdlib so that those who have strict third-party code get access to useful libraries while balancing the desire of those who want to keep the stdlib lean or prevent stagnating the API of a module.
Having scripts that work against html5lib (which *doesn't* need buildout to install, or even setuptools) makes them portable to any version of Python supported by the library (Python 2.3+, AFAICT).
If the library were brought in, they could probably continue to be portable with possibly just the addition of a try/except on the import line. -Brett
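The import-line guard mentioned above is the usual try/except ImportError fallback. A minimal sketch follows, with the caveat that no stdlib copy of html5lib exists: the helper `import_first_available` and the module name `hypothetical_stdlib_html5` are invented for illustration, and `json` merely stands in for a module that is known to be importable.

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in `module_names` that can be imported.

    This is the loop form of the common idiom:

        try:
            import preferred_module as parser_module
        except ImportError:
            import fallback_module as parser_module
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module not installed on this interpreter; try the next name.
            continue
    raise ImportError("none of %r could be imported" % (module_names,))


# "hypothetical_stdlib_html5" is a made-up name and is expected to be
# missing, so the helper falls through to the stdlib "json" stand-in.
module = import_first_available("hypothetical_stdlib_html5", "json")
print(module.__name__)
```

A script using this pattern would list the stdlib name first and the third-party name second (or the reverse), and run unchanged on interpreters that have only one of them.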
participants (7)
- Antoine Pitrou
- Brett Cannon
- Glyph Lefkowitz
- Joao S. O. Bueno
- Matt
- Stefan Behnel
- Tres Seaver