[issue23144] html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text

7 Mar 2015


Ezio Melotti added the comment:
...
> I still think it would be worthwhile adding close() calls to
> the examples in the documentation (Doc/library/html.parser.rst).

If I add context manager support to HTMLParser I can update the examples to use it, but otherwise I don't think it's worth changing them now.
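
For illustration only, here is a minimal sketch of what a close() call in such an example might look like. It assumes contextlib.closing() as a stand-in for the context manager support discussed above, which HTMLParser itself does not provide here; TextCollector and extract_text are hypothetical names, not part of the documentation.

```python
from contextlib import closing
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that records every piece of text it is handed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    parser = TextCollector()
    # closing() calls parser.close() on exit, which flushes any text
    # the parser was still holding back.
    with closing(parser):
        parser.feed(markup)
    return "".join(parser.chunks)


# The trailing "AT&T" ends in something that could be the start of a
# charref, so the parser keeps " by AT&T" buffered until close().
text = extract_text("<p>Fish &amp; chips</p> by AT&T")
print(text)
```

Without the close() call, the text after the closing tag would stay in the parser's buffer in this sketch, because the parser cannot tell whether more input is coming.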
...
> BTW I haven’t tested this, and maybe it is not a concern, but even with
> this patch it looks like the parser will buffer unlimited data and
> output nothing until close() if each string it is fed ends with an
> ampersand (and otherwise contains only plain text, no tags etc).

This is true, but I don't think it's a realistic case.
For this to be a problem you would need:
1) Someone feeding the parser with arbitrary chunks.  Text files are usually fed to the parser whole, or line by line -- arbitrary chunks are uncommon.
2) A file that contains a lot of entities.  In most documents charrefs are not very common, so the chance that a chunk will split one in the middle is low.  The chance that several consecutive charrefs are split in the middle is even lower.
3) A file that is very big.  Even if the whole file is buffered until a call to close(), it shouldn't be a concern, since most files are relatively small.  It is true that this has quadratic complexity, but I would expect the parsing to complete in a reasonable time for average sizes.
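
The scenario in question can be reproduced with a small sketch like the one below. It assumes CPython's current handling of a trailing ampersand with convert_charrefs=True; RecordingParser is a hypothetical name used only for this illustration, and the exact behaviour may differ between versions.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical helper that records each handle_data() call."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.received = []

    def handle_data(self, data):
        self.received.append(data)


parser = RecordingParser()

# Plain text only, no tags, and every chunk ends with an ampersand.
for chunk in ["a &", "b &", "c &"]:
    parser.feed(chunk)

# Snapshot of what was delivered before close() is called.
before_close = list(parser.received)

parser.close()
after_close = list(parser.received)

print(before_close)
print(after_close)
```

In this sketch nothing reaches handle_data() until close(), at which point the whole buffered text is delivered in one call. Each feed() also rescans the accumulated buffer, which is where the quadratic behaviour mentioned in point 3 comes from.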

----------

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue23144>
_______________________________________
