[issue23144] html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text
Ezio Melotti added the comment:
> I still think it would be worthwhile adding close() calls to the examples in the documentation (Doc/library/html.parser.rst).
If I add context manager support to HTMLParser I can update the examples to use it, but otherwise I don't think it's worth changing them now.
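For illustration only, here is a minimal sketch of what an example with an explicit close() could look like. HTMLParser had no context manager support when this was written, so the sketch uses contextlib.closing() from the standard library as a stand-in; it is not the API proposed above. TextCollector and extract_text are hypothetical names made up for this sketch.

```python
from contextlib import closing
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every string passed to handle_data()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(pieces):
    """Feed the pieces one by one and return the collected text."""
    parser = TextCollector()
    # closing() calls parser.close() on exit, which flushes any text the
    # parser was still holding back (for example a possibly incomplete charref).
    with closing(parser):
        for piece in pieces:
            parser.feed(piece)
    return "".join(parser.chunks)


# A charref split across two feed() calls is still decoded as one unit.
print(extract_text(["<p>fish &amp", "; chips</p>"]))
```

Without the close() call, text that ends in an unterminated "&" would stay in the parser's buffer and never reach handle_data().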
> BTW I haven’t tested this, and maybe it is not a concern, but even with this patch it looks like the parser will buffer unlimited data and output nothing until close() if each string it is fed ends with an ampersand (and otherwise contains only plain text, no tags etc).
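The scenario described in the paragraph above can be reproduced with a short script. This is only a sketch, assuming convert_charrefs=True and input that contains no tags; RecordingParser and feed_chunks are hypothetical names, not part of html.parser.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical helper: records every string passed to handle_data()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.received = []

    def handle_data(self, data):
        self.received.append(data)


def feed_chunks(chunks):
    """Feed each chunk, then close; return the text seen before and after close()."""
    parser = RecordingParser()
    for chunk in chunks:
        parser.feed(chunk)
    before_close = "".join(parser.received)
    parser.close()
    after_close = "".join(parser.received)
    return before_close, after_close


# Plain text only, and every chunk ends with an ampersand, so the parser
# cannot tell whether a charref has been cut in half at the end of the chunk.
before, after = feed_chunks(["alpha &", " beta &", " gamma &"])
print(repr(before))  # text delivered before close()
print(repr(after))   # text delivered once close() has been called
```

With these chunks nothing is delivered until close() is called, because the last "&" in the buffer is never followed by whitespace or ";". Chunks that do not end in an unterminated "&" are passed to handle_data() as soon as they are fed.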
This is true, but I don't think it's a realistic case. For this to be a problem you would need:

1) Someone feeding the parser with arbitrary chunks. Text files are usually fed to the parser whole, or line by line -- arbitrary chunks are uncommon.
2) A file that contains a lot of entities. In most documents charrefs are not very common, so the chance that a chunk will split one in the middle is low. The chance that several consecutive charrefs are split in the middle is even lower.
3) A file that is very big. Even if the whole file is buffered until a call to close(), it shouldn't be a concern, since most files are relatively small.

It is true that this has quadratic complexity, but I would expect the parsing to complete in a reasonable time for average sizes.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue23144>
_______________________________________