
Hello,

I am having trouble using lxml.html.html5parser[1] with Python 3.

    line 147, in fromstring
        guess_charset=guess_charset)
      File "/home/paul/.virtualenvs/lxml.py35/lib/python3.5/site-packages/lxml/html/html5parser.py", line 64, in document_fromstring
        return parser.parse(html, useChardet=guess_charset).getroot()
      File "/home/paul/.virtualenvs/lxml.py35/lib/python3.5/site-packages/html5lib/html5parser.py", line 235, in parse
        self._parse(stream, False, None, *args, **kwargs)
      File "/home/paul/.virtualenvs/lxml.py35/lib/python3.5/site-packages/html5lib/html5parser.py", line 85, in _parse
        self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
      File "/home/paul/.virtualenvs/lxml.py35/lib/python3.5/site-packages/html5lib/_tokenizer.py", line 36, in __init__
        self.stream = HTMLInputStream(stream, **kwargs)
      File "/home/paul/.virtualenvs/lxml.py35/lib/python3.5/site-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
        return HTMLUnicodeInputStream(source, **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'useChardet'

I don't see any mention in the docs of a restriction to bytes or unicode input for lxml.html.html5parser.fromstring and related functions, and in fact the code tests for both[3] bytes and unicode/str in Python 3. So I believe it's a bug.

I also see that the html5parser tests are not run on Travis. I enabled them on my fork of lxml just to see, and they show the same issue[2] I'm having locally: the 'useChardet' argument is passed for Unicode strings, but html5lib's HTML parser only accepts it for bytes input.

There's another, separate issue with bytes input to html5parser.fromstring() (not with document_fromstring()):

    line 151, in fromstring
        if start.startswith('<html') or start.startswith('<!doctype'):
    TypeError: startswith first arg must be bytes or a tuple of bytes, not str

document_fromstring() works with bytes input:

    >>> html5parser.document_fromstring(b'<html><body><p>test</p></body></html>')
    <Element {http://www.w3.org/1999/xhtml}html at 0x7f7000219b48>
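For what it's worth, here is a minimal sketch of how both problems might be avoided. The two helper names (parser_kwargs and looks_like_document) are hypothetical and are not part of lxml or html5lib; the sketch only assumes, as the tracebacks above suggest, that 'useChardet' should be passed for bytes input only, and that the startswith() prefixes have to match the type of the input.

```python
# Hypothetical helpers, NOT lxml's actual code: a sketch of type-aware
# handling of bytes vs. str input, based on the two tracebacks above.


def parser_kwargs(html, guess_charset=True):
    """Build the extra keyword arguments for the html5lib parse() call.

    'useChardet' is only included for bytes input, because the unicode
    input stream rejects it (first traceback).
    """
    if isinstance(html, bytes):
        return {"useChardet": guess_charset}
    return {}


def looks_like_document(html):
    """Return True if the input starts like a full HTML document."""
    start = html[:50].lstrip().lower()
    if isinstance(start, bytes):
        # Compare bytes against bytes to avoid the startswith() TypeError
        # (second traceback).
        return start.startswith((b"<html", b"<!doctype"))
    return start.startswith(("<html", "<!doctype"))
```

With something like this, the call seen in the first traceback could become parser.parse(html, **parser_kwargs(html, guess_charset)), so that str input no longer reaches HTMLUnicodeInputStream with an unexpected keyword.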
Info on my setup:

    lxml     : 3.6.4.0
    libxml2  : 2.9.4
    Python   : 3.5.2 (default, Jul 5 2016, 12:43:10) - [GCC 5.4.0 20160609]
    Platform : Linux-4.4.0-34-generic-x86_64-with-Ubuntu-16.04-xenial

Let me know if I should file a bug on Launchpad or if this is tracked already (I could not find anything similar in Launchpad).

Cheers,
/Paul.

[1] http://lxml.de/html5parser.html
[2] https://travis-ci.org/redapple/lxml/jobs/154796295
[3] https://github.com/lxml/lxml/blob/master/src/lxml/html/html5parser.py#L58
Paul Tremberth