[Tutor] The dreaded UnicodeDecodeError... why, why, why does it still want ascii?
Marc Tompkins
marc.tompkins at gmail.com
Wed Jun 6 10:21:08 CEST 2012
On Tue, Jun 5, 2012 at 11:22 PM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> You can do this:
>
> connection = urllib2.urlopen(url)
> tree = etree.parse(connection, my_html_parser)
>
> Alternatively, use fromstring() to parse from strings:
>
> page = urllib2.urlopen(url)
> pagecontents = page.read()
> html_root = etree.fromstring(pagecontents, my_html_parser)
>
>
Thank you! fromstring() did the trick for me.
Interestingly, your first suggestion (parsing straight from the connection
without an intermediate read()) appears to create the tree successfully,
but my first strip_tags() call fails with the error "ValueError: Input
object has no document: lxml.etree._ElementTree". Since fromstring() works
just fine, I will set this aside as a mystery for my copious free time
(after this project is done, for example).
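
For reference, the two entry points discussed above can be compared side by
side in a minimal sketch. It assumes Python 3 and an installed lxml;
io.BytesIO stands in for the urllib2 connection, and the HTML snippet and
variable names are made up for illustration. It does not reproduce the
ValueError; it only shows what each call returns.

```python
import io

from lxml import etree

# Hypothetical page content with an unclosed <p>, standing in for the real page.
html_bytes = b"<html><body><p>Hello <b>world</b><p>unclosed</body></html>"
parser = etree.HTMLParser()

# parse() expects a file name or file-like object and returns an _ElementTree.
tree = etree.parse(io.BytesIO(html_bytes), parser)
root_from_file = tree.getroot()

# fromstring() expects the content itself and returns the root element directly.
root_from_string = etree.fromstring(html_bytes, parser)

# strip_tags() removes the <b> tags but keeps their text in the parent element.
etree.strip_tags(root_from_string, "b")
```

Note that parse() hands back an _ElementTree wrapper, so getroot() is needed
to reach the same kind of element that fromstring() returns directly.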
> See the lxml tutorial.
I did - I've been consulting it religiously - but I missed the fact that I
was mixing strings with file-like IO, and (as you mentioned) the error
message really wasn't helping me figure out my problem. Perhaps I should
have figured it out from the fact that the character value and position
reported in the error change, even though the webpage doesn't... but no.
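
As a small illustration of where the "character value and position" in such
a message come from, here is a sketch assuming Python 3, where the ASCII
decode has to be requested explicitly (Python 2 performed it implicitly when
byte strings and unicode were mixed). The sample text is invented.

```python
# Hypothetical sample: UTF-8 encoded bytes containing one non-ASCII character.
data = "café".encode("utf-8")

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # The exception records the codec, the offending byte and its offset.
    codec = exc.encoding
    position = exc.start
    failing_byte = exc.object[exc.start]

# Decoding with the encoding the bytes were actually written in succeeds.
text = data.decode("utf-8")
```

The reported byte and offset depend entirely on the bytes handed to the
decoder, so a different input chunk gives a different message.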
> Also note that there's lxml.html, which provides an
> extended tool set for HTML processing.
>
I've been using lxml.etree because I'm used to the syntax, and because
(perhaps mistakenly) I was under the impression that its parser was more
resilient in the face of broken HTML - this page has unclosed tags all over
the place. I'll try lxml.html, but (again) it'll have to be some time
later.