[Tutor] The dreaded UnicodeDecodeError... why, why, why does it still want ascii?

Wed Jun 6 10:21:08 CEST 2012

On Tue, Jun 5, 2012 at 11:22 PM, Stefan Behnel <stefan_ml at behnel.de> wrote:

> You can do this:
>
>    connection = urllib2.urlopen(url)
>    tree = etree.parse(connection, my_html_parser)
>
> Alternatively, use fromstring() to parse from strings:
>
>    page = urllib2.urlopen(url)
>    pagecontents = page.read()
>    html_root = etree.fromstring(pagecontents, my_html_parser)
>
>
Thank you!  fromstring() did the trick for me.

Interestingly, your first suggestion - parsing straight from the connection
without an intermediate read() - appears to create the tree successfully,
but my first strip_tags() fails, with the error "ValueError: Input object
has no document: lxml.etree._ElementTree".  Since fromstring() works just
fine, I will set this aside as a mystery for my copious free time (after
this project is done, for example).

> See the lxml tutorial.

I did - I've been consulting it religiously - but I missed the fact that I
was mixing strings with file-like IO, and (as you mentioned) the error
message really wasn't helping me figure out my problem.  Perhaps I should
have figured it out from the fact that the character value and position
change, even though the webpage doesn't... but no.
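
For the archives, here is a minimal sketch of the string vs. file-like
distinction I tripped over.  It is not my actual script: the sample markup
is a made-up, well-formed stand-in for what page.read() returns, io.BytesIO
stands in for the urllib2 connection, and I have left out the HTML parser
argument so that the same calls also run against the standard library's
ElementTree if lxml is not installed (that fallback is my assumption about
what is close enough for illustration).

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree shares enough of the API for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the bytes returned by page.read().
pagecontents = b"<html><body><p>Hello, <b>world</b></p></body></html>"

# parse() expects a filename or a file-like object, and it returns a tree
# object that wraps the document; getroot() gives the root element.
tree = etree.parse(io.BytesIO(pagecontents))
root_from_parse = tree.getroot()

# fromstring() expects the markup itself and returns the root element directly.
root_from_string = etree.fromstring(pagecontents)

print(type(tree).__name__, root_from_parse.tag)
print(type(root_from_string).__name__, root_from_string.tag)
```

The point is only that parse() and fromstring() take different kinds of
input and hand back different kinds of object (a tree versus an element);
I am not claiming this explains the ValueError above.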

> Also note that there's lxml.html, which provides an
> extended tool set for HTML processing.
>

I've been using lxml.etree because I'm used to the syntax, and because
(perhaps mistakenly) I was under the impression that its parser was more
resilient in the face of broken HTML - this page has unclosed tags all over
the place.  I'll try lxml.html, but (again) it'll have to be some time
later.