[Tutor] The dreaded UnicodeDecodeError... why, why, why does it still want ascii?

Wed Jun 6 03:10:39 CEST 2012

Hello all - it's been a while!

I'm trying to parse a webpage using lxml; every time I try, I'm
rewarded with "UnicodeDecodeError: 'ascii' codec can't decode byte
0x?? in position ?????: ordinal not in range(128)" (the byte value and
the position occasionally change; the error never does).

The page's encoding is UTF-8:
     <meta http-equiv="content-type" content="text/html; charset=utf-8" />
so I have tried:
-  setting HTMLParser's encoding to 'utf-8'
-  reading the page first, decoding as 'utf-8', then re-encoding as
'ascii' with options 'replace' or 'ignore'
-  and various combinations thereof
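
As a sanity check, the decode/re-encode step from the second bullet can
be tried in isolation. This is only a minimal sketch: it uses a made-up
sample byte string instead of the real page, and strip_non_ascii and
sample are illustrative names that are not part of my script.

```python
from __future__ import print_function


def strip_non_ascii(raw_bytes, errors='ignore'):
    """Decode UTF-8 bytes, then re-encode them as ASCII.

    errors='ignore' drops every non-ASCII character;
    errors='replace' substitutes '?' for each one.
    """
    text = raw_bytes.decode('utf-8')
    return text.encode('ascii', errors)


# Hypothetical sample, not taken from the page: "caf" + e-acute,
# a space, an em dash (UTF-8 bytes E2 80 94), a space, then "ok".
sample = b'caf\xc3\xa9 \xe2\x80\x94 ok'

print(strip_non_ascii(sample))             # non-ASCII characters dropped
print(strip_non_ascii(sample, 'replace'))  # non-ASCII characters become '?'
```

If the result of this step really contains only bytes below 0x80, a
later complaint about a byte such as 0x94 presumably comes from
something other than this string's contents.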

Here's my current version, trying everything at once:

from __future__ import print_function
import datetime
import urllib2
from lxml import etree
url = 'http://www.wpc-edi.com/reference/codelists/healthcare/claim-adjustment-reason-codes/'
page = urllib2.urlopen(url)
pagecontents = page.read()
pagecontents = pagecontents.decode('utf-8')
pagecontents = pagecontents.encode('ascii', 'ignore')
tree = etree.parse(pagecontents, etree.HTMLParser(encoding='utf-8',recover=True))

and here's the result:
Traceback (most recent call last):
  File "etreeTest.py", line 10, in <module>
    tree = etree.parse(pagecontents, etree.HTMLParser(encoding='utf-8',recover=True))
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 579, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71894)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 63953: ordinal not in range(128)
Script terminated.

I'm at my wit's end: how do I either change HTMLParser's codec to
UTF-8, or strip non-ASCII characters out of the stream?  What am I
missing?

Environment:
Python 2.7.3 (32-bit) on Windows 7 Ultimate (64-bit)
lxml 2.3