Python HTML parser chokes on UTF-8 input

Thu Oct 9 18:03:37 EDT 2008

Johannes Bauer wrote:
> Hello group,
> 
> I'm trying to use a htmllib.HTMLParser derivate class to parse a website
> which I fetched via
> httplib.HTTPConnection().request().getresponse().read(). Now the problem
> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
> code is something like this:

I believe you are confusing unicode with unicode encoded into bytes with 
the UTF-8 encoding.  You are having a problem feeding a unicode string, not 
'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
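
To make the distinction concrete, here is a minimal sketch.  It is written 
in Python 3 syntax (an assumption on my part; your traceback is from 2.5, 
where the same two types are spelled unicode and str), and the variable 
names are only illustrative.

```python
# Python 3: str holds unicode text, bytes holds encoded data.
text = "f\u00f6\u00f6"                 # "foo" with umlauts, 3 code points

# The same text gives different byte strings under different encodings.
utf8_bytes = text.encode("utf-8")      # two bytes per umlaut
latin1_bytes = text.encode("latin-1")  # one byte per umlaut

# Decoding with the matching codec recovers the original text.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text
```

Only utf8_bytes is 'UTF-8 code' in the sense above; text itself has no 
encoding at all.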
> 
> prs = self.parserclass(formatter.NullFormatter())
> prs.init()
> prs.feed(website)
> self.__result = prs.get()
> prs.close()
> 
> Now when I take "website" directly from the parser, everything is fine.
> However I want to do some modifications before I parse it, namely UTF-8
> modifications in the style:
> 
> website = website.replace(u"föö", u"bär")
> 
> Therefore, after fetching the web site content, I have to convert it to
> UTF-8 first, modify it and convert it back:
> 
> website = website.decode("latin1") # produces unicode
> website = website.replace(u"föö", u"bär") #remains unicode
> website = website.encode("latin1") # produces byte string  in the latin-1 encoding
> 
> This is incredibly ugly IMHO, as I would really like the parser to just
> accept UTF-8 input.

To me, code that works is prettier than code that does not.
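
If the three lines bother you, one option is to hide them in a small 
helper.  The sketch below uses Python 3 syntax, and the function name 
replace_in_encoded is hypothetical, not something from htmllib or any other 
library.

```python
def replace_in_encoded(data, old, new, encoding="latin-1"):
    """Decode bytes, replace text, and re-encode with the same codec."""
    text = data.decode(encoding)    # bytes -> unicode text
    text = text.replace(old, new)   # do the replacement on text
    return text.encode(encoding)    # unicode text -> bytes

# Example: a Latin-1 encoded page fragment.
page = "<p>f\u00f6\u00f6</p>".encode("latin-1")
result = replace_in_encoded(page, "f\u00f6\u00f6", "b\u00e4r")
```

The call site then stays a single line, and the encoding is named in 
exactly one place.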

In 3.0, text strings are unicode, and I believe that is what the parser 
now accepts.
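
As a rough illustration, assuming the html.parser module of current 
Python 3 (htmllib itself is not available there), feeding decoded text 
looks something like this.  The class name TextCollector is made up for the 
example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text content found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Bytes as they might arrive from the network, here Latin-1 encoded.
raw = "<p>b\u00e4r</p>".encode("latin-1")

parser = TextCollector()
parser.feed(raw.decode("latin-1"))  # decode first: feed() takes text
parser.close()
collected = "".join(parser.chunks)
```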

>However when I omit the reecoding to latin1:
> 
>   File "CachedWebParser.py", line 13, in __init__
>     self.__process(website)
>   File "CachedWebParser.py", line 55, in __process
>     prs.feed(website)
>   File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
>     self.goahead(0)
>   File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
>     k = self.parse_starttag(i)
>   File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
>     self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
> ordinal not in range(128)

When you do not bother to specify an encoding where unicode and byte 
strings get mixed, sgmllib or something deeper in Python falls back to the 
default encoding (ASCII, per the traceback), which cannot decode byte 0xfc.  
Stop being annoyed and tell the interpreter what you want.  It is not a 
mind-reader.
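
A small sketch of the difference, again in Python 3 syntax.  Python 3 does 
no implicit decoding, so the ASCII attempt is spelled out here to imitate 
what 2.5 does behind your back; the sample bytes are my own, chosen to 
contain the 0xfc from your traceback.

```python
raw = b"\xfcber"  # Latin-1 bytes; 0xfc is outside the ASCII range

# What the default codec does with such bytes: it fails.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the real encoding explicitly works.
text = raw.decode("latin-1")
```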

> Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
> input - which should (again, IMHO) be the absolute standard for such a
> new language.

Work on Python began in 1989, I believe, and the first release came out in 
1991, before unicode was in common use.  One of the features of the new 3.0 
version is that it uses unicode as the standard for text.

Terry Jan Reedy