Python HTML parser chokes on UTF-8 input
Terry Reedy
tjreedy at udel.edu
Thu Oct 9 18:03:37 EDT 2008
Johannes Bauer wrote:
> Hello group,
>
> I'm trying to use a htmllib.HTMLParser derivate class to parse a website
> which I fetched via
> httplib.HTTPConnection().request().getresponse().read(). Now the problem
> is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
> code is something like this:
I believe you are confusing unicode with unicode encoded into bytes with
the UTF-8 encoding. You are having a problem feeding a unicode string, not
'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
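A minimal sketch of that distinction, written for Python 3 (an assumption on my part; the traceback in this thread is from 2.5, where the two types are spelled unicode and str rather than str and bytes). The sample word is only illustrative:

```python
# Text (a sequence of code points) versus the same text encoded into bytes.
text = "föö"                      # text string: 3 code points
utf8 = text.encode("utf-8")       # b'f\xc3\xb6\xc3\xb6': 5 bytes
latin1 = text.encode("latin-1")   # b'f\xf6\xf6': 3 bytes

# Decoding with the matching codec recovers the original text.
roundtrip_ok = (utf8.decode("utf-8") == text
                and latin1.decode("latin-1") == text)

# Decoding with the wrong codec raises no error here, but yields different text.
mojibake = utf8.decode("latin-1")
```

The same three characters produce different byte strings under different encodings, so "UTF-8" only describes the bytes, never the text string itself.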
>
> prs = self.parserclass(formatter.NullFormatter())
> prs.init()
> prs.feed(website)
> self.__result = prs.get()
> prs.close()
>
> Now when I take "website" directly from the parser, everything is fine.
> However I want to do some modifications before I parse it, namely UTF-8
> modifications in the style:
>
> website = website.replace(u"föö", u"bär")
>
> Therefore, after fetching the web site content, I have to convert it to
> UTF-8 first, modify it and convert it back:
>
> website = website.decode("latin1") # produces unicode
> website = website.replace(u"föö", u"bär") #remains unicode
> website = website.encode("latin1") # produces byte string in the latin-1 encoding
>
> This is incredibly ugly IMHO, as I would really like the parser to just
> accept UTF-8 input.
To me, code that works is prettier than code that does not.
In 3.0, text strings are unicode, and I believe that is what the parser
now accepts.
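A hedged sketch of what that looks like in Python 3, assuming the html.parser module is used (htmllib and sgmllib do not exist in 3.x). The LinkCollector class and the sample markup are hypothetical stand-ins, not your code:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already text strings.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Stand-in for the bytes read from the HTTP response (latin-1 assumed).
raw = '<a href="/bär">föö</a>'.encode("latin-1")

website = raw.decode("latin-1")           # bytes -> text, once, at the boundary
website = website.replace("föö", "bär")   # modify as text
prs = LinkCollector()
prs.feed(website)                         # the parser takes text directly
prs.close()
```

The decode still has to happen once, where the bytes enter the program, but no re-encoding is needed before feeding the parser.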
> However when I omit the re-encoding to latin1:
>
> File "CachedWebParser.py", line 13, in __init__
> self.__process(website)
> File "CachedWebParser.py", line 55, in __process
> prs.feed(website)
> File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
> self.goahead(0)
> File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
> k = self.parse_starttag(i)
> File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
> self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
> ordinal not in range(128)
When you do not bother to specify some other encoding in an encoding or
decoding operation, sgmllib or something deeper in Python tries the default
encoding (ascii, per the traceback above), which does not work for a byte
like 0xfc. Stop being annoyed and tell the interpreter what you want. It is
not a mind-reader.
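A small sketch of the failure and the fix, again assuming Python 3. There the ascii codec has to be named explicitly to reproduce what 2.5 tried implicitly when unicode and byte strings were mixed; the sample bytes are made up, chosen only because they start with 0xfc:

```python
# b"\xfcber" is "über" encoded as latin-1; 0xfc is the byte from the traceback.
raw = b"\xfcber"

# What the default (ascii) codec does with it: fail on the first byte.
try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Telling the interpreter the actual encoding works.
text = raw.decode("latin-1")
```

The error text matches the one in the traceback because ascii only covers bytes 0 through 127, and 0xfc is 252.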
> Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
> input - which should (again, IMHO) be the absolute standard for such a
> new language.
The first version of Python came out in 1989, I believe, years before
unicode. One of the features of the new 3.0 version is that it uses
unicode as the standard for text.
Terry Jan Reedy