Python HTML parser chokes on UTF-8 input
Johannes Bauer
dfnsonfsduifb at gmx.de
Thu Oct 9 16:54:59 EDT 2008
Hello group,
I'm trying to use a class derived from htmllib.HTMLParser to parse a website
which I fetched via
httplib.HTTPConnection().request().getresponse().read(). Now the problem
is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
code is something like this:
prs = self.parserclass(formatter.NullFormatter())
prs.init()
prs.feed(website)
self.__result = prs.get()
prs.close()
Now when I pass "website" to the parser exactly as fetched, everything is
fine. However, I want to do some modifications before I parse it, namely
Unicode string replacements in the style:
website = website.replace(u"föö", u"bär")
Therefore, after fetching the web site content, I have to decode it to a
unicode object first, modify it and encode it back:
website = website.decode("latin1")
website = website.replace(u"föö", u"bär")
website = website.encode("latin1")
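A minimal sketch of that round trip wrapped in a helper, so the encoding is named in one place only. The function name replace_text and the sample page are made up for illustration, and the charset has to be the one the server actually sent (latin1 is assumed here, as above):

```python
def replace_text(raw, old, new, encoding):
    """Decode raw bytes, replace old with new, and encode the result back."""
    text = raw.decode(encoding)    # bytes -> Unicode text
    text = text.replace(old, new)  # replacement works on characters, not bytes
    return text.encode(encoding)   # Unicode text -> bytes for the parser


# Hypothetical sample page; \xf6 is "ö" and \xe4 is "ä".
page = u"<p>f\xf6\xf6</p>".encode("latin1")
result = replace_text(page, u"f\xf6\xf6", u"b\xe4r", "latin1")
```

The parser then still receives a byte string in the original encoding, which is the case that works.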
This is incredibly ugly IMHO, as I would really like the parser to just
accept UTF-8 input. However, when I omit the re-encoding to latin1, I get:
  File "CachedWebParser.py", line 13, in __init__
    self.__process(website)
  File "CachedWebParser.py", line 55, in __process
    prs.feed(website)
  File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
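The message above is the one Python produces when a byte string starting with 0xfc is decoded with the ASCII codec. The sketch below reproduces only that step. The helper name ascii_decode_error is made up, and it is an assumption (not something the traceback proves) that sgmllib gets there by mixing a byte string with the unicode object passed to feed():

```python
def ascii_decode_error(raw):
    """Return the error message from decoding raw as ASCII, or None on success."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# 0xfc is "ü" in latin1; built via encode() to get a one-byte string.
raw = u"\xfc".encode("latin1")
message = ascii_decode_error(raw)
as_latin1 = raw.decode("latin1")
```

Decoding the same byte as latin1 succeeds, which would fit the observation that the parser only works when it is fed byte strings throughout.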
Annoying, IMHO, that the built-in HTML parser cannot cope with UTF-8
input - which should (again, IMHO) be the absolute standard for such a
new language.
Can I do something about it?
Regards,
Johannes
--
"My countersuit against you will then be for deliberate mendacity,
slander of God, the Bible and me, and deliberate blasphemy."
-- Prophet and visionary Hans Joss aka HJP in de.sci.physik
<48d8bf1d$0$7510$5402220f at news.sunrise.ch>