Web page special characters encoding
MRAB
python at mrabarnett.plus.com
Sat Jul 10 13:09:12 EDT 2010
mattia wrote:
> Hi all, I'm using py3k and the urllib package to download web pages. Can
> you suggest a package that can translate HTML character entities
> like "&egrave;", "&ograve;", "&eacute;" into the corresponding
> characters?
>
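import re
from html.entities import entitydefs

# The downloaded web page will be bytes, so decode it to a string.
# Use the page's actual charset here if it is not ISO-8859-1.
webpage = downloaded_page.decode("iso-8859-1")

# Then decode the named HTML entities, leaving unknown names untouched
# (indexing entitydefs directly would raise KeyError on them).
webpage = re.sub(r"&(\w+);", lambda m: entitydefs.get(m.group(1), m.group(0)), webpage)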
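The regex above only handles named entities. As a rough alternative sketch, Python 3.4 and later (so newer than this thread) provide html.unescape() in the standard library, which also decodes numeric references such as "&#232;" and "&#xE9;". The sample bytes below are a made-up stand-in for whatever urllib returned, and ISO-8859-1 is assumed as the page encoding, as in the snippet above.

```python
import html

# Hypothetical sample standing in for the bytes downloaded with urllib.
downloaded_page = "<p>caff&egrave; &amp; t&#232; &#xE9;</p>".encode("iso-8859-1")

# Decode the bytes to a string first (assumed charset: ISO-8859-1).
webpage = downloaded_page.decode("iso-8859-1")

# html.unescape() replaces named and numeric character references.
decoded = html.unescape(webpage)
print(decoded)
```

Note that this also turns "&amp;" and "&lt;" into literal "&" and "<", so it is best applied to extracted text rather than to markup that will be parsed again afterwards.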