Web page special characters encoding
mattia
gervaz at gmail.com
Sat Jul 10 19:17:03 EDT 2010
On Sat, 10 Jul 2010 16:24:23 +0000, mattia wrote:
> Hi all, I'm using py3k and the urllib package to download web pages. Can
> you suggest a package that can translate reserved characters in HTML
> like "&egrave;", "&ograve;", "&eacute;" into the corresponding correct
> encoding?
>
> Thanks,
> Mattia
Basically I'm trying to fetch an HTML page and strip out all the tags
to obtain just plain text. John Nagle and Christian Heimes somehow
figured out what I'm trying to do ;-)
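To make the goal concrete, here is a minimal sketch of "strip the tags,
decode the entities" using only the standard library. The names
TextExtractor and html_to_text are hypothetical, and it assumes a Python
version where HTMLParser accepts convert_charrefs (3.4 or later). It is an
illustration of the idea, not a replacement for a real HTML cleaner.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &egrave; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside <script>/<style>.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of `markup` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

For example, html_to_text("<p>caff&egrave; &amp; t&egrave;</p>") gives back
the text with the entities already turned into characters.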
Here is what I've done so far, thanks to your suggestions:
import re
import sys
import urllib.request
from html.entities import entitydefs

import lxml.html
import lxml.html.clean

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
                  "rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"
}


def replace(m):
    # Substitute a named entity such as &egrave; with its character.
    # Unknown names are left untouched, including the "&" and ";".
    # Note: this also expands &lt;, &gt; and &amp; before lxml parses the page.
    if m.group(1) in entitydefs:
        return entitydefs[m.group(1)]
    else:
        return m.group(0)


def test(page):
    req = urllib.request.Request(page, None, HEADERS)
    response = urllib.request.urlopen(req)
    # Use the charset from the Content-Type header, else fall back to Latin-1.
    charset = response.info().get_content_charset()
    if charset is None:
        charset = "iso-8859-1"
    html = response.read().decode(charset)
    html = re.sub(r"&(\w+);", replace, html)
    cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, style=True)
    html = cleaner.clean_html(html)
    # create the element tree
    tree = lxml.html.document_fromstring(html)
    txt = tree.text_content()
    for x in txt.split():
        # DOS shell is not able to print characters like u'\u20ac' - why???
        try:
            print(x)
        except UnicodeEncodeError:
            # Skip words the console encoding cannot represent.
            continue


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage:", sys.argv[0], "<webpage>")
        print("Example:", sys.argv[0], "http://www.bing.com")
        sys.exit()
    test(sys.argv[1])
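On the printing problem: as far as I understand it, print() encodes each
string with the console's encoding, and an old DOS code page such as cp437
has no euro sign, hence the UnicodeEncodeError. A minimal sketch of a
workaround follows. console_safe is a hypothetical helper name, and cp437 is
only an assumed code page for the example; it replaces unencodable
characters with "?" instead of dropping the whole word.

```python
import sys


def console_safe(text, encoding=None):
    """Return `text` with characters the console cannot encode replaced by '?'."""
    # Assumption: fall back to ASCII when the stream reports no encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# cp437 has no euro sign, so it is replaced.
print(console_safe("price: 10 \u20ac", "cp437"))  # prints: price: 10 ?
```

In the loop above this would mean calling print(console_safe(x)) rather than
catching the exception.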
Any further tips will be appreciated.
Ciao,
Mattia