Web page special characters encoding

mattia gervaz at gmail.com
Sat Jul 10 19:17:03 EDT 2010


On Sat, 10 Jul 2010 16:24:23 +0000, mattia wrote:

> Hi all, I'm using py3k and the urllib package to download web pages. Can
> you suggest a package that can translate HTML character entities
> like "&egrave;", "&ograve;", "&eacute;" into the corresponding correctly
> encoded characters?
> 
> Thanks,
> Mattia

Basically I'm trying to fetch an HTML page and strip out all the tags 
to obtain just plain text. John Nagle and Christian Heimes somehow 
figured out what I'm trying to do ;-)

Here is what I've done so far, thanks to your suggestions:

import lxml.html
import lxml.html.clean
import urllib.request
import urllib.parse
from html.entities import entitydefs
import re
import sys

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
                         "rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"}

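def replace(m):
    # Map a named entity such as &egrave; to its character; leave unknown
    # names untouched, including the surrounding "&" and ";".
    if m.group(1) in entitydefs:
        return entitydefs[m.group(1)]
    else:
        return m.group(0)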

def test(page):
    req = urllib.request.Request(page, None, HEADERS)
    page = urllib.request.urlopen(req)
    charset = page.info().get_content_charset()
    if charset is not None:
        html = page.read().decode(charset)
    else:
        html = page.read().decode("iso-8859-1")
    html = re.sub(r"&(\w+);", replace, html)
    cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, style=True)
    html = cleaner.clean_html(html)
    # create the element tree
    tree = lxml.html.document_fromstring(html)
    txt = tree.text_content()
    for x in txt.split():
        # The DOS shell cannot print characters like '\u20ac' (probably
        # because its code page lacks them), so skip words that fail to
        # encode.
        try:
            print(x)
        except UnicodeEncodeError:
            continue

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage:", sys.argv[0], "<webpage>")
        print("Example:", sys.argv[0], "http://www.bing.com")
        sys.exit()
    test(sys.argv[1])
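
On the entity-replacement step in the script above: a minimal sketch, assuming Python 3.4 or later, where the standard library's html.unescape is available (it did not exist when this was posted). It handles named references as well as numeric ones such as &#232;, which the regular expression above does not cover. The function name decode_entities is only illustrative.

```python
import html

def decode_entities(text):
    """Replace named and numeric character references with their characters.

    Unknown names such as "&zzz;" are left as they are.
    """
    return html.unescape(text)

# Named and numeric references in one string.
print(decode_entities("caff&egrave; &amp; t&#232;"))
```

Note that lxml resolves entities on its own while parsing, so a step like this may only be needed when working on the raw text without a parser.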

Any new tips will be appreciated.
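
On the printing problem in the loop above: a likely cause is that the console's code page (cp437, for example) has no mapping for a character such as '\u20ac', so print raises UnicodeEncodeError. A rough sketch of one workaround, assuming it is acceptable to show "?" in place of such characters instead of dropping the whole word; console_safe is a hypothetical helper, not part of any library.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to the console's own encoding, then to ASCII if it is unknown.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

# cp437 is passed explicitly here only to make the example reproducible.
print(console_safe("price: \u20ac5", encoding="cp437"))
```

In the script, print(console_safe(x)) could then stand in for the try/except around print(x).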

Ciao,
Mattia


