Parsing a HTML file for links?

William Park parkw at better.net
Wed May 5 02:27:45 EDT 1999


On Tue, May 04, 1999 at 09:59:30PM -0700, Zigron wrote:
>     I've never used the HTMLParser class(or SGML?), or the formatter thing,
> et al, and they confuse me a little.
> 
>     What I want to do is go through a HTML file, and spit out a
> dictionary based on the links, and title of the file. I want a dictionary,
> I guess, of like,
> {"text-between-anchor-tags":["Destination1","DestinationN.."]}
> 
> The dictionary has a list in it because the same text might have more than
> one destination... I can't figure out how to get this to work :) Anyone
> have any ideas?
> 
> --Stephen

Well, I've never used it either.  But the solution to your problem
seems to be:
    - search for '<a ... href=xxx ...>yyy</a>', then
    - store 'xxx' and 'yyy' in a dictionary.

Assuming the pattern occurs all in one line for simplicity,

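    import re

    data = {}
    s = open('abc.html', 'r').read()
    # findall() returns (link, text) tuples when the pattern has two groups;
    # the href value may be quoted or bare, and may be the last attribute.
    for link, text in re.findall(r'<a\s[^>]*?href=["\']?([^"\'\s>]+)[^>]*>(.*?)</a>', s):
        if text in data:
            data[text].append(link)
        else:
            data[text] = [link]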

To search across newlines, pass the re.DOTALL flag (to re.compile(), or
directly to re.findall()) so that '.' also matches a newline.
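
As a rough sketch of the DOTALL variant (untested against real-world pages):
the function name collect_links and the sample HTML below are made up for
illustration, and I've also added IGNORECASE on the assumption that tags may
be written in upper case.  Anchor text that wraps across lines is collapsed
to single spaces so that it can serve as one dictionary key.

```python
import re

# Compiled once; DOTALL lets '.' span newlines, IGNORECASE accepts <A HREF=...>.
ANCHOR = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)[^>]*>(.*?)</a>',
    re.DOTALL | re.IGNORECASE)

def collect_links(html):
    """Return {anchor text: [destination, ...]} for the given HTML string."""
    data = {}
    for link, text in ANCHOR.findall(html):
        # Collapse runs of whitespace (including newlines) in the anchor text.
        text = ' '.join(text.split())
        data.setdefault(text, []).append(link)
    return data

# Illustrative input only: two anchors share the same text, one wraps a line.
sample = '''<a href="one.html">Home
page</a> <A class=x HREF='two.html'>Home page</A>
<a href=three.html>Other</a>'''

print(collect_links(sample))
```

A regular expression like this will still trip over nested tags, comments,
and other odd markup, so treat it as a quick hack rather than a real parser.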

William Park



