Parsing a HTML file for links?
William Park
parkw at better.net
Wed May 5 02:27:45 EDT 1999
On Tue, May 04, 1999 at 09:59:30PM -0700, Zigron wrote:
> I've never used the HTMLParser class(or SGML?), or the formatter thing,
> et al, and they confuse me a little.
>
> What I want to do is go through a HTML file, and spit out a
> dictionary based on the links, and title of the file. I want a dictionary,
> I guess, of like,
> {"text-between-anchor-tags":["Destination1","DestinationN.."]}
>
> The dictionary has a list in it because the same text might have more than
> one destination... I can't figure out how to get this to work :) Anyone
> have any ideas?
>
> --Stephen
Well, I've never used it either. But the solution to your problem
seems to be
- search for '<a ... href=xxx ...>yyy</a>', then
- store 'xxx' and 'yyy' in a dictionary.
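Assuming, for simplicity, that each anchor fits on one line:

    import re

    data = {}
    s = open('abc.html', 'r').read()
    # With two groups, re.findall() returns (link, text) tuples of strings,
    # not match objects, so unpack them directly.
    # The href value may be quoted or bare, and may be the last attribute.
    pattern = r'<a\s[^>]*?href=["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>'
    for link, text in re.findall(pattern, s):
        if text in data:
            data[text].append(link)
        else:
            data[text] = [link]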
To search across newlines, pass the re.DOTALL flag to re.compile().
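A minimal sketch of the DOTALL variant, assuming anchors may wrap across lines and that tag names may be upper case. The sample HTML string and the names `anchor` and `key` are made up for illustration, and the pattern is only one plausible way to tolerate quoted or unquoted href values.

```python
import re

# Hypothetical sample input: the first anchor's text spans two lines,
# and two different links share the text "Second".
html = """<p><a href="one.html">First
link</a> and <A HREF='two.html' class="x">Second</A>
and <a href=three.html>Second</a></p>"""

# re.DOTALL lets '.' match newlines; re.IGNORECASE also accepts <A HREF=...>.
anchor = re.compile(
    r'<a\s[^>]*?href=["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>',
    re.DOTALL | re.IGNORECASE,
)

data = {}
for link, text in anchor.findall(html):
    # Collapse runs of whitespace (including newlines) in the anchor text.
    key = ' '.join(text.split())
    data.setdefault(key, []).append(link)

print(data)
# {'First link': ['one.html'], 'Second': ['two.html', 'three.html']}
```

A regular expression like this will still trip over nested tags inside the anchor text, commented-out markup, and similar cases, so treat it as a quick hack rather than a real HTML parser.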
William Park