[Tutor] Avoiding repetitive pattern match in re module
Kent Johnson
kent37 at tds.net
Fri Jan 6 14:06:08 CET 2006
Intercodes wrote:
> Hello everyone,
>
> I am new to this mailing list as well as to Python (uptime: 3 weeks). Today
> I learnt about REs from http://www.amk.ca/python/howto/regex/. This one
> was really helpful. I started working through a few examples on my own.
> The first one was to collect all the HTML tags used in an HTML file.
>
> I get the output, but with tags repeated. I want to display all the tags
> used in a file, but with no repetitions. Say the output for one of the
> HTML files I got was: "<html><link> <a><br><a><br>"
You might consider Beautiful Soup or another HTML parser to collect the
tags, then use a set to keep only the unique ones. For example (Python 2.4
version):
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup as BS
>>> data = urllib.urlopen('http://www.python.org').read()
>>> bs = BS(data)
>>> help(bs.fetch)
Help on method fetch in module BeautifulSoup:

fetch(self, name=None, attrs={}, recursive=True, text=None, limit=None) method of BeautifulSoup.BeautifulSoup instance
    Extracts a list of Tag objects that match the given
    criteria. You can specify the name of the Tag and any
    attributes you want the Tag to have.
>>> tags = set(tag.name for tag in bs.fetch())
>>> sorted(tags)
['a', 'b', 'body', 'br', 'center', 'div', 'font', 'form', 'h4', 'head',
'html', 'i', 'img', 'input', 'li', 'link', 'meta', 'p', 'small',
'table', 'td', 'title', 'tr', 'ul']
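
If you would rather stay with the re module, the same set idea applies to the list that findall() returns. What follows is only a rough sketch, not a robust HTML parser: the pattern, the name unique_tags and the sample string (taken from the output quoted above) are my own assumptions, and it is written so that it also runs on current Python 3.

```python
import re

# Assumed pattern: '<' followed directly by a tag name.
# Closing tags ('</a>'), comments and doctypes do not match.
TAG_NAME = re.compile(r'<([A-Za-z][A-Za-z0-9]*)')

def unique_tags(text):
    """Return the sorted, de-duplicated opening-tag names found in text."""
    # findall() returns every match, repeats included; the set drops repeats.
    return sorted(set(name.lower() for name in TAG_NAME.findall(text)))

sample = '<html><link> <a><br><a><br>'
print(unique_tags(sample))
```

A regular expression will be fooled by things like a '<' inside a script or a comment, which is the main reason to prefer a real parser for anything beyond practice.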
http://www.crummy.com/software/BeautifulSoup/index.html
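
If installing a third-party package is not an option, the parser-plus-set approach can probably be sketched with the standard library alone. This assumes Python 3, where the module is html.parser; the class name TagCollector is made up for the illustration.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the name of every start tag into a set (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lower-cased.
        self.tags.add(tag)

collector = TagCollector()
collector.feed('<html><link> <a><br><a><br>')
collector.close()
print(sorted(collector.tags))
```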
Kent