[Tutor] Avoiding repetitive pattern match in re module

Kent Johnson kent37 at tds.net
Fri Jan 6 14:06:08 CET 2006


Intercodes wrote:
> Hello everyone,
> 
>     I am new to this mailing list as well as to Python (uptime: 3 weeks). Today
> I learnt about REs from http://www.amk.ca/python/howto/regex/. This one was
> really helpful. I started working through a few examples on my own. The first
> one was to collect all the HTML tags used in an HTML file.
> 
> I get the output, but with tags repeated. I want to display all the tags
> used in a file, but with no repetitions. Say the output I got for one HTML
> file was: "<html><link> <a><br><a><br>"

You might consider Beautiful Soup or another HTML parser to collect the
tags. Then put the tag names in a set to keep only the unique ones. For
example (Python 2.4 version):

  >>> import urllib
  >>> from BeautifulSoup import BeautifulSoup as BS
  >>> data = urllib.urlopen('http://www.python.org').read()
  >>> bs = BS(data)
  >>> help(bs.fetch)
Help on method fetch in module BeautifulSoup:

fetch(self, name=None, attrs={}, recursive=True, text=None, limit=None) 
method of BeautifulSoup.BeautifulSoup instance
     Extracts a list of Tag objects that match the given
     criteria.  You can specify the name of the Tag and any
     attributes you want the Tag to have.

  >>> tags = set(tag.name for tag in bs.fetch())
  >>> sorted(tags)
['a', 'b', 'body', 'br', 'center', 'div', 'font', 'form', 'h4', 'head', 
'html', 'i', 'img', 'input', 'li', 'link', 'meta', 'p', 'small', 
'table', 'td', 'title', 'tr', 'ul']
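
If you would rather stay with the re module, the same set idea applies to
the output of findall(). What follows is only a minimal sketch: the pattern
and the name unique_tags are my own assumptions, not taken from your code,
and a regex like this will also match text that merely looks like a tag
(inside comments or scripts, for instance), which is the reason a parser is
usually the safer choice.

```python
import re

# Assumed pattern: "<" followed by a tag name (a letter, then letters or digits).
# Closing tags such as </a> are skipped because "/" is not a letter.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")

def unique_tags(html):
    """Return the sorted, lowercased tag names in html, without repetitions."""
    # findall() returns every captured name, duplicates included;
    # the set keeps one copy of each.
    return sorted(set(name.lower() for name in TAG_RE.findall(html)))

sample = "<html><link> <a><br><a><br>"
print(unique_tags(sample))  # ['a', 'br', 'html', 'link']
```

The set does all the work of removing repetitions; sorted() is only there to
give a stable order for display.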

http://www.crummy.com/software/BeautifulSoup/index.html
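
As an example of "another HTML parser", here is a rough sketch using the
html.parser module from the standard library of Python 3 (not the Python 2.4
used above). The class name TagCollector is hypothetical; only HTMLParser and
its handle_starttag() hook come from the library.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the names of all start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = set()  # a set silently ignores repeated names

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name already lowercased.
        self.tags.add(tag)

collector = TagCollector()
collector.feed("<html><link> <a><br><a><br>")
collector.close()
print(sorted(collector.tags))  # ['a', 'br', 'html', 'link']
```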
Kent


