Parsing an HTML a tag

beza1e1 andreas.zwinkau at googlemail.com
Sat Sep 24 20:03:53 CEST 2005


I do not really know what you want to do. Getting the URLs from the a
tags of an HTML file? I think the easiest method would be a regular
expression.

>>> import urllib, sre
>>> html = urllib.urlopen("http://www.google.com").read()
>>> sre.findall('href="([^>]+)"', html)
['/imghp?hl=de&tab=wi&ie=UTF-8',
'http://groups.google.de/grphp?hl=de&tab=wg&ie=UTF-8',
'/dirhp?hl=de&tab=wd&ie=UTF-8',
'http://news.google.de/nwshp?hl=de&tab=wn&ie=UTF-8',
'http://froogle.google.de/frghp?hl=de&tab=wf&ie=UTF-8',
'/intl/de/options/']
>>> sre.findall('href=[^>]+>([^<]+)</a>', html)
['Bilder', 'Groups', 'Verzeichnis', 'News', 'Froogle',
'Mehr&nbsp;&raquo;', 'Erweiterte Suche', 'Einstellungen',
'Sprachtools', 'Werbung', 'Unternehmensangebote',
'Alles \xfcber Google', 'Google.com in English']
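
For readers on Python 3, where the sre module no longer exists (use re)
and urlopen lives in urllib.request, the same two patterns can be tried
roughly as follows. This is only a sketch under assumptions: the sample
string is made up so that it runs without network access, and the first
pattern uses [^"]+ instead of [^>]+ so that it stops at the closing
quote even when further attributes follow.

```python
import re

# Made-up sample markup (an assumption, not Google's real page).
html = (
    '<a href="/imghp?hl=de">Bilder</a> '
    '<a href="/intl/de/options/" class="q">Mehr</a>'
)

# URLs of quoted href attributes; [^"]+ stops at the closing quote.
urls = re.findall(r'href="([^"]+)"', html)

# Link texts: whatever sits between the end of the opening tag and </a>.
texts = re.findall(r'href=[^>]+>([^<]+)</a>', html)

print(urls)
print(texts)
```

To run this against a live page, html could instead come from
urllib.request.urlopen(url).read(), decoded to str with the page's
encoding.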

Google has some strange HTML: some href attributes come without
quotation marks, e.g.
<a href=http://www.google.com/ncr>Google.com in English</a>
The first pattern requires a quote after href=, so it skips such links.
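
If unquoted attributes like this one matter, an HTML parser is likely
more robust than a regular expression. Below is a minimal sketch using
html.parser from the Python 3 standard library, which is a different
technique from the regex above. The class name LinkCollector and the
sample string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from a tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the a tag currently open, if any
        self._text = []      # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an a tag with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


# Sample markup (assumed): one quoted and one unquoted href.
sample = (
    '<a href="/intl/de/options/">Mehr</a> '
    '<a href=http://www.google.com/ncr>Google.com in English</a>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

The parser hands over attribute values with the quotes already
stripped, so quoted and unquoted links end up in the same list.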



