[Tutor] How do I make pattern to find only '.html' file using Python Regular Expression?

Alan Gauld alan.gauld at btinternet.com
Thu Apr 2 01:02:17 CEST 2015


On 01/04/15 20:22, Abdullah Al Imran wrote:
> I have some HTML content where there are many links as the following pattern:
>
> <a href="http://example.com/2013/01/problem1.html">Problem No-1</a><br />
>
> I want to filter all the links into a list as:
> ['http://example.com/2013/01/problem1.html', 'http://example.com/2013/02/problem2.html']
>
> How to do it using Python Regular Expression?

You can try, but regular expressions are not a reliable way
to parse HTML.
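
If you do want to try it, here is a rough sketch of the regex route.
The sample text, the third link and the about.php link are made up
for illustration, and the pattern assumes every href is wrapped in
double quotes, which real pages do not guarantee.

```python
import re

# Hypothetical sample; only the first two URLs come from the question.
html = """<a href="http://example.com/2013/01/problem1.html">Problem No-1</a><br />
<a href="http://example.com/2013/02/problem2.html">Problem No-2</a><br />
<a href='http://example.com/2013/03/problem3.html'>Problem No-3</a><br />
<a href="http://example.com/about.php">About</a>"""

# Capture double-quoted href values that end in .html
pattern = re.compile(r'href="([^"]+\.html)"')
links = pattern.findall(html)
print(links)
```

Note that the single-quoted problem3 link is silently skipped. Small
variations in the markup like that are why the regex approach is fragile.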

You are much better off using a dedicated HTML parser, such
as htmllib in the Python 2 standard library (html.parser in
Python 3), or a third party tool like BeautifulSoup.

These recognise the different tag types and separate the markup
from the data for you. You can then just ask the parser to
find the <a ...> tags and fetch the href attribute from each one.
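
As a minimal sketch of the parser approach, assuming Python 3 and its
standard html.parser module: the LinkCollector class and the sample
text are my own illustrative names, not anything from your code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values ending in .html from <a> tags (illustrative)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(".html"):
                self.links.append(value)


# Hypothetical sample; the second link uses upper case and single quotes.
sample = """<a href="http://example.com/2013/01/problem1.html">Problem No-1</a><br />
<A HREF='http://example.com/2013/02/problem2.html'>Problem No-2</A><br />
<a href="http://example.com/about.php">About</a>"""

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

The parser copes with the quoting and case differences itself, so the
only test you have to write is the check for the .html ending.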

HTH
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



