[Tutor] How do I make pattern to find only '.html' file using Python Regular Expression?
Alan Gauld
alan.gauld at btinternet.com
Thu Apr 2 01:02:17 CEST 2015
On 01/04/15 20:22, Abdullah Al Imran wrote:
> I have some HTML content where there are many links as the following pattern:
>
> <a href="http://example.com/2013/01/problem1.html">Problem No-1</a><br />
>
> I want to filter all the links into a list as:
> ['http://example.com/2013/01/problem1.html', 'http://example.com/2013/02/problem2.html']
>
> How to do it using Python Regular Expression?
You can try, but regular expressions are not a reliable way
to parse HTML.
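If you do want to try it, here is a minimal sketch for the simple case in
your message. It assumes every link is written exactly like your sample,
with a double-quoted href; the html string and the about.php link are
made up for illustration.

```python
import re

# Made-up sample input in the same shape as the line you posted.
html = '''<a href="http://example.com/2013/01/problem1.html">Problem No-1</a><br />
<a href="http://example.com/2013/02/problem2.html">Problem No-2</a><br />
<a href="http://example.com/about.php">About</a>'''

# Capture a double-quoted href value only if it ends in .html
pattern = re.compile(r'href="([^"]+\.html)"')

links = pattern.findall(html)
print(links)
```

This picks up the two .html links and skips about.php, but it will miss
single-quoted or unquoted attributes, different spacing, and upper-case
HREF, which is why I would not rely on it.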
You are much better off using a dedicated HTML parser, such
as html.parser in the standard library (called HTMLParser in
Python 2; the older htmllib was removed in Python 3), or
a third-party tool like BeautifulSoup.
These recognise the different tag types and separate the tags
and their attributes from the text content for you. You can then
just ask the parser to find the <a ...> tags and fetch the href
attribute from each one.
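As a rough sketch of the parser approach, using Python 3's standard
html.parser module: the names LinkCollector, html_links and sample are
my own, chosen for illustration, and the sample text is made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that end in .html (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(".html"):
                self.links.append(value)


def html_links(text):
    """Return all .html link targets found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


# Made-up sample: note the upper-case tag and single quotes on line two.
sample = """
<a href="http://example.com/2013/01/problem1.html">Problem No-1</a><br />
<A HREF='http://example.com/2013/02/problem2.html'>Problem No-2</A><br />
<a href="http://example.com/about.php">About</a>
"""

print(html_links(sample))
```

The parser copes with the upper-case tag and the single quotes without
any extra work from you, which is exactly the kind of variation that
trips up a regular expression.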
HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos