[Tutor] re module

Danny Yoo dyoo at hashcollision.org
Thu Aug 14 23:55:03 CEST 2014


Hi Sunil,

Don't use regular expressions for this task.  Use something that knows
about HTML structure.  As others have noted, the Beautiful Soup or
lxml libraries are probably a much better choice here.

There are good reasons to avoid regexp for the task you're trying to
do.  For example, your regular expression:

     "<span style=\"(.*)\"

does not respect the string boundaries of attributes.  You may think
that ".*" matches just the content within a string attribute, but this is
not true.  Consider the following:

######################################################
>>> import re
>>> m = re.match("'(.*)'", "'quoted' text, but note how it's greedy!")
>>> m.group(1)
"quoted' text, but note how it"
######################################################

and note how the match doesn't limit itself to "quoted", but goes as
far as it can: all the way to the apostrophe in "it's".

This shows at least one of the problems that you're going to run into.
Fixing this so it doesn't grab so much is doable, of course.  But
there are other issues, all of which are little headaches upon
headaches.  (e.g. Attribute values may be single or double quoted, may
use HTML entity references, etc.)
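
As a rough sketch of what "doable" looks like, and of where it stops
helping: a non-greedy ".*?" or a negated character class stops at the
first closing quote, but a pattern written for one kind of quote finds
nothing once the markup uses the other kind.  The sample strings below
are made up for illustration; they are not from your page.

```python
import re

text = "'quoted' text, but note how it's greedy!"

# Non-greedy: stop at the first closing quote instead of the last.
lazy = re.match(r"'(.*?)'", text).group(1)

# Negated character class: never cross a quote character at all.
strict = re.match(r"'([^']*)'", text).group(1)

print(lazy)    # quoted
print(strict)  # quoted

# Still fragile: a pattern written for double quotes finds nothing
# in equally valid single-quoted markup.
pattern = re.compile(r'<span style="([^"]*)"')
double_quoted = '<span style="color: red">a</span>'
single_quoted = "<span style='color: red'>a</span>"
print(pattern.findall(double_quoted))  # ['color: red']
print(pattern.findall(single_quoted))  # []
```

Each such case can be patched, but every patch makes the pattern
longer and harder to trust.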

So don't try to parse HTML by hand.  Let a library do it for you.  For
example with Beautiful Soup:

    http://www.crummy.com/software/BeautifulSoup/bs4/doc/

the code should be as straightforward as:

###########################
from bs4 import BeautifulSoup
# stmt holds the HTML text to search.
soup = BeautifulSoup(stmt, 'html.parser')
for span in soup.find_all('span'):
    print(span.get('style'))
###########################

where you deal with the _structure_ of your document, rather than with
the low-level individual characters of that document.
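
If installing a third-party library is not an option, the same idea
can be sketched with the standard library's html.parser module.  This
is only an illustration, assuming Python 3; the class name
SpanStyleCollector and the sample markup are hypothetical, not from
your code.

```python
from html.parser import HTMLParser


class SpanStyleCollector(HTMLParser):
    """Collect the style attribute of every <span> start tag."""

    def __init__(self):
        super().__init__()
        self.styles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # stripped the quotes and decoded entity references.
        if tag == "span":
            self.styles.append(dict(attrs).get("style"))


# Hypothetical sample input, made up for illustration.
sample = (
    '<p><span style="color: red">a</span>'
    "<span style='font-family: &quot;Arial&quot;'>b</span>"
    "<span>c</span></p>"
)

collector = SpanStyleCollector()
collector.feed(sample)
collector.close()
print(collector.styles)  # ['color: red', 'font-family: "Arial"', None]
```

Single quotes, double quotes, entity references and a missing
attribute are all handled by the parser, with no pattern to maintain.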

