[Tutor] Regexp question

Magnus Lyckå magnus@thinkware.se
Mon May 19 11:37:01 2003


At 17:30 2003-05-19 +0300, Ovidiu Bivolaru wrote:
>Hi all,
>
>I'm trying to parse some HTML forms to get the values from "name" and
>"value" attributes and then to add them in a list. I'm encountering a
>problem with the regular expressions and I can't figure out why the
>expression is invalid.
>
>Below is the code that I'm using:
>     for lines in buffer.readlines():
>       print lines
>       regexp = 'value="(.*)?"\s*name\s*=\s*"(.*:\d+:\d+)?"'

In general you need to double your backslashes or use raw strings,
or Python will try to interpret the backslashes itself instead of
passing them to the re module. In this particular case I don't think
it matters, because \s and \d aren't escape sequences that Python
string literals recognize, so they get passed through unchanged, but
in general, using raw strings for re's is best.
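
A minimal sketch of a case where the difference does matter, using \b as the example (an escape not in the pattern above, chosen here because Python string literals do interpret it):

```python
import re

# In a normal string literal '\b' is a single backspace character (0x08).
# In a raw string it stays backslash + 'b', which re reads as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'

print(len(plain), len(raw))             # 5 7
print(re.findall(plain, 'cat concat'))  # []
print(re.findall(raw, 'cat concat'))    # ['cat']
```

The plain version silently searches for backspace characters, so it finds nothing and raises no error.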

Also, and more importantly in this case, for non-greedy searches,
the ? should come directly after the + or *; not after the ) which
ends the group. After the ), the ? only makes the whole group
optional, and the '.*' inside it stays greedy, which is the nasty
part! In other words:

regexp = r'value="(.*?)"\s*name\s*=\s*"(.*?:\d+:\d+)"'
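
A small sketch of what the position of the ? changes, assuming Python 3 and a made-up test string:

```python
import re

text = '"one" "two"'

# ? after the closing parenthesis: the group becomes optional,
# but the .* inside it is still greedy.
misplaced = re.search(r'"(.*)?"', text).group(1)

# ? directly after the *: the .* itself becomes non-greedy.
non_greedy = re.search(r'"(.*?)"', text).group(1)

print(repr(misplaced))   # 'one" "two'
print(repr(non_greedy))  # 'one'
```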

The ?'s that make the .*'s non-greedy become important as soon as
there is more than one possible match in the input:

 >>> import re
 >>> regexp = r'value="(.*)"\s*name\s*=\s*"(.*:\d+:\d+)"'
 >>> x = 'value="x1" name="y1:1:2" value="x2" name="y2:1:2"'
 >>> re.findall(regexp, x)
[('x1" name="y1:1:2" value="x2', 'y2:1:2')]

Oops! By default re's try to match as much as they can.
Add question marks!

 >>> regexp = r'value="(.*?)"\s*name\s*=\s*"(.*?:\d+:\d+)"'
 >>> re.findall(regexp, x)
[('x1', 'y1:1:2'), ('x2', 'y2:1:2')]

That's better!
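
Put back into the loop from the question, the corrected pattern might be used roughly like this. It is only a sketch, assuming Python 3: the io.StringIO object stands in for the poster's buffer, and the markup in it is made up for illustration.

```python
import io
import re

# Compile once, outside the loop, with the non-greedy pattern from above.
PATTERN = re.compile(r'value="(.*?)"\s*name\s*=\s*"(.*?:\d+:\d+)"')

# Hypothetical stand-in for the file-like object in the question.
buffer = io.StringIO(
    '<input value="x1" name="y1:1:2"> <input value="x2" name="y2:1:2">\n'
    '<input value="x3" name="y3:10:20">\n'
)

pairs = []
for line in buffer:
    # findall returns one (value, name) tuple per match on the line.
    pairs.extend(PATTERN.findall(line))

print(pairs)
# [('x1', 'y1:1:2'), ('x2', 'y2:1:2'), ('x3', 'y3:10:20')]
```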

>are any other possibilities to parse the HTML using functions already
>implemented (i.e. HTMLParser module) ??

In general, using home-brewed re's to parse HTML is not a good
idea. You can use HTMLParser or sgmllib for HTML, or an XML
parser for XHTML.

Just imagine that someone writes 'name = "x:1:2" value = "y"' instead
of 'value = "y" name = "x:1:2"'. From an HTML point of view it's the
same thing, but for your re, it's not. Does your code handle line
breaks everywhere that HTML does?  For HTML, VALUE= is the same as
value=. Is it for your re? Look at this as an alternative:

 >>> import htmllib, formatter
 >>> h = ('<html><div value = "T" name = "Guido"></div>'
...      '<div name = "Tim" Value = "G"></div></html>')
 >>> class myHTMLParser(htmllib.HTMLParser):
...     def start_div(self, attrs):
...             print attrs
...
 >>> p = myHTMLParser(formatter.NullFormatter())
 >>> p.feed(h)
[('value', 'T'), ('name', 'Guido')]
[('name', 'Tim'), ('value', 'G')]
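
The session above uses htmllib, which is not part of the Python 3 standard library. A rough equivalent with html.parser might look like the sketch below. The class name, the choice of input tags, and the sample markup are assumptions for illustration; the markup deliberately includes swapped attribute order, upper-case attribute names, and a line break inside a tag.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (name, value) pairs from every <input> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) tuples; the parser has
        # already lowercased the tag name and the attribute names.
        if tag == "input":
            attributes = dict(attrs)
            self.pairs.append((attributes.get("name"), attributes.get("value")))


# Hypothetical form markup.
form = '''<form>
<input value="T" name="Guido:1:2">
<input NAME="Tim:3:4" VALUE="G">
<input name="Barry:5:6"
       value="W">
</form>'''

collector = AttrCollector()
collector.feed(form)
collector.close()
print(collector.pairs)
# [('Guido:1:2', 'T'), ('Tim:3:4', 'G'), ('Barry:5:6', 'W')]
```

All three variants come out the same way, which is the point of letting a parser deal with the HTML rules.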

You mean that you are parsing the actual HTML forms, not the data
passed as the form is processed, right? If you work with a CGI
script you should obviously be using the cgi module, and not parse
HTML at all.
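
For readers on current Python: the cgi module was deprecated and then removed from the standard library in Python 3.13. A minimal sketch of the same idea, reading submitted form data rather than parsing HTML, using urllib.parse.parse_qs and assuming the data arrives as a URL-encoded string (the sample string is made up):

```python
from urllib.parse import parse_qs

# Hypothetical URL-encoded form submission, as a browser would send it.
submitted = 'name=Guido%3A1%3A2&value=some+text'

# parse_qs decodes %XX escapes and '+' and returns a dict that maps
# each field name to a list of its values.
fields = parse_qs(submitted)

print(fields)
# {'name': ['Guido:1:2'], 'value': ['some text']}
```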




--
Magnus Lycka (It's really Lyckå), magnus@thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The shortest path from thought to working program