[Tutor] Regexp question
Magnus Lyckå
magnus@thinkware.se
Mon May 19 11:37:01 2003
At 17:30 2003-05-19 +0300, Ovidiu Bivolaru wrote:
>Hi all,
>
>I'm trying to parse some HTML forms to get the values from "name" and
>"value" attributes and then to add them in a list. I'm encountering a
>problem with the regular expressions and I can't figure out why the
>expression is invalid.
>
>Bellow is the code that I'm using:
>    for lines in buffer.readlines():
>        print lines
>        regexp = 'value="(.*)?"\s*name\s*=\s*"(.*:\d+:\d+)?"'
In general you need to double your backslashes or use raw strings,
or Python will try to interpret the backslashes itself instead of
passing them on to the re module. In this particular case I don't
think it matters, because \s and \d are not escape sequences in
Python string literals and so pass through unchanged, but in
general, using raw strings for re's is best.
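To see where it does matter, here is a small sketch using \b, which Python does interpret (as a backspace character) in a normal string. The sample text and the variable names are just made up for illustration:

```python
import re

text = "a word here"

# Normal string: Python turns each \b into a single backspace character
# before the re module ever sees the pattern.
plain = '\bword\b'

# Raw string: the backslashes survive, so re sees \b (a word boundary).
raw = r'\bword\b'

print(len(plain), len(raw))      # 6 8
print(re.findall(plain, text))   # []
print(re.findall(raw, text))     # ['word']
```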
Also, and more importantly in this case, for non-greedy searches
the ? should come directly after the + or *, not after the ) which
ends the group. After the ) it only makes the whole group optional;
the '.*' inside is still greedy, and that is the nasty part! In
other words:
regexp = r'value="(.*?)"\s*name\s*=\s*"(.*?:\d+:\d+)"'
The ?'s that make the .*'s non-greedy become important as soon as
there is more than one possible match in the text:
>>> import re
>>> regexp = r'value="(.*)"\s*name\s*=\s*"(.*:\d+:\d+)"'
>>> x = 'value="x1" name="y1:1:2" value="x2" name="y2:1:2"'
>>> re.findall(regexp, x)
[('x1" name="y1:1:2" value="x2', 'y2:1:2')]
Oops! By default re's try to match as much as they can.
Add question marks!
>>> regexp = r'value="(.*?)"\s*name\s*=\s*"(.*?:\d+:\d+)"'
>>> re.findall(regexp, x)
[('x1', 'y1:1:2'), ('x2', 'y2:1:2')]
That's better!
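For comparison, a short sketch that runs the pattern from the original question (with the ? after the closing parenthesis) next to the corrected one, on the same test string. The variable names optional_group and non_greedy are only illustrative:

```python
import re

x = 'value="x1" name="y1:1:2" value="x2" name="y2:1:2"'

# ? after the ) makes the group optional; the .* inside stays greedy.
optional_group = r'value="(.*)?"\s*name\s*=\s*"(.*:\d+:\d+)?"'

# ? directly after the * makes the * itself non-greedy.
non_greedy = r'value="(.*?)"\s*name\s*=\s*"(.*?:\d+:\d+)"'

print(re.findall(optional_group, x))
# [('x1" name="y1:1:2" value="x2', 'y2:1:2')]
print(re.findall(non_greedy, x))
# [('x1', 'y1:1:2'), ('x2', 'y2:1:2')]
```

So the original expression is valid, it just behaves like the greedy version.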
>are any other possibilities to parse the HTML using functions already
>implemented (i.e. HTMLPasrse module) ??
In general, using home-brewed re's to parse HTML is not a good
idea. You can use HTMLParser or sgmllib for HTML, or an XML
parser for XHTML.
Just imagine that someone writes 'name = "x:1:2" value = "y"' instead
of 'value = "y" name = "x:1:2"'. From an HTML point of view it's the
same thing, but for your re, it's not. Does your code handle line
breaks everywhere that HTML does? For HTML, VALUE= is the same as
value=. Is it for your re? Look at this as an alternative:
>>> import htmllib, formatter
>>> h = ('<html><div value = "T" name = "Guido"></div>'
...      '<div name = "Tim" Value = "G"></div></html>')
>>> class myHTMLParser(htmllib.HTMLParser):
...     def start_div(self, attrs):
...         print attrs
...
>>> p = myHTMLParser(formatter.NullFormatter())
>>> p.feed(h)
[('value', 'T'), ('name', 'Guido')]
[('name', 'Tim'), ('value', 'G')]
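Note that htmllib only exists in Python 2. As a rough sketch of the same idea for Python 3, assuming the standard html.parser module is an acceptable stand-in, something like this should work. The class name DivAttrCollector and its found list are made up for illustration:

```python
from html.parser import HTMLParser


class DivAttrCollector(HTMLParser):
    """Collect the attributes of every <div> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases
        # tag and attribute names, so Value= and value= look the same here.
        if tag == "div":
            self.found.append(dict(attrs))


h = ('<html><div value = "T" name = "Guido"></div>'
     '<div name = "Tim"\nValue = "G"></div></html>')

p = DivAttrCollector()
p.feed(h)
p.close()
for attrs in p.found:
    print(attrs["name"], attrs["value"])
```

Attribute order, upper/lower case and the line break inside the second tag are all handled by the parser, which is the point of not doing this with a re.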
You mean that you are parsing the actual HTML forms, not the data
passed as the form is processed, right? If you work with a CGI
script you should obviously be using the cgi module, and not parse
HTML at all.
--
Magnus Lycka (It's really Lyckå), magnus@thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The shortest path from thought to working program