[2.5] Regex doesn't support MULTILINE?
nospam at nospam.com
Sun Jul 22 06:56:32 CEST 2007
On Sat, 21 Jul 2007 22:18:56 -0400, Carsten Haese
<carsten at uniqsys.com> wrote:
>That's your problem right there. RE is not the right tool for that job.
>Use an actual HTML parser such as BeautifulSoup
Thanks a lot for the tip. I tried it, and it does look interesting,
although I've been unsuccessful using a regex with BeautifulSoup to
find all occurrences of the pattern.
Incidentally, as far as using re alone is concerned, it appears that
re.MULTILINE isn't enough to get "." to match newlines: re.DOTALL
must be added.
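
To illustrate the difference between the two flags, here is a minimal
sketch. The sample string is made up for illustration and is not taken
from the real pages:

```python
import re

# Made-up sample: the closing tag is on a different line than the opening one.
text = "<span class='defaut'>12:34\nsome text</span>"
pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"

# re.MULTILINE only changes where ^ and $ match; "." still stops at "\n".
multiline_only = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)

# re.DOTALL lets "." match "\n" too, so the match can span several lines.
with_dotall = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

print(multiline_only)
print(with_dotall)
```

With MULTILINE alone the list comes back empty, because ".*?" cannot
cross the newline; with DOTALL the time stamp is captured.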
Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with
both files holding similar contents.
Why such a huge difference in performance?
========= Using Re =============
import re
import time

# Raw string so that \d reaches the regex engine unchanged.
pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"
pages = ["500KB.html", "1MB.html"]

# Veeeeeeeeeeery slow when parsing the 1MB file!
p = re.compile(pattern, re.IGNORECASE | re.MULTILINE | re.DOTALL)
#p = re.compile(pattern, re.IGNORECASE | re.MULTILINE)

for page in pages:
    f = open(page, "r")
    response = f.read()
    start = time.strftime("%H:%M:%S", time.localtime(time.time()))
    print "before findall @ " + start
    packed = p.findall(response)
    for item in packed: