[ python-Bugs-1737127 ] re.findall hangs python completely
SourceForge.net
noreply at sourceforge.net
Mon Jun 18 19:23:38 CEST 2007
Bugs item #1737127, was opened at 2007-06-14 08:05
Message generated for change (Comment added) made by gregsmith
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1737127&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Regular Expressions
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Arno Bakker (abakker)
Assigned to: Gustavo Niemeyer (niemeyer)
Summary: re.findall hangs python completely
Initial Comment:
Running a re.findall() on 40 KB of HTML appears to hang python completely. It hogs the CPU (perhaps not unexpected) but other python threads do not continue and pressing Ctrl-C does not trigger a KeyboardInterrupt. Only a SIGQUIT (Ctrl-\) can kill it.
Attached is a small script to illustrate the problem, and the data file that causes it to hang. Using 40 KB of random data does let it get past the first findall. The script creates a Thread that should print out hashes continuously; however, as soon as the MainThread hits the findall, the printing stops.
Occurs on Python-2.4.4 (direct from www.python.org) and 2.5.1 (2.5.1-0ubuntu1 from Feisty)
----------------------------------------------------------------------
Comment By: Gregory Smith (gregsmith)
Date: 2007-06-18 13:23
Message:
Logged In: YES
user_id=292741
Originator: NO
First off, don't expect other threads to run during re execution.
Multi-threading in python is mainly to allow one thread to run while the
others are waiting for I/O or doing a time.sleep() or something specific
like that. Switching between runnable threads only occurs in the interpreter
loop.
There may be exceptions to allow switching during some really long core
operations (a mutex needs to be released and taken again) but it has to be
done under certain conditions so that threads won't mess each other's data
up.
So, on to the r.e.: first, try changing all the .*? to just .* and compare
the runtime. The ? is not redundant (it makes the * non-greedy), but either
form can leave the engine a very large number of permutations to try.
But I think your real trouble is all of these: img src=\"(.*?)\"
This allows the second " to match with anything at all between, including
any number of quoted strings.
Your combination of several of these may be causing the RE engine to
spend a huge amount of time looking at many different combinations for the
first few .*?, all of which fail by the time you get to the last one.
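A rough way to see this effect on a small scale is sketched below. The pattern and input are made up for illustration and are not taken from the attached script: several lazy groups are chained between quotes, followed by a literal ('!') that never occurs, so every combination has to be tried before the search fails. The helper name time_failed_search is also hypothetical. Timings vary by machine, so none are quoted here, but the time should climb steeply as n grows.

```python
import re
import time

# Hypothetical pattern: several lazy groups between quotes, then a literal
# ('!') that never occurs in the input, so every attempt ultimately fails.
LAZY = re.compile(r'"(.*?)"(.*?)"(.*?)"(.*?)"(.*?)!')


def time_failed_search(n):
    """Return (matches, seconds) for findall over a string of n quotes."""
    text = '"' * n
    start = time.perf_counter()
    matches = LAZY.findall(text)
    return matches, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the work grows polynomially with the input length.
    for n in (10, 20, 30):
        matches, seconds = time_failed_search(n)
        print(n, matches, "%.4f s" % seconds)
```

Every call returns an empty list; only the time taken to reach that answer changes with n.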
Try img src=\"([^"]*)\" instead; this will only match the pair of "
with no " in between.
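To illustrate the difference, here is a minimal sketch on a made-up HTML fragment (not the attached page.html). The trailing class="thumb" part stands in for whatever follows the src attribute in the real expression, and the names lazy and strict are just for this example.

```python
import re

# Made-up input: two img tags, only the second has class="thumb".
html = '<img src="a.png" alt="first"><img src="b.png" class="thumb">'

# Lazy version: (.*?) may run across quotes and tags if that is what it
# takes for the rest of the pattern to match.
lazy = re.compile(r'img src="(.*?)" class="thumb"')

# Strict version: ([^"]*) can never cross a quote, so the group is
# confined to a single attribute value.
strict = re.compile(r'img src="([^"]*)" class="thumb"')

print(lazy.findall(html))    # group spans from a.png into the second tag
print(strict.findall(html))  # group is just b.png
```

The strict form also gives the engine far fewer alternatives to explore when a match attempt fails, since each group can only end at the very next quote.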
Likewise, in .*?> the .* will match any number of '>' chars if this is
needed to make the whole thing match, which is probably not what you want.
You might get it to work just by turning off 'greedy' matching for '*'.
----------------------------------------------------------------------
Comment By: Arno Bakker (abakker)
Date: 2007-06-14 08:06
Message:
Logged In: YES
user_id=216477
Originator: YES
File Added: page.html
----------------------------------------------------------------------