[ python-Bugs-1737127 ] re.findall hangs python completely

SourceForge.net noreply at sourceforge.net
Tue Jun 19 16:03:54 CEST 2007

Bugs item #1737127, was opened at 2007-06-14 14:05
Message generated for change (Comment added) made by abakker
You can respond by visiting: 

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Regular Expressions
Group: None
>Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Arno Bakker (abakker)
Assigned to: Gustavo Niemeyer (niemeyer)
Summary: re.findall hangs python completely

Initial Comment:

Running a re.findall() on 40 KB of HTML appears to hang python completely. It hogs the CPU (perhaps not unexpected) but other python threads do not continue and pressing Ctrl-C does not trigger a KeyboardInterrupt. Only a SIGQUIT (Ctrl-\) can kill it.

Attached is a small script to illustrate the problem, and the data file that causes it to hang. Using 40 KB of random data does let it get past the first findall. The script creates a Thread that should print out hashes continuously; however, as soon as the MainThread hits the findall the printing stops.

Occurs on Python-2.4.4 (direct from www.python.org) and 2.5.1 (2.5.1-0ubuntu1 from Feisty)


>Comment By: Arno Bakker (abakker)
Date: 2007-06-19 16:03

Logged In: YES 
Originator: YES

Is that GIL & searching problem reported separately somewhere? Otherwise
I'm hereby submitting that bug ;o)


Comment By: Georg Brandl (gbrandl)
Date: 2007-06-19 14:44

Logged In: YES 
Originator: NO

This is quite normal for regular expressions with a lot of backtracking
permutations to try, and a big string to search in.

You should try to optimize your REs. As for the threads, re doesn't
release the GIL while searching; that's another bug report.


Comment By: Gregory Smith (gregsmith)
Date: 2007-06-18 19:23

Logged In: YES 
Originator: NO

First off, don't expect other threads to run during re execution.
Multi-threading in python is mainly to allow one thread to run while the
others are waiting for I/O or doing a time.sleep() or something specific
like that. Switching between runnable threads only occurs in the interpreter.
There may be exceptions to allow switching during some really long core
operations (a mutex needs to be released and taken again) but it has to be
done under certain conditions so that threads won't mess up each other's data.

So, on to the r.e.: first, try changing all the .*? to just .*  -- the ?
is redundant and may be increasing the runtime by expanding the number of
permutations that are being tried.

But I think your real trouble is all of these :  img src=\"(.*?)\"
This allows the second " to match with anything at all between, including
any number of quoted strings.
 Your combination of several of these may be causing the RE engine to
spend a huge amount of time looking at many different combinations for the
first few .*?, all of which fail by the time you get to the last one.

Try   img src=\"([^"]*)\"  instead; this will only match the pair of "
with no " in between.
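
A small sketch of that difference, using a made-up two-tag snippet and a
hypothetical alt attribute rather than the attached page.html. When the
pattern cannot complete at the first tag, the lazy .*? stretches across
quotes until the rest of the pattern fits, while [^"]* fails there and the
search moves on to the next tag.

```python
import re

# Made-up input for illustration; not taken from the attached page.html.
html = '<img src="a.png"><img src="b.png" alt="B">'

# Lazy dot: the first group may run across closing quotes and tags.
lazy = re.findall(r'img src="(.*?)" alt="(.*?)"', html)

# Negated class: each group must stay inside a single pair of quotes.
strict = re.findall(r'img src="([^"]*)" alt="([^"]*)"', html)

print(lazy)    # [('a.png"><img src="b.png', 'B')]
print(strict)  # [('b.png', 'B')]
```

Every extra position the lazy group is allowed to try is also extra work
for the engine when several such groups are chained together.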

Likewise, in .*?> the .* will match any number of '>' chars if this is
needed to make the whole thing match, which is probably not what you want.
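
The same effect can be sketched for '>' with a made-up pair of links (the
pattern and input here are assumptions for illustration, not the ones from
the attached script). The lazy form swallows an intervening '>' to reach
the text it needs; [^>]* cannot leave the tag it started in.

```python
import re

# Made-up input for illustration.
text = '<a href="1">prev</a> <a href="2">next</a>'

# .*? is allowed to cross '>' if that is what makes the whole pattern match.
crossing = re.findall(r'<a (.*?)>next', text)

# [^>]* stops at the first '>', so only the tag directly before "next" matches.
bounded = re.findall(r'<a ([^>]*)>next', text)

print(crossing)  # ['href="1">prev</a> <a href="2"']
print(bounded)   # ['href="2"']
```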

You might get it to work just by turning off 'greedy' matching for '*'.
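
For reference, a minimal sketch of what greedy and non-greedy '*' capture
on a made-up tag (the input is an assumption, not part of the attached
data). In Python's re module the trailing ? is what selects non-greedy
matching, so the two forms can return different results.

```python
import re

# Made-up input for illustration.
tag = '<img src="a.png" alt="A">'

# Greedy: .* runs to the end of the string, then backs up to the last quote.
greedy = re.findall(r'src="(.*)"', tag)

# Non-greedy: .*? stops at the first quote that lets the pattern finish.
non_greedy = re.findall(r'src="(.*?)"', tag)

print(greedy)      # ['a.png" alt="A']
print(non_greedy)  # ['a.png']
```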


Comment By: Arno Bakker (abakker)
Date: 2007-06-14 14:06

Logged In: YES 
Originator: YES

File Added: page.html

