[ python-Bugs-857676 ] RE engine internal error with LARGE RE: scalability bug

SourceForge.net noreply at sourceforge.net
Thu Sep 9 10:04:42 CEST 2004


Bugs item #857676, was opened at 2003-12-10 17:21
Message generated for change (Comment added) made by effbot
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=857676&group_id=5470

Category: Regular Expressions
Group: Python 2.3
>Status: Closed
>Resolution: Wont Fix
Priority: 3
Submitted By: Francisco Dellatorre Borges (fdborges)
Assigned to: Fredrik Lundh (effbot)
Summary: RE engine internal error with LARGE RE: scalability bug

Initial Comment:
I lot's of lines with the format:

(d+ at w[^|]+|)+

I'll call the last bit after the @ of /features/. I
need to delete some of these,  so I have this code that
would produce a list matching what I would /not/ use,
pass it over re.escape and them  build a re using a
concatenation of the list and delete this from the text
before actually doing any parsing.

Problem is: I have about 220.000 different features and
I need to delete some 200.000 different ones from  my
files before doing something. 

So I tried to use a list of the 20.000 I want and then
delete anything that matches the <not> of it:

#---------------
# ftrlist is the stuff I *want* to keep: 
ftrlist = [re.escape(i) for i in ftrlist ]
re.compile(r'(?!(%s))'  %( '|'.join(ftrlist)) )
#--------------

but when I apply it I get something like:

RuntimeError: internal error in regular expression engine

I tried the same thing but with a smaller number of
elements, say 1000 ftrlist[:1000], and then it worked. 

So I guess there is a bug on the scalability of the re
engine when doing alternative searches.

Attached I'm sending a tar ball that reproduces this.
I'm gzipping it (hope sourceforge does not have a
problem with the resulting binary file).

----------------------------------------------------------------------

>Comment By: Fredrik Lundh (effbot)
Date: 2004-09-09 10:04

Message:
Logged In: YES 
user_id=38376

Alternative solution proposed; no response from submitter
in ten months.

----------------------------------------------------------------------

Comment By: Fredrik Lundh (effbot)
Date: 2003-12-11 14:39

Message:
Logged In: YES 
user_id=38376

The SRE engine is usually compiled with 16-bit opcodes, 
which places a limit on the size of the compiled engine.  
Changing that to 32 bits might solve your problem...

But in this case, I don't think one huge RE is really the best 
tool for the task; it's as easy (and much more efficient) to 
use an RE to *extract* the feature code, and use a 
dictionary to keep track of the ones you want to keep.

(example: use re.sub("d+@(w[^|]+)|", callback, data) and let 
the callback look m.group(1) up in the dictionary; return 
m.group(0) if you want to keep the feature, "" if you don't).

Hope this helps!

</F>


----------------------------------------------------------------------

Comment By: Francisco Dellatorre Borges (fdborges)
Date: 2003-12-10 17:35

Message:
Logged In: YES 
user_id=576554

Ok, I give up on loading the file the size is restricted as
one might have expected. If you want download the tar ball
that would reproduce the error from
www.let.rug.nl/~borges/ScalabilityREBug.tar.gz

----------------------------------------------------------------------

Comment By: Francisco Dellatorre Borges (fdborges)
Date: 2003-12-10 17:31

Message:
Logged In: YES 
user_id=576554

apparently sf does not want me to load tar balls... or I'm
too dumb to do it.

----------------------------------------------------------------------

Comment By: Francisco Dellatorre Borges (fdborges)
Date: 2003-12-10 17:24

Message:
Logged In: YES 
user_id=576554

Problem with the file name the first time. Here is the file.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=857676&group_id=5470


More information about the Python-bugs-list mailing list