[ python-Bugs-857676 ] RE engine internal error with LARGE RE:
scalability bug
SourceForge.net
noreply at sourceforge.net
Thu Dec 11 08:39:21 EST 2003
Bugs item #857676, was opened at 2003-12-10 17:21
Message generated for change (Comment added) made by effbot
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=857676&group_id=5470
Category: Regular Expressions
Group: Python 2.3
Status: Open
Resolution: None
>Priority: 3
Submitted By: Francisco Dellatorre Borges (fdborges)
Assigned to: Fredrik Lundh (effbot)
Summary: RE engine internal error with LARGE RE: scalability bug
Initial Comment:
I lot's of lines with the format:
(d+ at w[^|]+|)+
I'll call the last bit after the @ of /features/. I
need to delete some of these, so I have this code that
would produce a list matching what I would /not/ use,
pass it over re.escape and them build a re using a
concatenation of the list and delete this from the text
before actually doing any parsing.
Problem is: I have about 220.000 different features and
I need to delete some 200.000 different ones from my
files before doing something.
So I tried to use a list of the 20.000 I want and then
delete anything that matches the <not> of it:
#---------------
# ftrlist is the stuff I *want* to keep:
ftrlist = [re.escape(i) for i in ftrlist ]
re.compile(r'(?!(%s))' %( '|'.join(ftrlist)) )
#--------------
but when I apply it I get something like:
RuntimeError: internal error in regular expression engine
I tried the same thing but with a smaller number of
elements, say 1000 ftrlist[:1000], and then it worked.
So I guess there is a bug on the scalability of the re
engine when doing alternative searches.
Attached I'm sending a tar ball that reproduces this.
I'm gzipping it (hope sourceforge does not have a
problem with the resulting binary file).
----------------------------------------------------------------------
>Comment By: Fredrik Lundh (effbot)
Date: 2003-12-11 14:39
Message:
Logged In: YES
user_id=38376
The SRE engine is usually compiled with 16-bit opcodes,
which places a limit on the size of the compiled engine.
Changing that to 32 bits might solve your problem...
But in this case, I don't think one huge RE is really the best
tool for the task; it's as easy (and much more efficient) to
use an RE to *extract* the feature code, and use a
dictionary to keep track of the ones you want to keep.
(example: use re.sub("d+@(w[^|]+)|", callback, data) and let
the callback look m.group(1) up in the dictionary; return
m.group(0) if you want to keep the feature, "" if you don't).
Hope this helps!
</F>
----------------------------------------------------------------------
Comment By: Francisco Dellatorre Borges (fdborges)
Date: 2003-12-10 17:35
Message:
Logged In: YES
user_id=576554
Ok, I give up on loading the file the size is restricted as
one might have expected. If you want download the tar ball
that would reproduce the error from
www.let.rug.nl/~borges/ScalabilityREBug.tar.gz
----------------------------------------------------------------------
Comment By: Francisco Dellatorre Borges (fdborges)
Date: 2003-12-10 17:31
Message:
Logged In: YES
user_id=576554
apparently sf does not want me to load tar balls... or I'm
too dumb to do it.
----------------------------------------------------------------------
Comment By: Francisco Dellatorre Borges (fdborges)
Date: 2003-12-10 17:24
Message:
Logged In: YES
user_id=576554
Problem with the file name the first time. Here is the file.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=857676&group_id=5470
More information about the Python-bugs-list
mailing list