[Python-bugs-list] [Bug #116251] SRE miscompiles character class containing -

noreply@sourceforge.net
Fri, 6 Oct 2000 10:56:27 -0700


Bug #116251, was updated on 2000-Oct-06 09:12
Here is a current snapshot of the bug.

Project: Python
Category: Library
Status: Open
Resolution: None
Bug Group: None
Priority: 6
Summary: SRE miscompiles character class containing -

Details: (Found by Neil Schemenauer)  Consider this test program:

import re

p = re.compile('[\w]+')
m = p.match('laser_beam')
print m and m.span()

p = re.compile('[\w-]+')
m = p.match('laser_beam')
print m and m.span()

This prints (0, 10) and None. The second pattern only adds a literal - to the character class, so it should still match. Printing the code generated for the two patterns shows that they are compiled differently.

(Is there a disassembler for SRE byte code hiding somewhere? 
I'd have dug further if there was...)
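
For anyone re-running the report on a current interpreter, here is a minimal sketch of the same check in Python 3 syntax. The helper name span_of is made up for illustration and is not part of the original test program. It assumes an interpreter where this bug has been fixed, so both spans should come out the same.

```python
import re

def span_of(pattern, text):
    """Return the span of a match at the start of text, or None."""
    m = re.compile(pattern).match(text)
    return m and m.span()

# Raw strings keep \w from being read as a string escape.
plain = span_of(r'[\w]+', 'laser_beam')
with_hyphen = span_of(r'[\w-]+', 'laser_beam')

# On a fixed engine the trailing - is just one more literal in the class,
# so it cannot make the class match less than [\w] alone.
print(plain, with_hyphen)
```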

Follow-Ups:

Date: 2000-Oct-06 09:22
By: akuchling

Comment:
Found the .dump() method; it seems to me that the pattern is
being tokenized and compiled to a sequence all right.  

Incidentally, [\w+]+ matches correctly, even though
[\w-]+ doesn't.  
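
A rough way to make the same comparison on a current Python 3, where the .dump() method is not the usual entry point: compiling with the re.DEBUG flag prints the parsed pattern to standard output. The helper name dump_pattern is hypothetical, and the exact text printed varies between versions, so the sketch only captures it for side-by-side reading.

```python
import contextlib
import io
import re

def dump_pattern(pattern):
    """Return whatever re.DEBUG prints while compiling the pattern."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()

# Dump the pattern that matched and the one that did not, one after the other.
for pattern in (r'[\w+]+', r'[\w-]+'):
    print(pattern)
    print(dump_pattern(pattern))
```

In a correct parse, both classes should show the \w category followed by a single literal (+ or -), with no range entry.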
-------------------------------------------------------

Date: 2000-Oct-06 10:56
By: akuchling

Comment:
The bug is in sre_parse._parse(); it produces a bad 
parse tree when a category such as \w is followed by -.
Something like \w0- works fine.  Not sure what the fix is...
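
To illustrate the behaviour one would expect, here is a toy character-class parser. This is not the code in sre_parse._parse() and not a proposed patch; parse_class and its item tuples are made up for this sketch. The rule it assumes is that a - forms a range only when a plain literal sits on its left and another plain character follows; after a category such as \w, or at the end of the class, the - is an ordinary literal.

```python
def parse_class(body):
    """Parse the text between [ and ] into a list of (kind, value) items."""
    items = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '\\' and i + 1 < len(body):
            nxt = body[i + 1]
            # Only a handful of category escapes are recognised in this sketch.
            if nxt in 'wsdWSD':
                item = ('category', nxt)
            else:
                item = ('literal', nxt)
            i += 2
        else:
            item = ('literal', ch)
            i += 1
        # A '-' starts a range only after a literal and before a plain character.
        if (item[0] == 'literal' and body[i:i + 1] == '-'
                and i + 1 < len(body) and body[i + 1] != '\\'):
            items.append(('range', (item[1], body[i + 1])))
            i += 2
        else:
            items.append(item)
    return items

print(parse_class(r'\w-'))   # category followed by a literal '-'
print(parse_class(r'\w0-'))  # the case reported as working
print(parse_class('a-z'))    # an ordinary range
```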

-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=116251&group_id=5470