[IronPython] differences in IronPython/CPython regular expressions?
jdhardy at gmail.com
Thu Jun 2 01:23:17 CEST 2011
On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen <janssen at parc.com> wrote:
> I have a large RE (223613 chars) that works fine in CPython 2.6, but
That's truly horrible, but I assume you have a good reason for it.
> seems to produce an endless loop in IronPython (see below). I'm using
> Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7. Anyone have
> pointers to the differences between them? Is
> System::Text::RegularExpressions in .NET configurable in some fashion
> that might help?
First off, is there a reason you don't use re.IGNORECASE? That would
cut the regex in half, at least.
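A rough sketch of what I mean is below. The three-name list is made up for illustration (the real lists come from nltk), and the pattern is trimmed to just the first-name part:

```python
import re

# Hypothetical stand-in for the nltk name lists used in the original script.
NAMES = ["John", "Mary", "Bill"]

# Original approach: every name appears twice, once as-is and once upper-cased.
doubled = "|".join(NAMES + [n.upper() for n in NAMES])

# With re.IGNORECASE the upper-cased copies are redundant.
single = "|".join(NAMES)

case_sensitive = re.compile("^(?P<firstname>" + doubled + ")$")
case_insensitive = re.compile("^(?P<firstname>" + single + ")$", re.IGNORECASE)
```

This is looser than the original: it also accepts mixed case such as "jOhN", and under re.IGNORECASE a class like [A-Z] matches lowercase letters too, so the last-name part of the pattern would need a second look.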
For the most part, CPython and IronPython regexes should be fairly
compatible - IronPython takes the regex and massages it to work with
System.Text.RE, but the changes are pretty straightforward and small,
and I don't think the re you provided hits any of them. It's quite
possible that the Mono version of System.Text.RE can't handle the
expression; you could test this by saving the full regex and building a
small C# program that runs it. The regex template has a lot of
potential backtracking in it; are you sure it's not caught in a
pathological (exponential) case?
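For the standalone test, something along these lines could dump the pattern to a file. This is only a sketch: dump_pattern and to_dotnet_syntax are hypothetical helper names, and the rewrite handles nothing beyond named groups, so it is not a copy of what IronPython actually does to the pattern.

```python
import re

def to_dotnet_syntax(pattern):
    """Rewrite Python-style named groups (?P<name>...) as (?<name>...).

    Assumption: this is the only change needed for the pattern in question.
    It does not handle (?P=name) backreferences or anything else.
    """
    return pattern.replace("(?P<", "(?<")

def dump_pattern(compiled, path):
    """Write the converted source text of a compiled regex to `path`."""
    text = to_dotnet_syntax(compiled.pattern)
    with open(path, "w") as f:
        f.write(text)
    return len(text)
```

Calling dump_pattern(PERSON_PATTERN, "person_pattern.txt") (the file name is arbitrary) gives you a file that a small C# program can read with File.ReadAllText and hand to new Regex(...), which takes IronPython out of the picture entirely.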
Finally, is one ginormous regex really the best way to do this? Have you
tried other approaches?
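One alternative, as a minimal sketch: match only the shape of the name with a small regex, then look the captured words up in a set. The name list, SHAPE, and match_person are all made up for illustration, and this is not an exact equivalent of the original pattern (it accepts names in any case, for one).

```python
import re

# Hypothetical stand-in for the nltk name lists; stored upper-cased for lookup.
FIRST_NAMES = frozenset(n.upper() for n in ["John", "Mary", "Bill"])

# Small structural pattern: optional honorific, first word,
# optional middle initial or word, then the last name.
SHAPE = re.compile(
    r"^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"
    r"(?P<firstname>[A-Za-z]+)"
    r"( (?P<middlename>[A-Z]\.|[A-Za-z]+))?"
    r" +(?P<lastname>[A-Z][A-Za-z]+)"
)

def match_person(text):
    """Return a dict of name parts, or None if text is not a known name."""
    m = SHAPE.match(text)
    if m is None:
        return None
    first = m.group("firstname")
    middle = m.group("middlename")
    # First name must be a known name or a single capital initial.
    is_initial = len(first) == 1 and first.isupper()
    if first.upper() not in FIRST_NAMES and not is_initial:
        return None
    # Middle name, if present, must be an initial like "Q." or a known name.
    if middle is not None and not middle.endswith("."):
        if middle.upper() not in FIRST_NAMES:
            return None
    return m.groupdict()
```

The regex stays a few hundred characters long regardless of how many names there are, and the set lookup is constant time, so there is no huge alternation for the engine to backtrack through.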
> I'm a .NET newbie.
> import sys, os, re
> # we use the name lists in nltk to create person-name matching patterns
> try:
>     import nltk.data
> except ImportError:
>     sys.stderr.write("Can't import nltk; can't do name lists.\nSee http://www.nltk.org/.\n")
> __MALE_NAME_EXCLUDES = ("Hill",
> __FEMALE_NAME_EXCLUDES = ()
> __FEMALE_NAMES = [x for x in
> nltk.data.load("corpora/names/female.txt", format="raw").split("\n")
> if (x and (x not in __FEMALE_NAME_EXCLUDES))]
> __FEMALE_NAMES += [x.upper() for x in __FEMALE_NAMES]
> __MALE_NAMES = [x for x in
> nltk.data.load("corpora/names/male.txt", format="raw").split("\n")
> if (x and (x not in __MALE_NAME_EXCLUDES))]
> __MALE_NAMES += [x.upper() for x in __MALE_NAMES]
> __INITS = [chr(x) for x in range(ord('A'), ord('Z'))]
> PERSON_PATTERN = re.compile(
> "^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?" # honorific
> "(?P<firstname>" +
> "|".join(__FEMALE_NAMES + __MALE_NAMES + __INITS) + # first name
> "( (?P<middlename>([A-Z]\.)|(" +
> "|".join(__FEMALE_NAMES + __MALE_NAMES) + # middle initial or name
> " +(?P<lastname>[A-Z][A-Za-z]+)", # space then last name
> print PERSON_PATTERN.match("Mr. John Smith")