[IronPython] differences in IronPython/CPython regular expressions?

Jeff Hardy jdhardy at gmail.com
Thu Jun 2 01:23:17 CEST 2011


On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen <janssen at parc.com> wrote:
> I have a large RE (223613 chars) that works fine in CPython 2.6, but

That's truly horrible, but I assume you have a good reason for it.

> seems to produce an endless loop in IronPython (see below).  I'm using
> Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7.  Anyone have
> pointers to the differences between them?  Is
> System::Text::RegularExpressions in .NET configurable in some fashion
> that might help?

First off, is there a reason you don't use re.IGNORECASE? That would
cut the regex in half, at least.
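
As a rough illustration of the saving (a minimal sketch only; the
three names are hypothetical stand-ins for the nltk lists, and the
middle-name part of the pattern is left out for brevity):

```python
import re

# Hypothetical stand-in for the nltk name lists in the quoted script.
names = ["John", "Mary", "Bill"]

# Quoted approach: every name appears twice (as-is and upper-cased).
doubled = "|".join(names + [n.upper() for n in names])

# With re.IGNORECASE each name only needs to appear once.
single = "|".join(names)

pattern = re.compile(
    r"^((?P<honorific>Mr|Ms|Mrs|Dr)\.? )?"
    r"(?P<firstname>" + single + r")"
    r" +(?P<lastname>[A-Z][A-Za-z]+)",
    re.MULTILINE | re.IGNORECASE)

m = pattern.match("MR. JOHN SMITH")
print(m.group("firstname"))
```

Note that re.IGNORECASE is looser than doubling the list: it also
accepts mixed or lower case, and [A-Z] in the last-name group will
match lower-case letters too, so this only fits if that is acceptable.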

For the most part, CPython and IronPython regexes should be fairly
compatible - IronPython takes the regex and massages it to work with
System.Text.RE, but the changes are pretty straightforward and small,
and I don't think the re you provided hits any of them. It's quite
possible that the Mono version of System.Text.RE can't handle the
expression; you could test this by saving the full regex and building
a small C# program that runs it. The regex template has a lot of
potential backtracking in it; are you sure it's not caught in a
pathological (exponential) case?
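
The "saving the full regex" step might look something like this
(a sketch under assumptions: the short pattern stands in for the real
PERSON_PATTERN, and the file name and location are made up):

```python
import io
import os
import re
import tempfile

# Hypothetical stand-in for the much larger PERSON_PATTERN.
pattern = re.compile(
    r"^(?P<firstname>John|Mary) +(?P<lastname>[A-Z][A-Za-z]+)",
    re.MULTILINE)

# The path is an assumption; any file the test program can read will do.
out_path = os.path.join(tempfile.gettempdir(), "person_pattern.txt")

# .pattern is the source string exactly as passed to re.compile().
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(u"%s" % pattern.pattern)

# Read it back to confirm the file holds the full, unmodified regex.
with io.open(out_path, encoding="utf-8") as f:
    saved = f.read()

print(len(saved))
```

The file holds the Python-syntax pattern, not whatever IronPython
hands to .NET, so a standalone test would need Python-only syntax
such as (?P<name>...) rewritten to the .NET form (?<name>...) first.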

Finally, is one ginormous regex really the best way to do this? Have
you tried other approaches?
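
One possible alternative, offered only as a sketch and not as a
drop-in replacement: match the overall shape with a small regex and
check the captured words against a set. FIRST_NAMES, SHAPE and
match_person are hypothetical names, and the three names stand in for
the nltk lists.

```python
import re

# Hypothetical stand-in for the nltk name lists, upper-cased once.
FIRST_NAMES = frozenset(n.upper() for n in ["John", "Mary", "Bill"])

# Small structural pattern: optional honorific, first word, optional
# middle word or initial, last name. Names are validated afterwards.
SHAPE = re.compile(
    r"^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"
    r"(?P<firstname>[A-Z][A-Za-z]*)"
    r"( (?P<middlename>[A-Z]\.|[A-Z][A-Za-z]*))?"
    r" +(?P<lastname>[A-Z][A-Za-z]+)",
    re.MULTILINE)

def match_person(text):
    """Return a match object if text starts with a plausible person name."""
    m = SHAPE.match(text)
    if m is None:
        return None
    first = m.group("firstname")
    # Accept a known first name or a single initial.
    if len(first) > 1 and first.upper() not in FIRST_NAMES:
        return None
    middle = m.group("middlename")
    # A middle word must be an initial ("X.") or a known name.
    if middle and not middle.endswith(".") and middle.upper() not in FIRST_NAMES:
        return None
    return m

print(match_person("Mr. John Smith").group("lastname"))
```

This keeps the regex tiny and makes the name check a set lookup, but
it does not behave identically to the big alternation (for example,
it rejects an unknown middle word instead of treating it as the last
name, and it only handles names made of plain letters).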

- Jeff

>
> I'm a .NET newbie.
>
> TIA,
>
> Bill
>
> --------------------------------------------------
> import sys, os, re
>
> try:
>    # we use the name lists in nltk to create person-name matching patterns
>    import nltk.data
> except ImportError:
>    sys.stderr.write("Can't import nltk; can't do name lists.\nSee http://www.nltk.org/.\n")
>    sys.exit(1)
> else:
>    __MALE_NAME_EXCLUDES = ("Hill",
>                          "Ave",
>                          )
>    __FEMALE_NAME_EXCLUDES = ()
>    __FEMALE_NAMES = [x for x in
>                      nltk.data.load("corpora/names/female.txt", format="raw").split("\n")
>                      if (x and (x not in __FEMALE_NAME_EXCLUDES))]
>    __FEMALE_NAMES += [x.upper() for x in __FEMALE_NAMES]
>    __MALE_NAMES = [x for x in
>                    nltk.data.load("corpora/names/male.txt", format="raw").split("\n")
>                    if (x and (x not in __MALE_NAME_EXCLUDES))]
>    __MALE_NAMES += [x.upper() for x in __MALE_NAMES]
>    __INITS = [chr(x) for x in range(ord('A'), ord('Z') + 1)]  # include 'Z'
>
> PERSON_PATTERN = re.compile(
>    "^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"         # honorific
>    "(?P<firstname>" +
>    "|".join(__FEMALE_NAMES + __MALE_NAMES + __INITS) + # first name
>    ")"
>    "( (?P<middlename>([A-Z]\.)|(" +
>    "|".join(__FEMALE_NAMES + __MALE_NAMES) +         # middle initial or name
>    ")))?"
>    " +(?P<lastname>[A-Z][A-Za-z]+)",             # space then last name
>    re.MULTILINE)
>
> print PERSON_PATTERN.match("Mr. John Smith")