Perceived gap in re.split functionality

Chris Akre dont at spam.me
Tue Oct 10 00:04:18 EDT 2000


I wanted to split a string at the / character... easily done with split()
->  'foo / bar'  ->  ['foo', 'bar']
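
A minimal sketch of that first step, assuming a current Python 3 interpreter.
The helper name split_on_slash is made up for illustration.  Note that
split() on its own keeps the whitespace around each piece, so the pieces are
stripped here to get the result shown above.

```python
def split_on_slash(s):
    # str.split leaves the surrounding spaces in place, so strip each piece.
    return [part.strip() for part in s.split('/')]


print(split_on_slash('foo / bar'))
```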

Then I wanted to split at the / character when not encapsulated in quotes...
so I split() and then join()ed the bits with odd numbers of quotes.  Bad,
wrong, I know, but I am lazy.
->  '"kungfu/foo" / "barmaid/wench" '  ->  ["kungfu/foo", "barmaid/wench"]
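
The split-then-rejoin trick described above could look roughly like this in
Python 3.  This is only a sketch of the idea, not the code I actually used;
the name split_outside_quotes and its parameters are hypothetical.  A piece
with an odd number of quotes must have been cut inside a quoted run, so it is
glued back onto the next piece.

```python
def split_outside_quotes(s, sep='/', quote='"'):
    pieces = []
    pending = None  # fragment that still contains an unclosed quote
    for part in s.split(sep):
        if pending is not None:
            # Re-join with the separator that split() removed.
            part = pending + sep + part
            pending = None
        if part.count(quote) % 2:
            pending = part          # odd quote count: still inside quotes
        else:
            pieces.append(part.strip())
    if pending is not None:
        raise ValueError('unbalanced quote in %r' % s)
    return pieces


print(split_outside_quotes('"kungfu/foo" / "barmaid/wench" '))
```

The quote characters stay in the returned pieces; stripping them would be a
separate step.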

Then I wanted to put in parentheses... i.e. '"kung(fu/foo)" /
bar(maid/wench)' so I broke out re... and everything was good for a while.
Getting the right regular expression to match slashes (when not in quotes)
and (when not in parentheses that are not in quotes) was a little tricky,
but I got it.  But then I ran into the general case, with nested
parentheses, brackets, angle brackets, curly brackets, ad nauseam.

I was unable to write a regular expression that would split
->  '((i/o) tcp/ip "/)") / {[<</>/</>>]/}'
into
-> ['((i/o) tcp/ip "/)") ', ' {[<</>/</>>]/}']
and in doing so, throw an exception if it encountered '([)]' or some other
way to mismatch grouping symbols.
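
Arbitrarily deep nesting needs something that can count, and the re module
has no recursive patterns, so a stack is the usual tool.  Here is a minimal
Python 3 sketch of just the mismatch detection, assuming the four bracket
pairs above and the double quote as the only quoting character.  The names
PAIRS and check_balanced are hypothetical.

```python
PAIRS = {')': '(', ']': '[', '>': '<', '}': '{'}   # close: open


def check_balanced(s, pairs=PAIRS, quote='"'):
    stack = []          # open brackets seen so far, innermost last
    in_quote = False
    for pos, ch in enumerate(s):
        if ch == quote:
            in_quote = not in_quote
        elif in_quote:
            continue    # brackets inside quotes are plain text
        elif ch in pairs.values():
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack[-1] != pairs[ch]:
                raise ValueError('mismatched %r at position %d' % (ch, pos))
            stack.pop()
    if in_quote or stack:
        raise ValueError('unclosed group in %r' % s)
    return True


print(check_balanced('((i/o) tcp/ip "/)") / {[<</>/</>>]/}'))
```

With this sketch, check_balanced('([)]') raises ValueError at the ')'.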

Anyway... I wrote a function to deal with the general case... it is a
grouping and delimiting aware split function.  I started mid-day today, and
it is to the point where I can use it for my purposes.  I wrote it for the
general case, but I use it for a specific case, so YMMV.  If there is anyone
out there who thinks that this is almost but not quite the function they
need, let me know.  For instance if you want to use '<-', '->' or '/*' '*/'
as grouping symbols, or you want to split on grouping symbols (i.e. have
groupedsplit('<H1><B>text</B></H1>', '<>') return ['<H1>', '<B>', '</B>',
'</H1>'])  I like writing things for the most general case possible.
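
For the non-nested case, the "split on grouping symbols" idea can be
approximated with re.findall.  This is a hedged Python 3 sketch, not part of
groupedsplit: the function name split_on_groups and its opener/closer
parameters are made up for illustration, and it does not handle nesting.

```python
import re


def split_on_groups(s, opener='<', closer='>'):
    # Non-greedy match from each opener to the nearest closer.
    pattern = re.escape(opener) + '.*?' + re.escape(closer)
    return re.findall(pattern, s)


print(split_on_groups('<H1><B>text</B></H1>'))
print(split_on_groups('a /* x */ b /* y */', '/*', '*/'))
```

Because the openers and closers are escaped and joined into a pattern,
multi-character symbols such as '/*' and '*/' work the same way as '<' and
'>'.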

Any idea what the right solution for this would be?  Write this up as a
module, and distribute it with Python?  Extend the re module syntax to be
able to deal with nested and overlapping grouping and delimiting symbols?
Or is the functionality already there... and I'm just not good enough at
crafting regular expressions?

class sig:
    author = 'Chris Akre'
    email  = reduce(lambda x,y: x + y*(not y in 'SPAM'),
                    'cpSa at mPail.nAova.oMrg', '')

Code follows-----------------------------------
#
# group aware split utility
#

PAIRGROUPERS = {')':'(', # PAIRED GROUPERS work like normal parentheses
                ']':'[', # and must be defined as a dictionary (close:open).
                '>':'<', # UNI GROUPERS use the same character to define the
                '}':'{'} # start and stop of a group.  LITERAL GROUPERS
UNIGROUPERS  = ()        # ignore all groupers until a matching literal
LITGROUPERS  = ('"',)    # grouper is found.  TERMINAL CHARACTERS stop
TERMCHARS    = (';',)    # examining the string when found, and can be used
                         # to indicate the remainder of the string should be
                         # ignored.
CHARNAMES    = {'(': 'open parenthesis',
                ')': 'close parenthesis',
                '[': 'open square bracket',
                ']': 'close square bracket',
                '<': 'open angle bracket',
                '>': 'close angle bracket',
                '{': 'open curly bracket',
                '}': 'close curly bracket',
                '"': 'double quote',
                "'": 'single quote',
                ';': 'semicolon' }
ERROR_GROUPEDSPLIT    = 'Grouped-split function error'
ERROR_GROUPEDSPLITtxt = 'Error %s at position %d in source string:\n%s'

def groupedsplit(s, sep=None, maxsplit=-1, pairgroupers = PAIRGROUPERS,
                 unigroupers = UNIGROUPERS, litgroupers = LITGROUPERS,
                 termchars = TERMCHARS):
    if sep is None: sep = ' \t\n\r\f\v'
    allgroupers = tuple(pairgroupers.values()) + unigroupers + litgroupers
    groupers = {}
    for c in allgroupers: groupers[c] = []
    splits = []
    start = 0
    try:
        for curpos in range(0, len(s)):
            if reduce(lambda x, y, groupers=groupers: x or \
                                         len(groupers[y]), litgroupers, 0):
                                    # Are we in a literal group?
                if s[curpos] in litgroupers and \
                                           (len(groupers[s[curpos]]) != 0):
                                    # Are we at the close of the literal
                                    # group?
                    groupers[s[curpos]].pop()
                                    # Clear the literal group state
            elif s[curpos] in pairgroupers.values():
                groupers[s[curpos]].append(map(lambda x, \
                         groupers=groupers: len(groupers[x]), allgroupers))
                                    # Store the current grouper state
            elif s[curpos] in pairgroupers.keys():
                try:
                    tup = groupers[pairgroupers[s[curpos]]].pop()
                    if tup != map(lambda x, groupers=groupers: \
                                            len(groupers[x]), allgroupers):
                                    # Retrieve and compare to stored
                                    # grouper state
                        for i in range(0, len(allgroupers)):
                            if len(groupers[allgroupers[i]]) != tup[i]:
                                raise ERROR_GROUPEDSPLIT, 'expected %s ' \
                                 'instead of %s' % (CHARNAMES[s[curpos]], \
                                                 CHARNAMES[allgroupers[i]])
                                    # The two literals are joined by
                                    # adjacency, not '+', so that % formats
                                    # the whole message.  Okay, the last
                                    # three statements are a little clunky,
                                    # but they get the job done.  I'm open
                                    # to suggestions on a cleaner way.
                        raise ERROR_GROUPEDSPLIT, 'undefined'
                                    # We should never reach that error
                except IndexError:
                    raise ERROR_GROUPEDSPLIT, '%s without starting %s' \
                                              % (CHARNAMES[s[curpos]], \
                                     CHARNAMES[pairgroupers[s[curpos]]])
            elif s[curpos] in unigroupers:
                if len(groupers[s[curpos]]) == 0:
                    groupers[s[curpos]].append(map(lambda x, \
                         groupers=groupers: len(groupers[x]), allgroupers))
                                    # Store the current grouper state
                else:
                    tup = groupers[s[curpos]].pop()
                    if tup != map(lambda x, groupers=groupers: \
                                            len(groupers[x]), allgroupers):
                                    # Retrieve and compare to stored
                                    # grouper state
                        for i in range(0, len(allgroupers)):
                            if len(groupers[allgroupers[i]]) != tup[i]:
                                raise ERROR_GROUPEDSPLIT, 'expected %s ' \
                                'instead of %s' % (CHARNAMES[s[curpos]], \
                                                CHARNAMES[allgroupers[i]])
                                    # Same clunky errorchecking
                        raise ERROR_GROUPEDSPLIT, 'undefined'
                                    # We should never reach that error
            elif s[curpos] in litgroupers:
                if len(groupers[s[curpos]]) == 0:
                    groupers[s[curpos]].append(1)
                else:
                    raise ERROR_GROUPEDSPLIT, 'undefined'
                                    # We should never reach that error
            elif s[curpos] in termchars:
                break
            elif (s[curpos] in sep) and (len(splits) != maxsplit) and \
                           not reduce(lambda x, y, groupers=groupers: x + \
                                         len(groupers[y]), allgroupers, 0):
                                    # Check that we are at a separator, not
                                    # in a group, and have not yet made
                                    # maxsplit splits (as in string.split)
                splits.append(s[start:curpos])
                start = curpos + 1
        for c in allgroupers:
            if len(groupers[c]) != 0:
                raise ERROR_GROUPEDSPLIT, 'unbalanced %s' % CHARNAMES[c]
    except ERROR_GROUPEDSPLIT, err:
        print ERROR_GROUPEDSPLITtxt % (err, curpos, s)
        raise
    else:
        splits.append(s[start:])
        return splits
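

The code above is written for the Python of its day (string exceptions, the
print statement, builtin reduce).  For anyone on Python 3, here is a compact
sketch of the same idea using a single stack.  It is an approximation, not a
port: it leaves out UNIGROUPERS and CHARNAMES, reports errors through a
made-up GroupedSplitError class, and treats maxsplit as a count of splits in
the manner of str.split.  The names grouped_split, PAIRS and
GroupedSplitError are assumptions of this sketch.

```python
PAIRS = {')': '(', ']': '[', '>': '<', '}': '{'}   # close: open


class GroupedSplitError(ValueError):
    """Raised for mismatched or unclosed grouping symbols."""


def grouped_split(s, sep=None, maxsplit=-1, pairs=PAIRS,
                  literals=('"',), termchars=(';',)):
    if sep is None:
        sep = ' \t\n\r\f\v'
    openers = set(pairs.values())
    stack = []        # open paired groupers, innermost last
    literal = None    # literal grouper we are currently inside, if any
    splits, start = [], 0
    for pos, ch in enumerate(s):
        if literal is not None:
            if ch == literal:          # close of the literal group
                literal = None
        elif ch in literals:
            literal = ch
        elif ch in openers:
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack[-1] != pairs[ch]:
                raise GroupedSplitError(
                    'unexpected %r at position %d in %r' % (ch, pos, s))
            stack.pop()
        elif ch in termchars:
            break                      # stop scanning; rest stays in last piece
        elif ch in sep and not stack and len(splits) != maxsplit:
            splits.append(s[start:pos])
            start = pos + 1
    if literal is not None or stack:
        raise GroupedSplitError('unbalanced grouping in %r' % s)
    splits.append(s[start:])
    return splits


print(grouped_split('((i/o) tcp/ip "/)") / {[<</>/</>>]/}', '/'))
```

A single stack is enough here because a closing symbol is only legal when it
matches the innermost open one, which is the '([)]' case from the top of this
post.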





