split encloser

Jason Tiller jtiller at sjm.com
Thu Apr 3 19:45:12 EST 2003


Hi, "Aussie", :)

On 3 Apr 2003 aussie2010 at yahoo.com wrote:

> string.split() takes a delimiter and works fine as long as the
> delimiter isn't part of the data fields. But frequently they are.
> e.g. 'John Doe,135 South Main St.,#122, Springfield, Iowa' or
>       ' so long goodbye see ya'

> Because the fields can contain the delimiter in some cases, an
> encloser is usually used (typically "") to handle those fields.

> The above strings would be written:
>
> 'John Doe,"135 South Main St., #122", Springfield, Iowa'
>    and
> '"so long" goodbye "see ya"'

> I don't understand regular expressions but I was wondering if anyone
> that did knew of a way to get re.split() to handle "enclosers" as
> used above.

Hmm.  I am not yet a knowledgeable user of Python's regex features, but
I *do* know Perl's pretty well.  With Perl, you might split up your
fields like this:

my $data = "Tiller, Jason, \"39177 Sundale Dr., Apt. 12\", Fremont, CA";

while( $data =~ /\G(?:\"(.+?)\",? ?)|\G(.+?)(?:, ?|\z)/g ) {
   # Test with defined() so that a field such as "0" is not skipped.
   my $field = defined $1 ? $1 : $2;
   print "$field\n";
}

This says (essentially):
   Starting from the previous match point,
      Look for either:
         A series of characters surrounded by the "enclosers" or
         A series of characters up to the split character or the end
         of the data
      If found:
         Set our field to whichever of the matches succeeded
         Set the previous match point
      Else:
         Exit

The RE breaks down like this:
   /\G          # Anchored to the previous match location:
    (?:         # Don't capture the following group.
       \"       # Find an encloser
       (.+?)    # followed by a set of any characters (non-greedy)
       \"       # bounded by another encloser
       ,?[ ]?)  # and potentially a split character and a space
    |           # OR
    \G          # Again anchored at the previous match location
    (.+?)       # Find a set of any characters (non-greedy)
    (?:,[ ]?    # that ends with the split character and possibly a space
       |\z)     # or with the end of the data (the last field)
   /gx          # Do this until the match fails and ignore whitespace in
                # the pattern (so a literal space is written as [ ])

From what I was reading in a modern edition (2.0) of "Programming
Python," Python has adopted many of Perl's RE enhancements
(non-capturing subexpressions, non-greedy quantifiers, additional
anchors, etc.), so the pattern may work in Python without much effort.
I dunno.
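
For what it is worth, here is a minimal Python sketch of the same idea.
It assumes that pattern.match(string, pos) is an acceptable stand-in for
Perl's \G (Python's re module has no \G anchor), and it lets the last
field end at the end of the string, since that field has no trailing
comma.  The names FIELD and split_enclosed are made up for this example.

```python
import re

# Hypothetical names; a sketch, not a full CSV parser.
FIELD = re.compile(r'''
    "(.+?)"  ,?[ ]?          # quoted field, then optional split char and space
  | (.+?)  (?: ,[ ]? | $ )   # unquoted field up to the split char or the end
''', re.VERBOSE)


def split_enclosed(data):
    """Split data on commas, keeping double-quoted fields intact."""
    fields = []
    pos = 0
    while pos < len(data):
        # match() with a start position only matches at pos, like \G.
        m = FIELD.match(data, pos)
        if m is None:
            break
        # Exactly one of the two groups takes part in a successful match.
        quoted, plain = m.group(1), m.group(2)
        fields.append(quoted if quoted is not None else plain)
        pos = m.end()
    return fields


if __name__ == "__main__":
    line = 'Tiller, Jason, "39177 Sundale Dr., Apt. 12", Fremont, CA'
    for field in split_enclosed(line):
        print(field)
```

This sketch does not handle empty fields or an encloser embedded inside
a quoted field.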

Good luck!  I hope this pattern at least gives you a starting point
for implementing the split() in Python.
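
If the regex is only a means to an end, the standard library in current
Python releases may already cover both of your examples: the csv module
for the comma-separated case and shlex.split() for the
whitespace-separated one.  A rough sketch, assuming the encloser is
always a double quote; the two function names are hypothetical.

```python
import csv
import shlex


def split_csv_line(line):
    """Split one comma-separated line, honouring double-quoted fields."""
    # skipinitialspace=True drops the blank after each comma, so a quote
    # that follows ", " is still treated as an encloser.
    return next(csv.reader([line], skipinitialspace=True))


def split_shell_style(line):
    """Split on whitespace, keeping quoted runs together."""
    return shlex.split(line)


if __name__ == "__main__":
    print(split_csv_line('John Doe,"135 South Main St., #122", Springfield, Iowa'))
    print(split_shell_style('"so long" goodbye "see ya"'))
```

Note that shlex.split() follows shell-like rules, so apostrophes and
backslashes in the data are treated specially as well.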

---Jason
Sonos Handbell Ensemble





