help with simple regular expression grouping with re

Bob Horvath bob at horvath.com
Tue May 11 03:41:51 EDT 1999


Tim Peters wrote:

> [Tim]
> > | import re
> > | pattern = re.compile(r"""
> > |     "           # match an open quote
> > |     (           # start a group so re.findall returns only this part
> > |         [^"]*?  # match shortest run of non-quote characters
> > |     )           # close the group
> > |     "           # and match the close quote
> > | """, re.VERBOSE)
> > |
> > | answer = re.findall(pattern, your_example)
> > | for field in answer:
> > |     print field
>
> [Dan Schmidt]
> > This works for a tricky reason, which people should be aware of.
>
> *All* regexps work for a tricky reason -- or, at least, the ones that
> actually do work <wink>.
>
> > I had just written the following response to your code:
> >
> >   Not that it's important, but technically, what you did was overkill.
> >   Because *? is non-greedy, it won't match any quote characters,
> >   because it will be happy to hand off the quote to the next element
> >   of the regexp, which does match it.
> >
> >   So "(.*?)" and "([^"]*)" both solve the problem; you don't need to
> >   disallow quotes _and_ match non-greedily.
> >
> > And then I decided to test it, just to make sure (replacing '[^"]'
> > with '.'), and... it failed.  Because '.' doesn't match newlines by
> > default.  When I added re.DOTALL to the options at the end, it worked
> > fine.
> >
> > Your example works because the character class [^"] (everything
> > but a double quote) happens to include newlines too.  (Actually, I
> > think you took the newlines out of the input string before you tested
> > it, so maybe you were just lucky).
>
> I tested it both ways, reported on one, and have no idea which way is
> correct:  every time CSV parsing comes up, the questioner is unable to
> define what (exactly) the rules are, and the appearance of line breaks in
> the original example could simply be an artifact of a transport or mailer
> breaking a long line.  In the face of the unknown, seemed better to be
> permissive.

Being the original poster....

My CSV data does not cross line boundaries, and does not contain quotes
within the fields (I had to check), though it probably could some day.  I'll
have to try it and see what it does.

The line crossing will never happen though.
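
A minimal sketch of that kind of check, written in Python 3 syntax rather than the style of the code quoted above. The sample records and variable names are invented for illustration and are not the data from the original question.

```python
import re

# The three candidate patterns discussed in this thread.
class_pattern = re.compile(r'"([^"]*)"')            # greedy, no quotes allowed inside
dot_pattern = re.compile(r'"(.*?)"')                # non-greedy, '.' stops at newlines
dotall_pattern = re.compile(r'"(.*?)"', re.DOTALL)  # non-greedy, '.' matches newlines

# Hypothetical records: one on a single line, one with a line break in a field.
one_line = '"alpha","beta gamma","delta"'
two_lines = '"alpha\nbeta","gamma"'

for name, pattern in [("[^\"]*", class_pattern),
                      (".*?", dot_pattern),
                      (".*? + DOTALL", dotall_pattern)]:
    print(name, pattern.findall(one_line), pattern.findall(two_lines))
```

On the single-line record all three patterns return the same fields. Only the record with the embedded line break separates them: without re.DOTALL the '.' version cannot cross the newline and pairs up the wrong quotes.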


>
>
> > So my new claim is that the following is the 'best' regexp, for my
> > personal definition of best (internal comments deleted):
> >
> > pattern = re.compile(r'"(.*?)"', re.VERBOSE | re.DOTALL)
>
> The original was indeed overkill, but for another reason <wink>:  it's also
> the case that whenever CSV parsing comes up, a later msg in the thread goes
> "oh! I forgot -- it can have *embedded* quotes too".  Writing it [^"] is
> anticipating a  step in how the regexp will need to be changed anyway to
> accommodate whichever escape convention they think they've
> reverse-engineered <0.1 wink>.
>
> Even without that prognostication, though, a greedy "([^"]*)" is (as Aahz
> said) likely to run faster than a non-greedy "(.*?)".  [^"]* is also more
> robust, in that it unconditionally forbids matching a double quote in the
> guts; what .*? matches depends on context, and will happily chew up double
> quotes too if the context requires it for the *context* to match.  In this
> particular regexp as a whole that won't happen, but under *modification*
> context-sensitive submatches are notoriously prone to surprises.
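
To make the point about modification concrete, here is a small sketch in Python 3 syntax. Appending `$` to ask for only the last field on a line is a hypothetical modification chosen for illustration, and the sample line is made up.

```python
import re

# Hypothetical sample line, invented for this example.
line = '"alpha","beta"'

# Same two sub-patterns as in the thread, each followed by an end-of-string anchor.
lazy_last = re.findall(r'"(.*?)"$', line)
class_last = re.findall(r'"([^"]*)"$', line)

print(lazy_last)   # .*? is forced to swallow the inner quotes so that $ can match
print(class_last)  # [^"]* can never contain a quote, so only the last field matches
```

With the anchor in place, the non-greedy version returns one long string spanning both fields, while the character-class version returns just the final field.
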
>
> In any case, I certainly didn't need to do both [^"] and *? in the original!
> My "best" would consist of removing the question mark <wink>.
>
> otoh-if-embedded-quotes-are-really-illegal-string.split-with-a-little-
>     post-processing-would-be-best-of-all-ly y'rs  - tim
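
A minimal sketch of the split-based alternative mentioned in the sign-off, assuming fields never contain commas or quotes. It uses the Python 3 `str.split` method rather than the `string` module, and the sample line and the helper name `split_fields` are invented for illustration.

```python
def split_fields(line):
    """Split a comma-separated line and strip the quotes around each field.

    Assumes no field contains a comma or a double quote.
    """
    return [field.strip().strip('"') for field in line.rstrip("\n").split(",")]


# Hypothetical sample line, not taken from the original question.
print(split_fields('"alpha","beta gamma","delta"\n'))
```

If embedded commas or quotes ever do show up, this approach breaks down and a real CSV parser is the safer choice.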




