help with simple regular expression grouping with re

Tim Peters tim_one at email.msn.com
Tue May 11 01:26:23 EDT 1999


[Tim]
> | import re
> | pattern = re.compile(r"""
> |     "           # match an open quote
> |     (           # start a group so re.findall returns only this part
> |         [^"]*?  # match shortest run of non-quote characters
> |     )           # close the group
> |     "           # and match the close quote
> | """, re.VERBOSE)
> |
> | answer = re.findall(pattern, your_example)
> | for field in answer:
> |     print(field)

[Dan Schmidt]
> This works for a tricky reason, which people should be aware of.

*All* regexps work for a tricky reason -- or, at least, the ones that
actually do work <wink>.

> I had just written the following response to your code:
>
>   Not that it's important, but technically, what you did was overkill.
>   Because *? is non-greedy, it won't match any quote characters,
>   because it will be happy to hand off the quote to the next element
>   of the regexp, which does match it.
>
>   So "(.*?)" and "([^"]*)" both solve the problem; you don't need to
>   disallow quotes _and_ match non-greedily.
>
> And then I decided to test it, just to make sure (replacing '[^"]'
> with '.'), and... it failed.  Because '.' doesn't match newlines by
> default.  When I added re.DOTALL to the options at the end, it worked
> fine.
>
> Your example works because the character class [^"] (everything
> but a double quote) happens to include newlines too.  (Actually, I
> think you took the newlines out of the input string before you tested
> it, so maybe you were just lucky).

I tested it both ways, reported on one, and have no idea which way is
correct:  every time CSV parsing comes up, the questioner is unable to
define what (exactly) the rules are, and the appearance of line breaks in
the original example could simply be an artifact of a transport or mailer
breaking a long line.  In the face of the unknown, it seemed better to be
permissive.
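
A quick way to see the newline effect Dan describes. This is only a
sketch: the sample string is made up for illustration and does not come
from the original question. It uses nothing beyond re.findall and
re.DOTALL.

```python
import re

# Made-up sample: the second quoted field contains a line break.
sample = '"first",\n"second\nline"'

# '.' does not match a newline by default, so the second field is missed.
dot_default = re.findall(r'"(.*?)"', sample)

# With re.DOTALL, '.' matches newlines too.
dot_dotall = re.findall(r'"(.*?)"', sample, re.DOTALL)

# The negated class [^"] matches anything but a quote, newlines included.
negated = re.findall(r'"([^"]*)"', sample)

print(dot_default)  # ['first']
print(dot_dotall)   # ['first', 'second\nline']
print(negated)      # ['first', 'second\nline']
```

Whether a line break inside a field *should* match is exactly the part
nobody has specified; the sketch only shows which spelling is permissive.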

> So my new claim is that the following is the 'best' regexp, for my
> personal definition of best (internal comments deleted):
>
> pattern = re.compile(r'"(.*?)"', re.VERBOSE | re.DOTALL)

The original was indeed overkill, but for another reason <wink>:  it's also
the case that whenever CSV parsing comes up, a later msg in the thread goes
"oh! I forgot -- it can have *embedded* quotes too".  Writing it [^"] is
anticipating a step in how the regexp will need to be changed anyway to
accommodate whichever escape convention they think they've
reverse-engineered <0.1 wink>.

Even without that prognostication, though, a greedy "([^"]*)" is (as Aahz
said) likely to run faster than a non-greedy "(.*?)".  [^"]* is also more
robust, in that it unconditionally forbids matching a double quote in the
guts; what .*? matches depends on context, and will happily chew up double
quotes too if the context requires it for the *context* to match.  In this
particular regexp as a whole that won't happen, but under *modification*
context-sensitive submatches are notoriously prone to surprises.
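
To make the "under modification" point concrete, here is a small sketch.
The modification (requiring a comma right after the close quote) and the
sample text are both invented for illustration; neither appears in the
thread.

```python
import re

# Hypothetical modification: only accept a quoted field that is
# immediately followed by a comma.
lazy_dot = re.compile(r'"(.*?)",', re.DOTALL)
negated = re.compile(r'"([^"]*)",')

# Made-up input: the first quoted field is NOT followed by a comma.
sample = '"a" "b",'

# .*? keeps growing until the rest of the pattern can match, so it
# swallows the intervening quotes.
print(lazy_dot.findall(sample))  # ['a" "b']

# [^"]* can never cross a quote, so only the field that really is
# followed by a comma matches.
print(negated.findall(sample))   # ['b']
```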

In any case, I certainly didn't need to do both [^"] and *? in the original!
My "best" would consist of removing the question mark <wink>.

otoh-if-embedded-quotes-are-really-illegal-string.split-with-a-little-
    post-processing-would-be-best-of-all-ly y'rs  - tim
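
One possible reading of the split suggestion, as a sketch only. It
assumes embedded quotes really are illegal and that the quotes are
balanced, and it uses the str.split method rather than the string module
function. The sample data is made up.

```python
# Made-up sample with balanced quotes and no embedded quotes.
sample = '"first", "second", "third"'

# Splitting on the quote character alternates between text outside the
# quotes (even indices) and text inside the quotes (odd indices).
pieces = sample.split('"')

# The "little post-processing": keep only the odd-indexed pieces.
fields = pieces[1::2]
print(fields)  # ['first', 'second', 'third']
```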





