[Mailman-Developers] Quoting problem in 2.0

Georg Mischler schorsch@schorsch.com
Wed, 3 Jan 2001 18:12:04 -0500 (EST)


Barry A. Warsaw wrote:

> 
> >>>>> "GM" == Georg Mischler <schorsch@schorsch.com> writes:
> 
>     GM> Barry A. Warsaw wrote:
> 
>     >>  You're right it is a simple fix, see below.
> 
>     GM> While you're at it... The following is also a simple fix,
>     GM> which eliminates at least some of the "inexplicable" failures
>     GM> to create HTML archives:
> 
> This doesn't seem right.  From the referenced archive message, the
> change is supposed to add a hit for negative timezones, but that's
> /not/ what those last three \d's are trying to match.  They're trying
> to match a four-digit year.  It makes no sense to add an optional sign
> to the year matching field.
 
As happens once in a while, I'm slightly confused now. My own
experiments demonstrated to me that this change removes the
problem, or at least that's what I think they demonstrated.

The following is a typical "From " line as I often encounter them:

From schorsch@schorsch.com Thu Jun 10 13:09:41 1999 -0400


As I understand it, this line matches the pattern as follows
(whitespace inserted for clarity):

 'From \s* \S+                   \s+ \w\w\w \s+ \w\w\w \s+ \d\d? \s+
 'From     schorsch@schorsch.com     Thu        Jun        10  
  
  \d\d?:\d\d(:\d\d)? (\s+ \S+)? \s+ [+-]? \d\d\d\d \s* $'
  13   :09   :41          1999      -     0400        '


I must admit that I'm not completely sure why the \S+ (non-whitespace)
that ends up matching the year is grouped together with the
preceding whitespace as "(\s+ \S+)?" (hmmm... the year is probably
optional?). But in any case, unless I'm missing something crucial,
the above interpretation agrees exactly with my experiments.

In my experience, the above real-life "From " header is matched
by the modified pattern, but not by the original one. Thinking
about it, it seems that the final group of four \d's matches *either*
the year *or* the time zone, depending on which of the two is
present. But even if this is the case, it shouldn't be a
problem (except for potentially matching a negative year...)
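
The behaviour described above can be checked with a short script. This
is only a sketch: the patterns are transcribed from this message with
the clarifying whitespace removed, and the "original" is assumed to be
the same pattern minus the optional sign. The names HEAD, ORIGINAL and
MODIFIED are made up for the illustration.

```python
import re

# Shared prefix of both patterns, up to the whitespace before the last field.
HEAD = r"From \s*\S+\s+\w\w\w\s+\w\w\w\s+\d\d?\s+\d\d?:\d\d(:\d\d)?(\s+\S+)?\s+"
ORIGINAL = re.compile(HEAD + r"\d\d\d\d\s*$")        # assumed original
MODIFIED = re.compile(HEAD + r"[+-]?\d\d\d\d\s*$")   # with optional sign

with_zone = "From schorsch@schorsch.com Thu Jun 10 13:09:41 1999 -0400"
without_zone = "From schorsch@schorsch.com Thu Jun 10 13:09:41 1999"

for line in (with_zone, without_zone):
    m = MODIFIED.match(line)
    print(repr(line))
    print("  original matches:", ORIGINAL.match(line) is not None)
    print("  modified matches:", m is not None)
    # Group 2 is the optional (\s+\S+)? field; None means it matched nothing.
    print("  optional group:  ", repr(m.group(2)) if m else None)
```

With the zone present, the optional group takes the year and the final
four digits take the zone; without it, the optional group stays empty
and the final four digits take the year.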

If it is actually the timezone that is optional, then the
grouping might rather be needed there instead of with the year.
Or is the pattern meant to match a line where the timezone
comes before the year? We'd need to allow for both possibilities
then. My suggestion to use the very robust parsedate_tz()
function from rfc822.py instead starts to make more and more
sense to me.
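
A rough sketch of what that could look like follows. It uses
email.utils.parsedate_tz(), the successor of the rfc822.py function in
later Python versions (rfc822 itself is gone in Python 3). The helper
name parse_from_line() is hypothetical and not part of Mailman.

```python
from email.utils import parsedate_tz

def parse_from_line(line):
    """Split a "From " separator line into (sender, date tuple), or None."""
    if not line.startswith("From "):
        return None
    # Everything after the sender address is treated as the date.
    parts = line[len("From "):].split(None, 1)
    if len(parts) != 2:
        return None
    sender, date_text = parts
    parsed = parsedate_tz(date_text)
    if parsed is None:
        return None
    return sender, parsed

sample = "From schorsch@schorsch.com Thu Jun 10 13:09:41 1999 -0400"
result = parse_from_line(sample)
if result is not None:
    sender, parsed = result
    # parsed[:6] is (year, month, day, hour, minute, second);
    # parsed[9] is the zone offset in seconds east of UTC.
    print(sender, tuple(parsed[:6]), parsed[9])
```

The regular expression would then only be needed to decide whether a
line starts a new message, not to validate every date field.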

Or am I hallucinating beyond repair here?


-schorsch

-- 
Georg Mischler  --  simulations developer  --  schorsch at schorsch.com
+schorsch.com+  --  lighting design tools  --  http://www.schorsch.com/