[Mailman-Users] utf-8 subjects; extended "." regexp really necessary?

Tue Nov 24 21:53:10 EST 2015

Adrian Pepper writes:
 > (Mailman 2.1.12, some local mods, but not around topics...)
 > 
 >  I had a utf-8 subject I was having difficulty matching with a topic regexp.
 > 
 >  Eventually I concluded the subject still had newlines in it when it was
 >  matched against the regexp.  (That is, the continuation lines were not
 >  joined before matching.)  And "." would not match the newline character(s).

 >  Am I correct in my conclusion that .* won't match newline characters,
 >  but <space-chars><not-space-chars><linefeed><carriage-return> will?
 >  (And also, that that is the character class I created).

Yes.  Here are the docs for Python regular expressions as used in
Mailman: https://docs.python.org/2.7/library/re.html.

In general this problem would be addressed with the DOTALL flag:

    The special characters are:

    '.'
    (Dot.) In the default mode, this matches any character except a
    newline. If the DOTALL flag has been specified, this matches any
    character including a newline.

Note that the definition of "newline" here is exactly "\n".
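
As a quick check of the rule quoted above, here is a small sketch that
compares the default behaviour of "." with DOTALL and with the "[\s\S]"
class from the question.  The subject string is made up for illustration;
it is not the poster's actual header.

```python
import re

# Hypothetical folded subject: a line break followed by a space.
subject = "Weekly\n Status Report"

# Default mode: "." stops at "\n", so ".*" cannot span the fold.
default_match = re.search(r"Weekly.*Status", subject)

# DOTALL mode: "." also matches "\n".
dotall_match = re.search(r"Weekly.*Status", subject, re.DOTALL)

# "[\s\S]" matches any character in either mode; adding "\n\r" to the
# class is harmless but redundant, because "\s" already covers both.
class_match = re.search(r"Weekly[\s\S]*Status", subject)

# Only "\n" is special to "."; a bare "\r" is matched in default mode.
cr_match = re.search(r"a.b", "a\rb")

print(default_match is None)      # True
print(dotall_match is not None)   # True
print(class_match is not None)    # True
print(cr_match is not None)       # True
```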

However, in your case I think there's a simpler method.

 >  For production I might need to put [\s\S\n\r]* between every pair of
 >  characters after a reasonable point in the expression.  Unless I can
 >  enumerate the possibilities more precisely.  (Which will probably
 >  result in an even longer looking character class).

Well, actually what you need is just "\s*" (or perhaps "\s+" or
"(\s|_)+") wherever a space might occur in the topic regexp, I think.
Line folding can only occur at whitespace (breaking this rule would be
noticed by everybody, and so is not likely to go unfixed), and "\s"
already includes "\n".
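
A minimal sketch of that idea, assuming the subject is plain text apart
from the fold.  The helper name fold_tolerant and the sample subjects are
hypothetical, not anything from Mailman:

```python
import re

def fold_tolerant(phrase):
    """Hypothetical helper: join the words of a plain phrase with "\\s+",
    so any whitespace run (including a folded line break) matches."""
    return r"\s+".join(re.escape(word) for word in phrase.split())

pattern = fold_tolerant("Weekly Status Report")   # Weekly\s+Status\s+Report

unfolded = "Weekly Status Report"
folded = "Weekly\n Status\r\n\tReport"

matches_unfolded = re.search(pattern, unfolded) is not None
matches_folded = re.search(pattern, folded) is not None

# The "(\s|_)+" variant also accepts "_", which stands for a space
# inside a Q-encoded word.
underscore_pattern = r"Weekly(\s|_)+Status"
matches_underscore = re.search(underscore_pattern, "Weekly_Status") is not None
```

Both the unfolded and the folded sample match here, since "\s+" accepts
spaces, tabs, "\r" and "\n" alike.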

 >  Empirically I see  ?=\n =?utf-8?q?_ after "Weekly" and before "Ac".
 >  (And it seems the matching is done on the incoming subject, not the
 >  one formatted for resending, which, with my tag, and the utf-8
 >  of an incoming tag pushes the expression entirely onto the second
 >  line where I think the ".*" variant (or even [_ ]) would match.)

That would explain your observations, but I am not familiar with the
topic code.  I won't have time to look into it until the weekend, and
maybe not then, as $DAYJOB is piling up work on me.  Mark is on
vacation in Croatia, so you may have to wait a bit for a final answer
on that.  I'm sorry about that, but I think at least for now the "\s*"
band-aid will get you most of the way to where you want to go.
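
One caveat worth checking: in the "?=\n =?utf-8?q?_" sequence quoted
above, the encoded-word delimiters sit between the two words in the raw
header, so a pattern that allows only whitespace there may not match the
raw form.  The sketch below illustrates this with Python's standard
email.header module on a made-up subject.  It says nothing about what
Mailman's topic code actually does, which is still the open question.

```python
import re
from email.header import decode_header, make_header

# Made-up raw header value: two Q-encoded words split by a fold.
raw = "=?utf-8?q?Weekly?=\n =?utf-8?q?_Status?="

# Decoding joins the encoded words and turns "_" back into a space.
decoded = str(make_header(decode_header(raw)))

pattern = re.compile(r"Weekly\s*Status")

# Compare the same pattern against the raw and the decoded forms.
matches_raw = pattern.search(raw) is not None
matches_decoded = pattern.search(decoded) is not None
```

If the matching really is done on the raw header, the pattern would also
have to allow for the "?=" and "=?utf-8?q?" pieces around the fold.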