[Mailman-Users] utf-8 subjects; extended "." regexp really necessary?
Stephen J. Turnbull
stephen at xemacs.org
Tue Nov 24 21:53:10 EST 2015
Adrian Pepper writes:
> (Mailman 2.1.12, some local mods, but not around topics...)
>
> I had a utf-8 subject I was having difficulty matching with a topic regexp.
>
> Eventually I concluded the subject still had newlines in it when it was
> matched against the regexp. (That is, the continuation lines were not
> joined before matching.) And "." would not match the newline character(s).
> Am I correct in my conclusion that .* won't match newline characters,
> but <space-chars><not-space-chars><linefeed><carriage-return> will?
> (And also, that that is the character class I created).
Yes. Here are the docs for Python regular expressions as used in
Mailman: https://docs.python.org/2.7/library/re.html.
In general this problem would be addressed with the DOTALL flag:
    The special characters are:
    '.'
        (Dot.) In the default mode, this matches any character except a
        newline. If the DOTALL flag has been specified, this matches any
        character including a newline.
Note that the definition of "newline" here is exactly "\n".
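To illustrate the difference, here is a minimal sketch. The subject string is a made-up example of a folded header, not your actual message, and whether the topic matcher lets you set flags (or honours the inline "(?s)" form) is something I have not checked.

```python
import re

# Hypothetical folded subject: the continuation line starts after "\n".
subject = "Weekly\n Activity Report"

# Default mode: "." does not match "\n", so ".*" cannot span the fold.
default_match = re.search(r"Weekly.*Activity", subject)

# DOTALL mode: "." also matches "\n", so the same pattern spans the fold.
dotall_match = re.search(r"Weekly.*Activity", subject, re.DOTALL)

# The inline flag "(?s)" turns on DOTALL from inside the pattern string.
inline_match = re.search(r"(?s)Weekly.*Activity", subject)

print(default_match is None)
print(dotall_match is not None)
print(inline_match is not None)
```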
However, in your case I think there's a simpler method.
> For production I might need to put [\s\S\n\r]* between every pair of
> characters after a reasonable point in the expression. Unless I can
> enumerate the possibilities more precisely. (Which will probably
> result in an even longer looking character class).
Well, actually what you need is just "\s*" (or perhaps "\s+" or
"(\s|_)+") wherever a space might occur in the topic regexp, I think.
Line folding can only occur at whitespace (breaking this rule would be
noticed by everybody, and so is not likely to go unfixed), and "\s"
already includes "\n".
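A small sketch of that idea follows. The subjects are invented for illustration, and the patterns are applied with plain re.search, which may not be exactly how the topic code applies them.

```python
import re

# Hypothetical subjects: plain, folded after "Weekly", and with "_" for spaces.
unfolded = "Weekly Activity Report"
folded = "Weekly\n Activity Report"
underscored = "Weekly_Activity_Report"

# A literal space matches only the unfolded form.
literal = re.compile(r"Weekly Activity")

# "\s+" matches any run of whitespace, including "\n" followed by a space.
flexible = re.compile(r"Weekly\s+Activity")

# "(\s|_)+" additionally accepts "_" where a space might appear.
with_underscore = re.compile(r"Weekly(\s|_)+Activity")

for name, text in [("unfolded", unfolded), ("folded", folded),
                   ("underscored", underscored)]:
    print(name,
          bool(literal.search(text)),
          bool(flexible.search(text)),
          bool(with_underscore.search(text)))
```

The "(\s|_)+" form is the most permissive of the three, so it is the one to try if underscores may appear where spaces would be.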
> Empirically I see ?=\n =?utf-8?q?_ after "Weekly" and before "Ac".
> (And it seems the matching is done on the incoming subject, not the
> one formatted for resending, which, with my tag, and the utf-8
> of an incoming tag pushes the expression entirely onto the second
> line where I think the ".*" variant (or even [_ ]) would match.)
That would explain your observations, but I am not familiar with the
topic code. I don't have time to look into it until the weekend, and
maybe not even then, as $DAYJOB is piling up work on me. Mark is on
vacation in Croatia, so you may have to wait a bit for a final answer.
I'm sorry about that, but I think at least for now the "\s*" bandaid
will get you most of the way to where you want to go.