utf-8 subjects; extended "." regexp really necessary?
(Mailman 2.1.12, some local mods, but not around topics...)
I had a utf-8 subject I was having difficulty matching with a topic regexp.
Eventually I concluded the subject still had newlines in it when it was matched against the regexp. (That is, the continuation lines were not joined before matching, and "." would not match the newline character(s).)
So, for test purposes...
Farmers[_ ]Weekly[\s\S\n\r]*Ac
seemed to match my particular test subject.
While the following did not.
Farmers[_ ]Weekly.*Ac
Am I correct in my conclusion that .* won't match newline characters, but <space-chars><not-space-chars><linefeed><carriage-return> will? (And also, that this is the character class I created.)
For production I might need to put [\s\S\n\r]* between every pair of characters after a reasonable point in the expression. Unless I can enumerate the possibilities more precisely. (Which will probably result in an even longer looking character class).
Empirically I see ?=\n =?utf-8?q?_ after "Weekly" and before "Ac". (And it seems the matching is done on the incoming subject, not the one formatted for resending, which, with my tag and the utf-8 of an incoming tag, pushes the expression entirely onto the second line, where I think the ".*" variant (or even [_ ]) would match.)
More generally, my question applies to any potentially long subject, but utf-8 subjects seem to get longer more easily.
There is no header-equivalent line in the body (it's mime anyway).
Adrian Pepper
Adrian Pepper writes:
> (Mailman 2.1.12, some local mods, but not around topics...)
> I had a utf-8 subject I was having difficulty matching with a topic regexp.
> Eventually I concluded the subject still had newlines in it when it was matched against the regexp. (That is, the continuation lines were not joined before matching, and "." would not match the newline character(s).)
> Am I correct in my conclusion that .* won't match newline characters, but <space-chars><not-space-chars><linefeed><carriage-return> will? (And also, that this is the character class I created.)
Yes. Here are the docs for Python regular expressions as used in Mailman: https://docs.python.org/2.7/library/re.html.
In general this problem would be addressed with the DOTALL flag:
The special characters are:
'.'
(Dot.) In the default mode, this matches any character except a
newline. If the DOTALL flag has been specified, this matches any
character including a newline.
Note that the definition of "newline" here is exactly "\n".
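Both points can be checked with a minimal sketch like the one below. It assumes Python 3's `re` module (`re.fullmatch` does not exist in the Python 2 that Mailman 2.1 runs on), and the sample characters are chosen only for illustration. It shows that "." rejects exactly "\n" while still accepting "\r", and that the class `[\s\S\n\r]` behaves the same as the shorter `[\s\S]`.

```python
import re

# In default mode "." rejects only "\n"; a carriage return is still matched.
dot_matches_lf = re.fullmatch(r".", "\n") is not None   # False
dot_matches_cr = re.fullmatch(r".", "\r") is not None   # True

# \s and \S are complements, so [\s\S] already accepts every character.
# Listing \n and \r again is harmless but redundant.
sample_chars = ["a", "_", " ", "\t", "\r", "\n", "\u00e9"]
long_class = re.compile(r"[\s\S\n\r]")
short_class = re.compile(r"[\s\S]")

long_ok = all(long_class.fullmatch(c) for c in sample_chars)
short_ok = all(short_class.fullmatch(c) for c in sample_chars)

print(dot_matches_lf, dot_matches_cr, long_ok, short_ok)
```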
However, in your case I think there's a simpler method.
> For production I might need to put [\s\S\n\r]* between every pair of characters after a reasonable point in the expression. Unless I can enumerate the possibilities more precisely. (Which will probably result in an even longer looking character class).
Well, actually what you need is just "\s*" (or perhaps "\s+" or "(\s|_)+") wherever a space might occur in the topic regexp, I think. Line folding can only occur at whitespace (breaking this rule would be noticed by everybody, and so is not likely to go unfixed), and "\s" already includes "\n".
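As a rough illustration of the "\s" approach, the sketch below runs a "(\s|_)+" pattern against two hypothetical subjects: a plain folded one, and a folded one that is still RFC 2047 encoded in the shape reported earlier in this thread. Both sample strings and the use of `re.search` are assumptions made for the example, not taken from Mailman's topic code. Under those assumptions the plain folded subject matches, but the encoded one does not, because the "?=" and "=?utf-8?q?" around the fold are not whitespace.

```python
import re

# Allow whitespace (including a folded newline) or "_" wherever a space may occur.
pattern = re.compile(r"Farmers(\s|_)+Weekly(\s|_)+Ac")

# Hypothetical subject folded at whitespace, with no encoded-words.
folded_plain = "Farmers Weekly\n Account update"

# Hypothetical subject that is still RFC 2047 encoded when matched.
folded_encoded = "=?utf-8?q?Farmers_Weekly?=\n =?utf-8?q?_Account_update?="

plain_match = pattern.search(folded_plain) is not None
encoded_match = pattern.search(folded_encoded) is not None

print(plain_match, encoded_match)
```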
> Empirically I see ?=\n =?utf-8?q?_ after "Weekly" and before "Ac". (And it seems the matching is done on the incoming subject, not the one formatted for resending, which, with my tag and the utf-8 of an incoming tag, pushes the expression entirely onto the second line, where I think the ".*" variant (or even [_ ]) would match.)
That would explain your observations, but I am not familiar with the topic code. I don't have time to address that until the weekend, and maybe not then as $DAYJOB is piling up work on me, and Mark is on vacation in Croatia, so you may have to wait a bit for a final answer on that. I'm sorry about that, but I think at least for now the "\s*" bandaid will get you most of the way to where you want to go.
On 11/24/2015 06:53 PM, Stephen J. Turnbull wrote:
> Adrian Pepper writes:
>> Am I correct in my conclusion that .* won't match newline characters, but <space-chars><not-space-chars><linefeed><carriage-return> will? (And also, that this is the character class I created.)
> Yes. Here are the docs for Python regular expressions as used in Mailman: https://docs.python.org/2.7/library/re.html.
> In general this problem would be addressed with the DOTALL flag:
> The special characters are:
> '.'
> (Dot.) In the default mode, this matches any character except a
> newline. If the DOTALL flag has been specified, this matches any
> character including a newline.
> Note that the definition of "newline" here is exactly "\n".
Note you can turn on DOTALL in the regexp itself, so while
Farmers[_ ]Weekly.*Ac
doesn't match,
(?s)Farmers[_ ]Weekly.*Ac
will (see docs referenced above).
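A small sketch of the difference, using a hypothetical folded and still-encoded subject (the sample string and the use of `re.search` are assumptions for illustration only). It compares the default mode with the `re.DOTALL` flag and with the inline `(?s)` form, which is the one that can be typed directly into a regexp field.

```python
import re

# Hypothetical raw subject: folded, and still RFC 2047 encoded.
raw_subject = "=?utf-8?q?Farmers_Weekly?=\n =?utf-8?q?_Account_update?="

# Default mode: ".*" stops at the newline, so "Ac" is never reached.
default_match = re.search(r"Farmers[_ ]Weekly.*Ac", raw_subject)

# DOTALL passed as a compile-time flag.
flag_match = re.search(r"Farmers[_ ]Weekly.*Ac", raw_subject, re.DOTALL)

# DOTALL switched on inside the pattern itself.
inline_match = re.search(r"(?s)Farmers[_ ]Weekly.*Ac", raw_subject)

print(default_match is None, flag_match is not None, inline_match is not None)
```

The inline flag is placed at the very start of the pattern. Recent Python 3 releases reject global inline flags that appear anywhere else.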
>> Empirically I see ?=\n =?utf-8?q?_ after "Weekly" and before "Ac". (And it seems the matching is done on the incoming subject, not the one formatted for resending, which, with my tag and the utf-8 of an incoming tag, pushes the expression entirely onto the second line, where I think the ".*" variant (or even [_ ]) would match.)
This is all a bug in not decoding RFC2047 encoded headers before matching. See <https://bugs.launchpad.net/mailman/+bug/891676> fixed in Mailman 2.1.15.
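To illustrate why decoding first makes the original pattern work, here is a minimal sketch using Python 3's standard `email.header` module. It is not the code from the Mailman fix, and the sample subject is hypothetical. Decoding the RFC 2047 encoded-words also drops the fold between them, so the plain ".*" pattern matches the decoded text.

```python
import re
from email.header import decode_header, make_header

# Hypothetical raw subject: two encoded-words separated by a fold.
raw_subject = "=?utf-8?q?Farmers_Weekly?=\n =?utf-8?q?_Account_r=C3=A9sum=C3=A9?="

# Decode the encoded-words; whitespace between adjacent encoded-words is dropped.
decoded = str(make_header(decode_header(raw_subject)))

topic_re = re.compile(r"Farmers[_ ]Weekly.*Ac")

raw_hit = topic_re.search(raw_subject) is not None
decoded_hit = topic_re.search(decoded) is not None

print(repr(decoded), raw_hit, decoded_hit)
```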
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan