On Mailman-Users, Mark Sapiro writes:
Further, in the ban_list (and many other places in Mailman) if an address is intended to be a regular expression pattern, it must begin with '^', so you really want
^.*@domain\.com$
to match any_address@domain.com.
I hope we haven't propagated this rather user-unfriendly interface (the convention of accepting both regexps and literals, distinguishing by "^" in column 0) to Mailman 3. Even as a Python programmer, I find Mark's post somewhat confusing: I would design filters using re.search, so that the above would actually be equivalent as a Python regular expression to r"@domain\.com$". OTOH, if the implementation uses re.match, the "^" is redundant, so I have a "say what?!" event.
If we have, I propose changing it to
Ban these addresses, one entry per line: [ ]
[ ] Entries are regular expressions.
or something like that. We also ought to have a "Python features for Mailman administrators" section of the FAQ, starting with "what is a regular expression", and giving examples of how to accomplish common tasks like banning a whole domain with regular expressions. Typical regexp FAQs are hard for non-programmers (and even beginning programmers) to grasp.
I don't have time to actually work on these now, but if there's uptake on the suggestion ("let's think about it" at +0 or above :-) I'll file issues.
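For the FAQ idea, a minimal sketch of the "ban a whole domain" example might look like the following. It only illustrates how Python's re.search and re.match differ on the two patterns discussed above; the helper names are made up for illustration, and nothing here is taken from Mailman's actual code.

```python
import re

FULL = r"^.*@domain\.com$"   # the pattern from Mark's post
SHORT = r"@domain\.com$"     # the shorter form that works with re.search


def found_by_search(pattern, address):
    # re.search looks for the pattern anywhere in the address.
    return re.search(pattern, address) is not None


def found_by_match(pattern, address):
    # re.match only tries the pattern at the start of the address,
    # so a leading "^" adds nothing.
    return re.match(pattern, address) is not None


for address in ("any_address@domain.com", "x@otherdomain.com"):
    print(address,
          found_by_search(FULL, address),
          found_by_search(SHORT, address),
          found_by_match(FULL, address),
          found_by_match(SHORT, address))
```

With re.search both patterns ban any_address@domain.com; with re.match only the pattern that starts with ".*" does, with or without the "^".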
On 02/26/2016 09:02 PM, Stephen J. Turnbull wrote:
On Mailman-Users, Mark Sapiro writes:
Further, in the ban_list (and many other places in Mailman) if an address is intended to be a regular expression pattern, it must begin with '^', so you really want
^.*@domain\.com$
to match any_address@domain.com.
I hope we haven't propagated this rather user-unfriendly interface (the convention of accepting both regexps and literals, distinguishing by "^" in column 0) to Mailman 3. Even as a Python programmer, I find Mark's post somewhat confusing: I would design filters using re.search, so that the above would actually be equivalent as a Python regular expression to r"@domain\.com$". OTOH, if the implementation uses re.match, the "^" is redundant, so I have a "say what?!" event.
I agree it's confusing, and I've been caught in this confusion myself and neglected to put the leading ^ in what I clearly intended to be a regexp, but the convention goes back a long way in MM2.
If we have, I propose changing it to
Ban these addresses, one entry per line: [ ]
[ ] Entries are regular expressions.
or something like that. We also ought to have a "Python features for Mailman administrators" section of the FAQ, starting with "what is a regular expression", and giving examples of how to accomplish common tasks like banning a whole domain with regular expressions. Typical regexp FAQs are hard for non-programmers (and even beginning programmers) to grasp.
I don't have time to actually work on these now, but if there's uptake on the suggestion ("let's think about it" at +0 or above :-) I'll file issues.
I'm not sure what the MM3 story is at this point, but +1 for Steve's idea.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
Mark Sapiro writes:
I agree it's confusing, and I've been caught in this confusion myself and neglected to put the leading ^ in what I clearly intended to be a regexp, but the convention goes back a long way in MM2.
Oh, of course I'm -1 on changing "regexps start with '^'" convention in Mailman 2 myself!
On Feb 27, 2016, at 02:02 PM, Stephen J. Turnbull wrote:
I hope we haven't propagated this rather user-unfriendly interface (the convention of accepting both regexps and literals, distinguishing by "^" in column 0) to Mailman 3.
Sadly, it's true.
Mostly this is historical since we've essentially just ported the data and code from Mailman 2. It was implemented this way because of the limitations for data modeling, and the unsophisticated web ui in MM2.
We could do better in MM3: we can model the data better, we can expose the distinction in REST, and Postorius could expose the difference in a much better web ui.
Here's a rough sketch of what you'd have to do in the core to make this change. As always merge requests are welcome!
IBan would need to have a flag which indicates whether the email is a literal address or a pattern. I don't think it's worth having two separate interfaces/models, but we might want to rename email to something more generic (pattern would be fine, with the understanding that is_regexp=False means the pattern is a literal). You'll need to change a bunch of checks and what-not in the ban management code.
This also shows up in AcceptableAliases, so a similar change would have to be made to IAcceptableAlias, the various model implementation bits of that interface, and the implicit_dest.py rule.
The REST API for these would probably need some additional work, but that can easily be done. The trickiest part would be if IBan.email is renamed, in which case you'd probably want to continue to accept the old data format for the 3.0 API (and translate it into the new model layer), but only accept the new data format in the 3.1 API. There are examples of how to do API-version differentiation.
It's still used in the *_these_nonmember checks (moderation.py rule), but as these are legacy facilities from Mailman 2, I'm not sure they need to change. Eventually, we want to remove these settings anyway, since all the functionality is implemented differently and better in MM3 already.
Another odd use of this is in the withlist subcommand.
(It's also used in the wsgiref/falcon plumbing layer, but since that's all internal implementation details, nothing here needs to change.)
You'd need to handle database migrations and documentation updates too, along with a robust test suite, but there's nothing intractable about any of this.
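To make the sketch above a little more concrete, here is a rough illustration of a ban entry with an explicit flag, plus a translation from the old "leading '^' iff regexp" string. Everything in it is hypothetical: the Ban class, the from_legacy helper, and the use of re.match are assumptions for illustration, not the real IBan interface or the core's model code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ban:
    """Hypothetical stand-in for a ban entry; not the real IBan."""
    pattern: str
    is_regexp: bool = False   # False means the pattern is a literal address

    def matches(self, address: str) -> bool:
        if self.is_regexp:
            # Assumption: patterns are tried at the start of the address.
            return re.match(self.pattern, address) is not None
        # Literal entries are compared as plain strings (case handling
        # is deliberately left out of this sketch).
        return self.pattern == address


def from_legacy(email: str) -> Ban:
    """Translate an old-style entry, where a leading '^' marks a regexp."""
    return Ban(pattern=email, is_regexp=email.startswith("^"))


domain_ban = from_legacy(r"^.*@domain\.com$")
literal_ban = from_legacy("foo+mailman@example.com")
print(domain_ban, domain_ban.matches("any_address@domain.com"))
print(literal_ban, literal_ban.matches("foo+mailman@example.com"))
```

A 3.0-style REST handler could call something like from_legacy on incoming data, while a 3.1-style handler would take the pattern and the flag as separate fields.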
Cheers, -Barry
Barry Warsaw writes:
IBan would need to have a flag which indicates whether the [...] (pattern would be fine, with the understanding that is_regexp=False means the pattern is a literal).
Are regexps sufficiently slow that *always* using a regexp would hurt performance?[1] The model I really had in mind was to always use regexps, and have a flag in the UI (Postorius) to regexp-quote when the user wants a literal.
Or we could continue to have the core representation be "leading '^' iff regexp", and once again have Postorius prepend "^.*" or whatever.
Footnotes: [1] XEmacs actually checks whether a regexp contains any regexp operators and automatically switches to a very fast literal search if not.
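The two alternatives above could be sketched roughly as follows, as UI-side helpers that turn a text entry plus an "is a regexp" checkbox into the string the core stores. Both function names are invented for illustration, and the second one assumes the core tries patterns with re.match; neither is Postorius code.

```python
import re


def to_always_regexp(entry, is_regexp):
    # Option 1: the core always stores a regexp, so literals get
    # regexp-quoted with re.escape before they are stored.
    return entry if is_regexp else re.escape(entry)


def to_caret_convention(entry, is_regexp):
    # Option 2: the core keeps "leading '^' iff regexp", and the UI
    # prepends "^.*" when the user ticks the box but omits the "^".
    if is_regexp and not entry.startswith("^"):
        return "^.*" + entry
    return entry


quoted = to_always_regexp("foo+mailman@example.com", False)
print(re.fullmatch(quoted, "foo+mailman@example.com") is not None)
print(to_caret_convention(r"@domain\.com$", True))
```

One wrinkle with the second option: a literal entry that happens to begin with "^" would still be read as a regexp by the core, so the UI would have to reject or special-case it.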
On Mar 01, 2016, at 04:37 AM, Stephen J. Turnbull wrote:
Are regexps sufficiently slow that *always* using a regexp would hurt performance?[1] The model I really had in mind was to always use regexps, and have a flag in the UI (Postorius) to regexp-quote when the user wants a literal.
I think it's less about performance and more about being explicit. My own sense is that literals are more common than regexps, and that in general regexps are more difficult to understand, but I don't have a lot of data points to back that up.
Or we could continue to have the core representation be "leading '^' iff regexp", and once again have Postorius prepend "^.*" or whatever.
In which case, the core's model wouldn't have to change, right?
I really want to avoid regexp-quoted strings for literals in the model. I'm fine if the core model doesn't change but Postorius makes things nicer for the user.
Cheers, -Barry
Barry Warsaw writes:
On Mar 01, 2016, at 04:37 AM, Stephen J. Turnbull wrote:
Or we could continue to have the core representation be "leading '^' iff regexp", and once again have Postorius prepend "^.*" or whatever.
In which case, the core's model wouldn't have to change, right?
That's the point, yes!
I really want to avoid regexp-quoted strings for literals in the model. I'm fine if the core model doesn't change but Postorius makes things nicer for the user.
OK on avoidance, you're the FLUFL after all! If Terri or Florian doesn't pipe up with total hate soon (== by the time I getta round tuit), I'll file a feature request.
Steve
On Tue, Mar 01, 2016 at 04:37:16AM +0900, Stephen J. Turnbull wrote:
Barry Warsaw writes:
IBan would need to have a flag which indicates whether the [...] (pattern would be fine, with the understanding that is_regexp=False means the pattern is a literal).
Are regexps sufficiently slow that *always* using a regexp would hurt performance?[1] The model I really had in mind was to always use regexps, and have a flag in the UI (Postorius) to regexp-quote when the user wants a literal.
Or could we meet user expectations (real users, not geeks), and just interpret * and ? (for example) as being regexp values, as well as letting power users use more complicated regexps?
Essentially the two classes:
Simples:
    *@mail.ru
    *@*mail.ru
    ?????@mail.ru
Power-user:
    ^.*\+.*?\d{3,}@
    \.*j\.*o\.*e\.*b\.*l\.*o\.*w\.*+.*@gmail\.com
and the sort we saw in the threads around bot subscriptions and regexps on Mailman-Users?
Off the top of my head, the syntax would define if it's an absolute address (foo@example.com) vs a regexp.
-- "I never make predictions. I never have, and I never will." -- Tony Blair
Adam McGreggor writes:
Or could we meet user expectations (real users, not geeks), [and allow glob syntax].
Definitely worth discussing, but my initial reaction is negative for the reasons discussed below.
Simples: *@mail.ru *@*mail.ru ?????@mail.ru
Are those anchored? At the beginning of string? At end? Is there really a use case for "?"? I don't see this as an obvious feature. Globs are also too blunt for the use case, especially since bad actors do deliberately use fine distinctions between well-known domains and their own sinkholes of depravity when phishing. Users are likely to be lazy, using "*@*mail.ru" to catch both "badactor@mail.ru" and "badactor@spamsource.mail.ru", trashing "niceguy@goodmail.ru"'s posts in the process.
Off the top of my head, the syntax would define if it's an absolute address (foo@example.com) vs a regexp.
"foo@example.com" is unambiguous, but "foo+mailman@example.com" is not. That's a big trap for users, who surely know exactly what they mean by that (and it's not foooooooooooomailman@example.com!)
In theory we could use globs as well (some of the modern VCSes permit glob or regexp syntax), but it's not a serious data loss issue for a VCS if a mistake is made. You just run the add command again with -f, or uncommit, or whatever. Granted, a perverse enough user could fail to add a file, commit, then overwrite the file, but this is much less serious than the possibility that a particular user would end up as collateral damage to a spam filter.
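Both traps are easy to reproduce. The sketch below uses Python's fnmatch module as a stand-in for "glob syntax"; that choice is an assumption (fnmatchcase happens to match against the whole string, which is only one possible answer to the anchoring question), and the addresses are the examples from this message.

```python
import re
from fnmatch import fnmatchcase

addresses = [
    "badactor@mail.ru",
    "badactor@spamsource.mail.ru",
    "niceguy@goodmail.ru",
]

# The lazy glob catches the innocent bystander along with the bad actors.
for address in addresses:
    print(address, fnmatchcase(address, "*@*mail.ru"))

# The narrower globs each catch only one of the intended targets.
only_domain = [a for a in addresses if fnmatchcase(a, "*@mail.ru")]
only_subdomains = [a for a in addresses if fnmatchcase(a, "*@*.mail.ru")]
print(only_domain, only_subdomains)

# Read as a regexp, "foo+mailman@example.com" means "f, o, one or more
# o's, mailman@example, any character, com", so it does not even match
# the address it was copied from.
as_regexp = "foo+mailman@example.com"
print(re.fullmatch(as_regexp, "foooomailman@example.com") is not None)
print(re.fullmatch(as_regexp, "foo+mailman@example.com") is not None)
```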
Steve
On Mar 01, 2016, at 11:13 PM, Stephen J. Turnbull wrote:
In theory we could use globs as well (some of the modern VCSes permit glob or regexp syntax), but it's not a serious data loss issue for a VCS if a mistake is made. You just run the add command again with -f, or uncommit, or whatever. Granted, a perverse enough user could fail to add a file, commit, then overwrite the file, but this is much less serious than the possibility that a particular user would end up as collateral damage to a spam filter.
globs make sense for file system operations, and we've been using them for decades in shells. I think globs make less sense for header value pattern matching.
Cheers, -Barry
On Tue, Mar 01, 2016 at 09:26:13AM -0500, Barry Warsaw wrote:
globs make sense for file system operations, and we've been using them for decades in shells. I think globs make less sense for header value pattern matching.
Looking at my sieve/procmail recipes, I rarely use globs (except in blacklisting), it seems.
In the blacklisting case, it's against words in Subject: lines, as well as Sender:/From: headers. I'd imagine (for those still using such things), that's a fairly common approach.
-- "Ink is handicapped, in a way, because you can blow up a man with gunpowder in half a second, while it may take twenty years to blow him up with a book. But the gunpowder destroys itself along with its victim, while a book can keep on exploding for centuries." -- Christopher Morley
On Tue, Mar 01, 2016 at 11:13:18PM +0900, Stephen J. Turnbull wrote:
Adam McGreggor writes:
Or could we meet user expectations (real users, not geeks), [and allow glob syntax].
Definitely worth discussing, but my initial reaction is negative for the reasons discussed below.
Simples: *@mail.ru *@*mail.ru ?????@mail.ru
Are those anchored? At the beginning of string? At end?
'throughout'.
Is there really a use case for "?"? I don't see this as an obvious feature.
I'd imagine there could be some use for people wanting, say, to handle five-character localparts of an address. Although it's an inelegant approach, it's something a user can understand without needing to understand regexps ("all our new subscriptions are five characters before the @ sign. I want to block them").
Globs are also too blunt for the use case, especially since bad actors do deliberately use fine distinctions between well-known domains and their own sinkholes of depravity when phishing.
True. (I was picking on mail.ru, as it's one of the common ones that I find quite irresponsible).
Users are likely to be lazy, using "*@*mail.ru" to catch both "badactor@mail.ru" and "badactor@spamsource.mail.ru", trashing "niceguy@goodmail.ru"'s posts in the process.
Are they going to use *@* necessarily, or just *@? (unless they want subdomains when "*@*.mail.ru" might be acceptable).
Off the top of my head, the syntax would define if it's an absolute address (foo@example.com) vs a regexp.
"foo@example.com" is unambiguous, but "foo+mailman@example.com" is not. That's a big trap for users, who surely know exactly what they mean by that (and it's not foooooooooooomailman@example.com!)
Agree.
-- "applying logic to English slang is never a sound idea" -- Stephen Fry
participants (4): Adam McGreggor, Barry Warsaw, Mark Sapiro, Stephen J. Turnbull