External subscriber lists in 2.1
Barry,
In 2.1 when used with external subscriber storage (eg SQL), will the new equivalent of qrunner request and load the entire subscriber DB onto the heap prior to broadcast?
<<Yeah, I'm cheating and being lazy on not checking the source myself>>
Reason: This poses scaling problems for lists with very large numbers of subscribers. I'd suggest paging thru the set in blocks.
ObExcuse: Chap on -users asking about millions of subscribers.
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.
"JCL" == J C Lawrence <claw@kanga.nu> writes:
JCL> In 2.1 when used with external subscriber storage (eg SQL),
JCL> will the new equivalent of qrunner request and load the
JCL> entire subscriber DB onto the heap prior to broadcast?
That's really up to the implementation of the MemberAdaptor interface for SQL (but fwiw, I'm not aware of such a beast). Mailman's CalcRecips module loops through all the member addresses in a list comprehension, but all other information is requested a member at a time.
Note that if it was too expensive for getRegularMemberKeys() to return an in-memory list, it could (if you use Python 2.2) return an iterator object that implemented things in a more efficient manner, e.g. by paging through blocks. I believe that any place where we expect a Python sequence (list) we could probably accept an iterator.
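As a rough illustration of the paging idea (not Mailman code), here is a generator that fetches member keys in fixed-size blocks. The `members` table, its `address` column, and the name `iter_member_keys` are hypothetical. The sketch is written for current Python 3 and uses the standard `sqlite3` module as a stand-in for whatever SQL backend an adaptor might talk to.

```python
import sqlite3


def iter_member_keys(conn, block_size=1000):
    """Yield member addresses in sorted order, one block at a time.

    Hypothetical schema: table `members` with a unique `address` column.
    At most `block_size` rows are fetched per query.
    """
    last = ""  # every non-empty address sorts after the empty string
    while True:
        rows = conn.execute(
            "SELECT address FROM members WHERE address > ? "
            "ORDER BY address LIMIT ?",
            (last, block_size),
        ).fetchall()
        if not rows:
            return
        for (address,) in rows:
            yield address
        last = rows[-1][0]  # resume after the last key already seen


# Demo data: 2500 made-up subscribers in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (address TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO members VALUES (?)",
    [("user%04d@example.com" % i,) for i in range(2500)],
)

keys = iter_member_keys(conn, block_size=1000)
first = next(keys)
total = 1 + sum(1 for _ in keys)
```

Resuming from the last key seen (rather than using a growing OFFSET) is one possible design choice; it keeps each query cheap on an indexed column, but it assumes the keys are unique and sortable.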
JCL> ObExcuse: Chap on -users asking about millions of
JCL> subscribers.
Cool! :)
-Barry
On Sun, 23 Dec 2001, Barry A. Warsaw wrote:
Note that if it was too expensive for getRegularMemberKeys() to return an in-memory list, it could (if you use Python 2.2) return an iterator object that implemented things in a more efficient manner, e.g. by paging through blocks. I believe that any place where we expect a Python sequence (list) we could probably accept an iterator.
Do you know if this is in fact the case? Is this how I should implement it?
(While I've hacked in python fairly regularly, I've not done serious Python development since ~1996, so some of these questions will be fairly basic "How's it work in Python2.2?" ones. Of course, this brings up the question of whether or not it's reasonable to require that--don't we currently only require 2.1.3?)
Below are different ways the results of "get{Regular,Digest}MemberKeys" and "getMembers" are used in various places in the codebase. Will each of them still work if those methods return an iterator instead of a list? (For example, does len(iterator) work?)
I'm guessing that a good answer might be that we want to add more general-purpose accessor methods in class MemberAdaptor (and have the rest of the codebase use those where appropriate) so that if there's a more efficient way for a specific backend to provide specific data, it is able to do so. I'll include the two methods I propose at the end of this message.
in Handlers/CalcRecips.py:

    # Calculate the regular recipients of the message
    recips = [mlist.getMemberCPAddress(m)
              for m in mlist.getRegularMemberKeys()
              if mlist.getDeliveryStatus(m) == ENABLED]

and:

    recips = mlist.getMemberCPAddresses(mlist.getRegularMemberKeys() +
                                        mlist.getDigestMemberKeys())

in Cgi/admin.py:

    if not mlist.nondigestable and mlist.getRegularMemberKeys():

and:

    if addr in mlist.getRegularMemberKeys():

and:

    # If there are more members than allowed by chunksize, then we split the
    # membership up alphabetically.  Otherwise just display them all.
    chunksz = mlist.admin_member_chunksize
    all = mlist.getMembers()
    all.sort(lambda x, y: cmp(x.lower(), y.lower()))

then:

    # BAW: There's got to be a more efficient way of doing this!
    names = [mlist.getMemberName(s) or '' for s in all]
    all = [a for n, a in zip(names, all)
           if cre.search(n) or cre.search(a)]

in HTMLFormatter.py:

    members = self.getRegularMemberKeys()
    for m in members:
        if not self.getMemberOption(m, conceal_sub):
            people.append(m)
    num_concealed = len(members) - len(people)

and:

    member_len = len(self.getRegularMemberKeys())
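To make the question concrete, the small experiment below tries the operations used in the snippets above on a generator instead of a list. It is only a sketch in current Python 3; `member_keys` and `attempt` are made-up helpers, not part of Mailman.

```python
def member_keys():
    """Stand-in for an adaptor method that returns an iterator."""
    yield "anne@example.com"
    yield "bob@example.com"


def attempt(operation):
    """Run operation() and return the exception type name, or 'ok'."""
    try:
        operation()
    except (TypeError, AttributeError) as exc:
        return type(exc).__name__
    return "ok"


results = {
    # len(members), as in HTMLFormatter.py
    "len": attempt(lambda: len(member_keys())),
    # regular keys + digest keys, as in CalcRecips.py
    "concat": attempt(lambda: member_keys() + member_keys()),
    # all.sort(...), as in Cgi/admin.py
    "sort": attempt(lambda: member_keys().sort()),
    # plain iteration in a list comprehension
    "for": attempt(lambda: [m for m in member_keys()]),
    # membership test; works, but consumes the iterator while searching
    "in": attempt(lambda: "bob@example.com" in member_keys()),
}

# A generator is truthy even when it yields nothing, so a test such as
# `if mlist.getRegularMemberKeys():` would always succeed.
empty = (m for m in [])
always_true = bool(empty)
```

So iteration and `in` carry over, while `len()`, `+`, `.sort()`, and truth-testing would each need either a `list()` call or a dedicated accessor.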

    def getNumMembers(self, type, status, options, regexp=None):
        """Get the number of members of this mailing list matching type and
        status, and optionally matching the regular expression passed in.

        type is one of the module constants REGULAR, DIGEST, or EITHER.
        The tally should include just the appropriate type of members.

        status is a list containing some subset of the values ENABLED,
        UNKNOWN, BYUSER, BYADMIN, BYBOUNCE.  The tally should include only
        members whose status is in that list.

        options is a dictionary containing some number of {flag:boolean}
        pairs.  Only members with values matching that specified for each
        flag in the dictionary should be included in the tally.

        regexp is a string containing a regular expression to use as a filter.
        If this is not None, only members whose CPE or NAME match the regexp
        should be included in the tally.  (I don't have in mind a more
        efficient way to implement this in the SQL MemberAdaptor, unless the
        only non-token element in the regexp is ".*", as that could be matched
        using SQL's wild-card "%".  Just because it doesn't result in a more
        efficient implementation in this case doesn't mean it shouldn't be
        part of the interface, though.)
        """
        raise NotImplementedError

    def getMemberIterator(self, type, status, options, style, order,
                          regexp=None):
        """Get an iterator of members of this mailing list matching type and
        status, and optionally matching the regular expression passed in.

        type, status, options, and regexp are as used in getNumMembers().

        style is what content the iterator should contain: KEY, LCE, CPE, or
        NAME.  (If NAME, and some users have no RealName set, those
        iterator entries will be None.)

        order specifies in which order those items should be returned by the
        iterator: KEY, LCE, CPE, or NAME.  (This part hasn't been thought
        through as thoroughly--are there other interesting orderings?)
        """
        raise NotImplementedError

The rest of this message is for context since I'm responding to something 8 months old.
On Sun, 23 Dec 2001, Barry A. Warsaw wrote:
"JCL" == J C Lawrence <claw@kanga.nu> writes:
JCL> In 2.1 when used with external subscriber storage (eg SQL),
JCL> will the new equivalent of qrunner request and load the
JCL> entire subscriber DB onto the heap prior to broadcast?
That's really up to the implementation of the MemberAdaptor interface for SQL (but fwiw, I'm not aware of such a beast). Mailman's CalcRecips module loops through all the member addresses in a list comprehension, but all other information is requested a member at a time.
JCL> ObExcuse: Chap on -users asking about millions of
JCL> subscribers.
Cool! :)
On Sun, 23 Dec 2001, Barry A. Warsaw wrote:
Note that if it was too expensive for getRegularMemberKeys() to return an in-memory list, it could (if you use Python 2.2) return an iterator object that implemented things in a more efficient manner, e.g. by paging through blocks. I believe that any place where we expect a Python sequence (list) we could probably accept an iterator.
DN> Do you know if this is in fact the case? Is this how I should
DN> implement it?
Nope, never actually tried it. Note that in Python 2.2 you might actually want to try a generator, but again, I've not tried it at all.
DN> (While I've hacked in python fairly regularly, I've not done
DN> serious Python development since ~1996, so some of these
DN> questions will be fairly basic "How's it work in Python2.2?"
DN> ones. Of course, this brings up the question of whether or
DN> not it's reasonable to require that--don't we currently only
DN> require 2.1.3?)
Yes. Mailman 2.1 will work with Python 2.1.3 and beyond. The next version will require at least Python 2.2.1. So if you want to also support Py2.1, iterators and generators are out.
DN> Below are different ways the results of
DN> "get{Regular,Digest}MemberKeys" and "getMembers" are used in
DN> various places in the codebase. Will each of them still work
DN> if those methods return an iterator instead of a list? (For
DN> example, does len(iterator) work?)
len(iterator) doesn't work (unfortunately, IMO). So maybe we're screwed because we'd have to call list() on the thing to be able to give it to len(). This actually is the cause of a few recent bugs with email.Iterators.body_line_iterator(). OTOH, if the iterator object we return has an __len__() method, we should be okay.
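A minimal sketch of that last idea, in current Python 3: an iterable object whose `__len__()` asks the backend for a count instead of materializing every member. The class name `CountedMembers` and the `fetch_block`/`count` callables are hypothetical stand-ins for backend queries.

```python
class CountedMembers:
    """Iterable that also answers len() without loading all members.

    `fetch_block(offset, limit)` returns a list of at most `limit` items,
    and `count()` returns the total; both are hypothetical callables that
    a backend would supply (for SQL, a paged SELECT and a COUNT query).
    """

    def __init__(self, fetch_block, count, block_size=1000):
        self._fetch_block = fetch_block
        self._count = count
        self._block_size = block_size

    def __len__(self):
        return self._count()

    def __iter__(self):
        offset = 0
        while True:
            block = self._fetch_block(offset, self._block_size)
            if not block:
                return
            for item in block:
                yield item
            offset += len(block)


# Demo with a plain list standing in for the database.
data = ["m%d@example.com" % i for i in range(7)]
members = CountedMembers(
    fetch_block=lambda offset, limit: data[offset:offset + limit],
    count=lambda: len(data),
    block_size=3,
)
```

Because truth-testing falls back to `__len__()`, an object like this also behaves sensibly in `if members:`. Concatenation with `+` and in-place `.sort()` would still need separate handling.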
DN> I'm guessing that a good answer might be that we want to add
DN> more general-purpose accessor methods in class MemberAdaptor
DN> (and have the rest of the codebase use those where
DN> appropriate) so that if there's a more efficient way for a
DN> specific backend to provide specific data, it is able to do
DN> so. I'll include the two methods I propose at the end of this
DN> message.
You're probably right.
in Handlers/CalcRecips.py:

    # Calculate the regular recipients of the message
    recips = [mlist.getMemberCPAddress(m)
              for m in mlist.getRegularMemberKeys()
              if mlist.getDeliveryStatus(m) == ENABLED]

and:

    recips = mlist.getMemberCPAddresses(mlist.getRegularMemberKeys() +
                                        mlist.getDigestMemberKeys())

in Cgi/admin.py:

    if not mlist.nondigestable and mlist.getRegularMemberKeys():

and:

    if addr in mlist.getRegularMemberKeys():

and:

    # If there are more members than allowed by chunksize, then we split the
    # membership up alphabetically.  Otherwise just display them all.
    chunksz = mlist.admin_member_chunksize
    all = mlist.getMembers()
    all.sort(lambda x, y: cmp(x.lower(), y.lower()))

then:

    # BAW: There's got to be a more efficient way of doing this!
    names = [mlist.getMemberName(s) or '' for s in all]
    all = [a for n, a in zip(names, all)
           if cre.search(n) or cre.search(a)]

in HTMLFormatter.py:

    members = self.getRegularMemberKeys()
    for m in members:
        if not self.getMemberOption(m, conceal_sub):
            people.append(m)
    num_concealed = len(members) - len(people)

and:

    member_len = len(self.getRegularMemberKeys())
| def getNumMembers(self, type, status, options, regexp=None):
| """Get the number of members of this mailing list matching type and
| status, and optionally matching the regular expression passed in.
Would it be good enough to do something like the following:

    def getMatchingMembers(self, func):
        """Return a list of members for which function evaluates true.

        For each member in the database, call func(), passing in the
        member's subscribed address.  If func() returns true, the address
        is included in the returned list.
        """

likewise,

    def getMatchingMembersCount(self, func):
        """Return a count of the members for which function evaluates true.

        For each member in the database, call func(), passing in the
        member's subscribed address.  If func() returns true, that member
        is included in the count.
        """

This might be a little less efficient than your APIs because of the function call, but it's more flexible. OTOH, it might be too flexible (it'd be hard to run this in a database or translate to selects, etc.), so I'm willing to be persuaded.
Also getMatchingMembers() would have to return a list for Py2.1 compatibility, but could return an iterator or generator in Py2.2+.
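For illustration only, here is how the callback style might look against a toy in-memory store, in current Python 3. `InMemoryMembers`, `iterMatchingMembers`, and `isEnabled` are hypothetical names invented for this sketch; only `getMatchingMembers` and `getMatchingMembersCount` come from the proposal above.

```python
class InMemoryMembers:
    """Toy stand-in for a MemberAdaptor backed by a dict."""

    def __init__(self, members):
        # Maps address -> delivery enabled? (hypothetical data model)
        self._members = dict(members)

    def getMatchingMembers(self, func):
        """Return a list of addresses for which func(address) is true."""
        return [addr for addr in self._members if func(addr)]

    def iterMatchingMembers(self, func):
        """Generator variant (hypothetical name): yields matches lazily."""
        for addr in self._members:
            if func(addr):
                yield addr

    def getMatchingMembersCount(self, func):
        """Count the matches without building a list."""
        return sum(1 for _ in self.iterMatchingMembers(func))

    def isEnabled(self, addr):
        """Example predicate: is delivery enabled for this address?"""
        return self._members[addr]


mlist = InMemoryMembers({
    "anne@example.com": True,
    "bob@example.org": False,
    "cate@example.com": True,
})
enabled = mlist.getMatchingMembers(mlist.isEnabled)
dot_com = mlist.getMatchingMembersCount(lambda a: a.endswith(".com"))
```

Note that every address still passes through Python here, which is the flexibility-versus-pushdown trade-off raised above: an arbitrary `func` cannot, in general, be turned into a SELECT.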
-Barry
On Tue, 20 Aug 2002, Barry A. Warsaw wrote:
Yes. Mailman 2.1 will work with Python 2.1.3 and beyond. The next version will require at least Python 2.2.1. So if you want to also support Py2.1, iterators and generators are out.
Right. That's probably a good enough reason to not even attempt getting this into Mailman 2.1 (but I would like to eventually see it in Mailman rather than as a patch, as patches suffer from bit-rot quickly in systems undergoing active development).
len(iterator) doesn't work (unfortunately, IMO). So maybe we're screwed because we'd have to call list() on the thing to be able to give it to len().
Or have another way to not ask for the data from the MemberAdaptor, but rather ask for the number of data items. (So we never need to call len() on the results.)
OTOH, if the iterator object we return has an __len__() method, we should be okay.
That's not a bad idea, but might cause portability problems down the road...
| def getNumMembers(self, type, status, options, regexp=None):
| """Get the number of members of this mailing list matching type and
| status, and optionally matching the regular expression passed in.
Would it be good enough to do something like the following:

    def getMatchingMembers(self, func):
        """Return a list of members for which function evaluates true.

        For each member in the database, call func(), passing in the
        member's subscribed address.  If func() returns true, the address
        is included in the returned list.
        """

I don't want to have to pull the data for every possible match into Python to decide whether it should be included--I want to be able to ask the DB to decide that for us given specific constraints.
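A hedged sketch of what "asking the DB to decide" could look like for the count case, in current Python 3 with `sqlite3` as a stand-in backend. The `members(address, digest, status)` schema and the function name `get_num_members` are assumptions made for this example, and the `options` and `regexp` arguments of the proposed `getNumMembers()` are left out for brevity.

```python
import sqlite3

REGULAR, DIGEST, EITHER = "regular", "digest", "either"


def get_num_members(conn, type, status):
    """Count members of the given type whose status is in `status`.

    Hypothetical schema: members(address, digest, status).  The filtering
    happens inside the database; only a single number comes back.
    (`type` shadows the builtin to mirror the proposed signature.)
    """
    if not status:
        return 0  # an empty status list can match nobody
    clauses = ["status IN (%s)" % ", ".join("?" * len(status))]
    params = list(status)
    if type != EITHER:
        clauses.append("digest = ?")
        params.append(1 if type == DIGEST else 0)
    sql = "SELECT COUNT(*) FROM members WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchone()[0]


# Demo data: four made-up subscribers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members (address TEXT, digest INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [
        ("anne@example.com", 0, "ENABLED"),
        ("bob@example.com", 0, "BYBOUNCE"),
        ("cate@example.com", 1, "ENABLED"),
        ("dave@example.com", 1, "BYUSER"),
    ],
)
regular_enabled = get_num_members(conn, REGULAR, ["ENABLED"])
```

Only placeholders are interpolated into the SQL text; the status values themselves are passed as bound parameters.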
OTOH, it might be too flexible (it'd be hard to run this in a database or translate to selects, etc.)
Exactly.
so I'm willing to be persuaded.
I'd like to present to this function all the relevant information so that it can decide internally who matches and who does not. The API I designed does fit that bill, although I'm certain it's probably not ideal. Please offer suggestions if there's anything in there that you'd like to see changed. For now I'll just run with what I've got.
Also getMatchingMembers() would have to return a list for Py2.1 compatibility, but could return an iterator or generator in Py2.2+.
While I hope few mailing lists have millions of subscribers, since we'd like this tool to be able to handle that, I think I'm going to use iterators/generators (now I have to go learn about them! :-). If you think it's important to have this work w/Py2.1, I am willing to be persuaded otherwise.
-Dale
participants (3)
- barry@zope.com
- Dale Newfield
- J C Lawrence