Re: [Mailman-Users] Allowing users to join without specifying passwords

On Friday, June 15, 2001, at 01:19 PM, Barry A. Warsaw wrote:
> CVR> points. but we need to quantify what those points are and what the impact is, so we can decide just how to move forward on this.
> I'd love to see any statistic you (or anybody) gathers on this subject. It's definitely intriguing, but right now I don't have the time or systems to do this kind of data gathering.
Okay, here's a first cut at some data.
I'm going to assume the following:
1000 subscribers -- no digest subscribers to simplify this. Assume just individual messages.
The message size is 10K, including header.
The bandwidth needed to generate a connection to send a message is 1K (which is pretty close)
The bandwidth needed to add an address to an existing message is about 1/10 of a K (also pretty close).
The practical limit to the number of addresses you can piggyback is 100, since this is specified in RFC2821 as the smallest number a site is REQUIRED to take. In practice, due to non-conformant sites, you have to be careful setting it beyond 50 these days, because sites set this number down because they think it slows down the spammers (I'm yet to be convinced it makes a damn bit of difference, especially since MTAs like postfix recognize the 452 and auto-adjust now. This is another place where sendmail seems behind the technology curve, FWIW).
How much bandwidth is used depends on these factors:
what your piggyback value is (in mailman, it's SMTP_MAX_RCPTS)
how many domains have > 1 subscriber.
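To make those two factors concrete, here is a minimal Python sketch of grouping a recipient list by domain and cutting each group into batches of at most SMTP_MAX_RCPTS addresses. It is an illustration only, not Mailman's actual delivery code; the function name chunk_by_domain and the example.* addresses are made up.

```python
from collections import defaultdict


def chunk_by_domain(addresses, max_rcpts):
    """Group addresses by domain, then split each group into batches.

    Each returned batch stands for one SMTP transaction: one copy of the
    message plus up to max_rcpts recipients at the same domain.
    """
    by_domain = defaultdict(list)
    for addr in addresses:
        domain = addr.rsplit("@", 1)[-1].lower()
        by_domain[domain].append(addr)

    batches = []
    for domain in sorted(by_domain):
        rcpts = by_domain[domain]
        for start in range(0, len(rcpts), max_rcpts):
            batches.append(rcpts[start:start + max_rcpts])
    return batches


# Hypothetical subscribers: three at one domain, one at another.
subscribers = ["a@aol.example", "b@aol.example", "c@aol.example", "d@home.example"]
for batch in chunk_by_domain(subscribers, 2):
    print(len(batch), batch)
```

With a limit of 2, the three-subscriber domain needs two transactions and the single-subscriber domain needs one, which is why both the limit and the domain distribution matter.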
Here's how plaidworks breaks down:
3101 subscribers across 1287 domains. that's an average of 2.4 subscribers per domain, but the numbers skew wildly, so averages are meaningless.
Here's how my site breaks down:
# of subscribers   # of domains/# of users
  1   263/263
  2   142/284
  3   40/120
  4   19/76
  5   16/80
  6   10/60
  7   7/49
  8   3/24
  9   6/54
 10   2/20
 11   2/22
 12   2/24
 13   1/
 14   1/
 16   1/
 17   1/ (worldnet.att.net)
 22   1/ (juno.com)
 29   1 (mindspring.com)
 30   1 (pacbell.net)
 35   1 (plaidworks.com)
 43   1 (sympatico.ca)
 53   1 (earthlink.net)
150   1 (home.com)
173   1 (yahoo.com)
228   1 (hotmail.com)
441   1 (aol.com)
if you're scoring at home, 32% of subscribers come from the last 4 domains: 5% for home and yahoo, 7% for hotmail, and 14% for aol. those are your 500 pound gorillas (AOL is 800 pounds), and piss them off at your own risk.
At the other end, 8% of your users are the only subscriber from a domain. 16% are 1 or 2 per domain. 26% are on sites with 5 or fewer subscribers.
Time for some numbers.
Back to the 1000 member list for simplicity. The subscriber list breaks down to:
85 - 1/85
45 - 2/90
12 - 3/36
6 - 4/24
[...]
48 - 1
55 - 1
73 - 1
142 - 1
That's 553, or 55% of the subscribers, wedged tightly on both ends of the curve. We can extrapolate what they'll do to bandwidth from the end cases if we need to.
Extreme case: SMTP_MAX_RCPTS = 1.
1000 subscribers * (10K message size + 1K overhead) = 11,000K bytes bandwidth.
Extreme case: SMTP_MAX_RCPTS = 100
These get sent down the line this way:
85 * 11K
45 * (1 * 11K + 1 * .1K)
12 * (1 * 11K + 2 * .1K)
6 * (1 * 11K + 3 * .1K)
[...]
1 * 11K + 47 * .1K
1 * 11K + 54 * .1K
1 * 11K + 72 * .1K
2 * 11K + 140 * .1K
Do you see how I got these numbers? In the case of the 12 domains with three subscribers, you have to make an 11K connection for the first message, and piggyback the other two addresses at .1K each. You don't really see huge savings until the big domains, and you'll see AOL goes over the 100 address limit so gets split into two different messages.
For this 55%, the SMTP=1 is 6050K. For 100, it's 1711K bytes. That's 28% of the first number, so we're cutting 72% of the bandwidth by chunking at 100. The tradeoff is performance, though -- it takes a lot longer to deliver those AOL addresses, because if you split it into two batches, you can't parallelize the delivery. Package up 100 AOL addresses in one batch, none of them get delivered until all 100 addresses are sent to AOL and accepted. It's much faster to send them as ten batches of ten in parallel -- but that's the trade off here. Cut network bandwidth but slow delivery to the larger domains.
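The two end cases can be reproduced with a short Python sketch. It uses only the assumptions stated earlier (10K message, 1K per connection, .1K per extra address) and the 553 subscribers listed; the function and variable names are illustrative, and the elided middle of the list is left out.

```python
import math

# Assumptions from the post: 10K message, 1K per connection, .1K per extra address.
MESSAGE_K = 10.0
CONNECTION_K = 1.0
EXTRA_RCPT_K = 0.1

# (subscribers per domain, number of such domains) for the 553 subscribers
# listed above; the elided middle of the list is not included.
DOMAINS = [(1, 85), (2, 45), (3, 12), (4, 6), (48, 1), (55, 1), (73, 1), (142, 1)]


def domain_cost_k(subscribers, max_rcpts):
    """Bandwidth in K to send one message to every subscriber at one domain."""
    connections = math.ceil(subscribers / max_rcpts)
    piggybacked = subscribers - connections
    return connections * (MESSAGE_K + CONNECTION_K) + piggybacked * EXTRA_RCPT_K


def total_cost_k(domains, max_rcpts):
    """Sum the per-domain cost over a list of (size, count) pairs."""
    return sum(count * domain_cost_k(size, max_rcpts) for size, count in domains)


unbatched = total_cost_k(DOMAINS, 1)
batched = total_cost_k(DOMAINS, 100)
print(f"SMTP_MAX_RCPTS = 1:   {unbatched:.0f}K")
print(f"SMTP_MAX_RCPTS = 100: {batched:.0f}K ({batched / unbatched:.0%} of unbatched)")
```

Computed exactly, the sketch gives about 6083K and 1723K, slightly above the rounded 6050K and 1711K quoted here, and the same 28% ratio.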
Okay, let's look at a case in the middle. SMTP_MAX = 5. The ones with fewer than 5 don't change, but the big domains do.
85 * 11K
45 * (1 * 11K + 1 * .1K)
12 * (1 * 11K + 2 * .1K)
6 * (1 * 11K + 3 * .1K)
[...]
1 * (10 * 11K + 38 * .1K)
1 * (11 * 11K + 44 * .1K)
1 * (15 * 11K + 58 * .1K)
1 * (29 * 11K + 113 * .1K)
that works out to (trust me) about 2378K, or about a 60% reduction.
Let's try SMTP_MAX = 2.
85 * 11K
45 * (1 * 11K + 1 * .1K)
12 * (2 * 11K + 1 * .1K)
6 * (2 * 11K + 2 * .1K)
[...]
1 * (10 * 11K + 38 * .1K)
1 * (11 * 11K + 44 * .1K)
1 * (15 * 11K + 58 * .1K)
1 * (29 * 11K + 113 * .1K)
that works out to 2575K, or about a 57% cut.
By a rough look at those domains in the middle, I'd say these numbers are good +-10%.
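For readers who want to plot the curve, here is a rough Python sketch that sweeps SMTP_MAX_RCPTS over the same 553 subscribers under the same assumptions (10K message, 1K per connection, .1K per extra address). It assumes a domain with n subscribers is always split into ceil(n / limit) connections, and the names are illustrative. Note that under that assumption the four big domains need 24, 28, 37 and 71 connections at a limit of 2, not the 10, 11, 15 and 29 of the limit-5 case, so the sketch puts the limit-2 total near 3609K rather than 2575K; the limit-5 total comes out at about 2377K.

```python
import math

PER_CONNECTION_K = 11.0  # 10K message + 1K connection overhead (from the post)
PER_EXTRA_RCPT_K = 0.1   # each extra address on an existing connection

# subscribers per domain -> number of such domains (the 553 subscribers listed)
SIZES = {1: 85, 2: 45, 3: 12, 4: 6, 48: 1, 55: 1, 73: 1, 142: 1}


def bandwidth_k(max_rcpts):
    """Total bandwidth in K for one posting at a given recipient limit."""
    total = 0.0
    for size, n_domains in SIZES.items():
        connections = math.ceil(size / max_rcpts)
        extra = size - connections
        total += n_domains * (connections * PER_CONNECTION_K + extra * PER_EXTRA_RCPT_K)
    return total


baseline = bandwidth_k(1)
for limit in (1, 2, 5, 10, 50, 100):
    k = bandwidth_k(limit)
    print(f"{limit:>3}  {k:8.1f}K  {1 - k / baseline:6.1%} saved")
```

Swapping in a site's own domain histogram for SIZES is the quickest way to see where its curve flattens.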
What's this mean? Here's the executive summary:
The network penalty between SMTP_MAX = 1 (effectively VERP) and any kind of batching (SMTP > 1) is roughly 50%. To get VERP or customized footers or customized anything, you double your network bandwidth.
There is very little advantage to setting SMTP_MAX > 5, UNLESS your subscriber base is heavily stratified onto very few sites. If you have really large groups of subscribers on AOL or Hotmail, it can help cut network bandwidth, but at best, it seems to be about a 10% improvement. If you plot the numbers I did on a curve, you can see just how little advantage you get by increasing the number. You get almost all of the advantage by going to 2, and the line past 5 is very flat....
Interesting -- I honestly didn't expect to see THIS big a difference -- I was expecting more like 25-30% increase in bandwidth for a VERP-type delivery.
My thoughts on what this means to future directions:
Customized messages (VERPing, or encoded unsub URLs, or all of that...) should definitely be an option in Mailman 2.1.
I would set Mailman 2.1's default to have this turned ON, giving us the customized unsub links, etc., but document this for users so they know to turn it off on slow networks.
If users turn it off, I recommend that SMTP_MAX be set by default to 5, and that we document that it makes little sense to change it unless a site is horribly network limited, because even setting to the max only gains them another 10% (and if they're THAT network limited, they're seriously asking for trouble anyway), and only if their subscriber base fits a profile that lends itself to the compression. Setting it large also leaves them open to spamblocking by systems that don't necessarily follow the standards or act right, too.
We should ALSO note here that some MTAs (postfix, for instance) might override SMTP_MAX anyway -- you could set it to 100, but postfix might be configured smaller, so they have to be aware of those potential interactions. you then get into the issues of tuning all this, with few delivery threads with lots of addresses vs many threads in parallel.. and all that fun -- I guess I'm trying to say that you can't tune mailman in isolation from the MTA (and down that road lies a huge rathole of attempting to document this stuff...)
But from these numbers, any 2.0.x version of mailman should set SMTP_MAX to between 2 and 5, unless they're horribly network limited. it makes no sense to be larger than 5, and it makes no sense to be 1 unless you've done some kind of VERPing patch.
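For a 2.0.x site that follows this advice, the change is a one-line override. A sketch, assuming site overrides live in mm_cfg.py and that the setting keeps the SMTP_MAX_RCPTS name mentioned earlier; 5 is simply the top of the range suggested here, not a tested optimum.

```python
# mm_cfg.py (site overrides), placed after the stock "from Defaults import *" line.
# Cap the number of recipients handed to the MTA in a single SMTP transaction.
SMTP_MAX_RCPTS = 5
```

If the MTA enforces its own, smaller recipient limit, that limit still wins, so the two need to be tuned together.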
for 2.1, we want to implement these customizations and default them on, but with a 50% network hit, we definitely want to make it clear what's going on and make it possible for them to turn it off and return to a generic URL and non-customized e-mail.
Barry's mileage may vary on his preferences for default, of course, and it's his show. I think the advantages of the customized URL/email capability is a huge one and most sites will benefit from it -- but the network hit might kill some sites, so we have to give them an easy ability to turn the feature off.
What do y'all think? I've included mailman-developers on this reply, since while this started on mm-users, it really ought to be discussed on the developers list...
-- Chuq Von Rospach, Internet Gnome <http://www.chuqui.com> [<chuqui@plaidworks.com> = <me@chuqui.com> = <chuq@apple.com>] Yes, yes, I've finally finished my home page. Lucky you.
Yes, I am an agent of Satan, but my duties are largely ceremonial.

@ Chuq Von Rospach (chuqui@plaidworks.com) :
> Okay, here's a first cut at some data.
Very interesting! Doubling the bandwidth needed might or might not be a problem. My system delivers 10K messages at the mean rate of 18000 per hour. When I send a message down a 130 000 subscribers' list, well, I have to wait quite long before it's gone; maybe I should buy some more bandwidth. I'm ready to take up the VERP approach, even if it means sends that take twice as long.
However there's another performance issue: how does postfix react when it gets sent 130 000 emails to store and forward? Currently I send it about 10000 messages that it breaks up one by one using. I don't know about memory or disk issues there, but 130 000 * 10k = 1.3 Gbytes on disk: there might be more to consider than speed issues. I'd love to see it working that way though. Would spare a lot of users' and admins' nerves.
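The figures in this message can be checked with a few lines of Python. This is back-of-the-envelope arithmetic only: it assumes the send time scales linearly with bandwidth and that every copy sits in the queue at once, which is the worst case. The variable names are made up for the example.

```python
SUBSCRIBERS = 130_000
MESSAGES_PER_HOUR = 18_000  # current mean delivery rate
MESSAGE_KB = 10

# Time to push one posting through at the current rate.
hours_now = SUBSCRIBERS / MESSAGES_PER_HOUR

# Assumption: doubling the bandwidth needed doubles the send time.
hours_doubled = 2 * hours_now

# Worst-case spool size if all copies were queued before any were sent
# (decimal units: 1 GB = 1,000,000 KB).
spool_gb = SUBSCRIBERS * MESSAGE_KB / 1_000_000

print(f"send time now:     {hours_now:.1f} h")
print(f"send time doubled: {hours_doubled:.1f} h")
print(f"worst-case spool:  {spool_gb:.1f} GB")
```

That is a bit over 7 hours per posting today and roughly 14 hours with one copy per subscriber, with at most 1.3 GB queued.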
(By the way, currently my larger lists are not handled by Mailman but by sympa, 'coz I needed to keep a copy of the subscribers' names in order to spot them more easily in the list when they want to unsubscribe and don't know what their address was...)
-- Fil

On Sunday, June 17, 2001, at 03:20 AM, Fil wrote:
> My system delivers 10K messages at the mean rate of 18000 per hour. When I send a message down a 130 000 subscribers' list, well, I have to wait quite long before it's gone; maybe I should buy some more bandwidth.
It depends on how time sensitive the stuff is and when you send it. If you send out at 10PM and expect people to read it when they wake up, does it matter if they actually get it at 3AM or 6AM? But if it's time sensitive -- you have to worry about that issue. If the delay adds a few minutes, no big deal (but if it adds a few minutes * 100 messages a day; big deal, especially for the messages at the end of the day where you can get the "bad weather over O'Hare" problem...)
> However there's another performance issue: how does postfix react when it gets sent 130 000 emails to store and forward?
Disk I/O. That's the other issue here. You're adding to your network usage, you're also adding to your Disk I/O problem.
> Currently I send it about 10000 messages that it breaks up one by one using. I don't know about memory or disk issues there, but 130 000 * 10k = 1.3 Gbytes on disk:
It's a problem. network and disk are (IMHO) the two big performance issues in delivering e-mail (at least the two under your control. The third is the speed at which receiving machines will accept messages, but you can't buy everyone in the universe faster e-mail servers...)
the amount of disk it takes isn't an issue (within reason) -- remember, it's going to start sending right away, so messages will be gone out of the queue. You don't queue it all up and then start delivery. But this generates disk IO, and that can start bogging you down if you don't tune things properly. That means taking advantage of sub-folders in your mail queue for any MTA that allows them (the #1 performance death for a typical sendmail system: write-locks on /var/spool/mqueue, since every sendmail process has to create files in that directory. If you haven't set up subfolders, you are (I say this in a nice way) an idiot. If you aren't using a version of sendmail that allows for them, I'll call you an idiot in a not-nice way. Anyone not running at least 8.10 is hosed, so forget them...)
Most MTAs are sensitive to this and work to minimize the impact. Even sendmail finally figured it out. But you still need, in large e-mail environments, to look at splitting this across heads and spindles. My experiments have indicated you're better off having mail on separate spindles than you are building a RAID using those same spindles, for whatever that's worth. And if you have lots of RAM, you can start using ram disks, and then you have lots of fun... (yes, I've done that. It's amazing how much faster sendmail is when you remove the disk I/O on those directory inodes...)
-- Chuq Von Rospach, Internet Gnome <http://www.chuqui.com> [<chuqui@plaidworks.com> = <me@chuqui.com> = <chuq@apple.com>] Yes, yes, I've finally finished my home page. Lucky you.
95% of being a net.god is sounding persuasive and convincing people you know what you're talking about, even when you're making it up as you go along. (chuq von rospach, 1992)

On Sun, 17 Jun 2001, Chuq Von Rospach wrote:
> For this 55%, the SMTP=1 is 6050K. For 100, it's 1711K bytes. That's 28% of the first number, so we're cutting 72% of the bandwidth by chunking at 100. The tradeoff is performance, though -- it takes a lot longer to deliver those AOL addresses, because if you split it into two batches, you can't parallelize the delivery.
Please don't make this assumption. It is true for the commonly used Unix MTAs, but it is not true for all MTAs. My MTA has no problems with parallelizing delivery out of a single received message.
I agree that a smaller number would make sense for the default though, as I'm probably the only one here who isn't using a Unix-based MTA to do their deliveries. As long as it is still configurable I am happy.
Your method for figuring out bandwidth usage is interesting, and I think I'll do something similar for the recipient base and message sizes on my system. 10k is much larger than my average message size, but doing the same thing for digests (30k, and 65% of my readers are in digest mode) would be interesting. I'll report back to the list with my results if anyone else is interested.
> I would set Mailman 2.1's default to have this turned ON, giving us the customized unsub links, etc., but document this for users so they know to turn it off on slow networks.
I would argue that it should default to OFF, as this is how Mailman has behaved for a couple of years. As long as I can easily turn it off before completing the install (by changing mm_cfg.py for instance) I am happy.
alex

On Sunday, June 17, 2001, at 06:21 AM, alex wetmore wrote:
> Please don't make this assumption. It is true for the commonly used Unix MTAs, but it is not true for all MTAs.
you're misreading what I was doing here -- I'm looking at this based on how it goes over the wire, not how it's delivered to the MTA. So it doesn't matter if your MTA repackages it going over the wire or not -- I'm assuming it doesn't for simplicity, but if it does, you simply have to change the values to take that into consideration (and, for what it's worth, I did point out that postfix CAN do this, although I didn't clearly tie that back to this, because I thought the message was too long and opaque already...)
> Your method for figuring out bandwidth usage is interesting, and I think I'll do something similar for the recipient base and message sizes on my system.
It's just a rough attempt to get in the ballpark, but I think the numbers are going to be fairly good for the general SMTP protocol.
> 10k is much larger than my average message size,
true. It was convenient (especially for me, since I needed that data for my big emarketing machine anyway, and our messages are 10-14K and 35-45K). If you go to smaller messages, the advantage of the buddying-up drops (for a 1K message, instead of N * 11K + M * .1K, it's N * 2K + M * .1K) and the protocol overhead becomes more important. For larger messages, the advantage grows. For 1-2K messages, you might see the advantage drop to 30% or less, I haven't done the math.
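One way to do that math, as a hedged Python sketch: it reuses the 553-subscriber breakdown from the first message and the same 1K-per-connection and .1K-per-address assumptions, and only varies the message size. The names are illustrative, and connection caching is ignored just as it is in the rest of this model.

```python
import math

CONNECTION_K = 1.0   # per-connection overhead assumed in the thread
EXTRA_RCPT_K = 0.1   # per extra address on an existing connection

# subscribers per domain -> number of such domains (the 553 subscribers listed)
SIZES = {1: 85, 2: 45, 3: 12, 4: 6, 48: 1, 55: 1, 73: 1, 142: 1}


def saving(message_k, max_rcpts):
    """Fraction of bandwidth saved by batching, relative to one recipient per send."""
    batched = 0.0
    unbatched = 0.0
    for size, n_domains in SIZES.items():
        connections = math.ceil(size / max_rcpts)
        per_send = message_k + CONNECTION_K
        batched += n_domains * (connections * per_send + (size - connections) * EXTRA_RCPT_K)
        unbatched += n_domains * size * per_send
    return 1 - batched / unbatched


for message_k in (1, 2, 10, 40):
    print(f"{message_k:>2}K message: {saving(message_k, 100):.1%} saved at a limit of 100")
```

Under these assumptions the saving at a limit of 100 only moves from roughly 72% for 10K messages to roughly 69% for 1K messages, because it is driven mainly by how many connections batching avoids; a different per-address cost would change that.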
This also ignores the MTA's ability to cache connections, by the way. But that's really a random process and impossible to model this way.
> I would argue that it should default to OFF
I'm not surprised. It's Barry's call, but I think the customized URL is useful enough we want people to use it unless they have to turn it off, we don't want to have to try to convince the people who install stuff and leave everything defaulted to turn it on.
-- Chuq Von Rospach, Internet Gnome <http://www.chuqui.com> [<chuqui@plaidworks.com> = <me@chuqui.com> = <chuq@apple.com>] Yes, yes, I've finally finished my home page. Lucky you.
It's not the pace of life that concerns me, it's the sudden stop at the end.
participants (3): alex wetmore, Chuq Von Rospach, Fil