What we really want to know is how many (non-empty) Message-ID collisions there are that *don't* share a Date. This is the number of messages that the Message-ID-only method loses and that the composite identifier method would not lose.
It took longer than expected, but I now have numbers from looking at 2,151,896 messages spread over a few thousand lists. The appended script was run over a set of MH format raw messages.
704 messages fall into this category. Of these, 596 come from a single (malfunctioning and duplicate-spewing) list server. I have not yet examined the remaining 208 messages, but I'll bet that many of them also have duplicate message bodies or are spam. So for this data set, we have an upper bound of 0.01% of messages in this category, and the true figure is possibly significantly lower.
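#!/bin/bash
#
# Look for messages that
#
#   Do collide with message-id
#   Don't collide with message-id + date

# Assumption: DIR holds one MH folder per list (folder names contain '@').
DIR=${DIR:-.}
C1=0    # running total of messages examined
C2=0    # running total of collisions found

# Print "DATE | MESSAGE-ID" for every message in list folder $1 whose
# Message-ID is repeated but whose (Date, Message-ID) pair is not.
get_interesting_messages() {
    (cd "$DIR/$1" || exit
     for j in $(ls -U); do
         MSG_ID=$(822field message-id < "$j")
         MSG_DATE=$(822field date < "$j")
         if [ "$MSG_ID" != "" ]; then
             echo $MSG_DATE "|" $MSG_ID
         fi
     done) |
    # Assumption: uniq only compares adjacent lines, so group by Message-ID first.
    sort --field-separator='|' --key=2 |
    # Keep every line whose Message-ID is repeated. GNU coreutils uniq does not
    # document a --separator option, so this presumably relies on a local uniq
    # that splits fields on '|'.
    uniq --separator='|' --skip-fields=1 --all-repeated |
    # Of those, keep only lines whose Date + Message-ID pair occurs once.
    uniq --unique
}

for i in $(ls "$DIR" | grep @); do
    DUP=$(get_interesting_messages "$i")
    # grep -c '' counts the last line even without a trailing newline
    DUP_CNT=$(printf '%s' "$DUP" | grep -c '')
    MSG_CNT=$(cd "$DIR/$i" && ls -U | wc -w)
    C1=$(( C1 + MSG_CNT ))
    C2=$(( C2 + DUP_CNT ))
    if [ "$DUP_CNT" != 0 ]; then
        echo
        echo "=== collisions/messages: $C2/$C1 $i"
        echo "$DUP"
    else
        echo -n . 1>&2
    fi
done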
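
For anyone who wants to try the filtering step without the field-separator handling in uniq, here is a rough sketch of the same idea using only awk. It is not the script I ran. The function name id_only_collisions and the sample lines are made up for illustration, and it assumes input lines of the form DATE|MESSAGE-ID with no '|' inside either field. It keeps a line only if its Message-ID appears more than once while its full Date + Message-ID pair appears exactly once.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): reads
# "DATE|MESSAGE-ID" lines on stdin and prints those whose Message-ID
# is repeated but whose Date + Message-ID pair is not.
id_only_collisions() {
    awk -F'|' '
        # Remember every line, and count Message-IDs and whole lines.
        { line[NR] = $0; ids[$2]++; pairs[$0]++ }
        END {
            for (n = 1; n <= NR; n++) {
                split(line[n], f, "|")
                if (ids[f[2]] > 1 && pairs[line[n]] == 1)
                    print line[n]
            }
        }'
}

# Made-up sample: three messages share one Message-ID, two of them also
# share a Date, and a fourth message has an ID of its own.
printf '%s\n' \
    'Mon, 1 Jan 2007 10:00:00 -0500|<a@example.org>' \
    'Mon, 1 Jan 2007 10:00:00 -0500|<a@example.org>' \
    'Tue, 2 Jan 2007 09:30:00 -0500|<a@example.org>' \
    'Wed, 3 Jan 2007 08:00:00 -0500|<b@example.org>' |
    id_only_collisions
```

On this sample only the Tuesday line is printed. The awk version holds every line in memory, which is fine per list but would need rethinking for one very large folder.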
-Dale
_______________________________________________
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers