[Mailman-Developers] Improving the archives
Jeff Breidenbach
jeff at jab.org
Thu Aug 2 04:17:44 CEST 2007
> What we really want to know is how many (non-empty) Message-ID
> collisions are there that *don't* share a Date? This is the number of
> messages that only-messageid loses, and that the composite identifier
> method would not lose.
It took longer than expected, but I now have numbers from
looking at 2,151,896 messages spread over a few thousand
lists. The appended script was run over a set of MH format
raw messages.
704 messages fall into this category. Of these, 596 come from a
single (malfunctioning and duplicate-spewing) list server. I have
not yet examined the remaining 108 messages, but I'll bet anything
many also have duplicate message bodies. Or are spam. So for this
data set, we have an upper bound of about 0.03% of messages in this
category, possibly significantly less.
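For anyone who wants to check the proportions, here is a rough sketch of the arithmetic. The three counts are the ones quoted in this message; the variable names and the rate helper are made up for illustration.

```shell
#!/bin/bash
# Counts quoted in this message (variable names are illustrative only).
TOTAL=2151896      # messages examined
COLLIDE=704        # same Message-ID, different Date
ONE_SERVER=596     # of those, from the single malfunctioning list server

# rate NUM DENOM: print NUM/DENOM as a percentage with four decimals.
rate() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.4f\n", 100 * n / d }'
}

REST=$(( COLLIDE - ONE_SERVER ))
echo "not yet examined: $REST"
echo "upper bound: $(rate "$COLLIDE" "$TOTAL")%"
echo "without the one server: $(rate "$REST" "$TOTAL")%"
```

The last figure is only indicative, since the remaining messages have not been inspected.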
Jeff
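#!/bin/bash
#
# Look for messages that
#
#   Do collide with message-id
#   Don't collide with message-id + date

DIR=/home/archive/Mail
C1=0    # running total of messages seen
C2=0    # running total of interesting messages

# Print "date | message-id" for each message in list $1 whose
# Message-ID is shared with another message while its
# (Message-ID, Date) pair is unique. Runs in a subshell when
# called via $(...), so the cd does not leak.
get_interesting_messages() {
    cd "$DIR/$1" || return
    for j in $(ls -U); do
        MSG_ID=$(822field message-id < "$j")
        MSG_DATE=$(822field date < "$j")
        if [ -n "$MSG_ID" ]; then
            # Left unquoted on purpose: collapses stray whitespace.
            echo $MSG_DATE "|" $MSG_ID
        fi
    done |
    awk '{ line[NR] = $0; id[NR] = $0
           sub(/^[^|]*[|] /, "", id[NR])    # strip "date | "
           ids[id[NR]]++; pairs[$0]++ }
         END { for (n = 1; n <= NR; n++)
                   if (ids[id[n]] > 1 && pairs[line[n]] == 1)
                       print line[n] }' |
    sort
}

for i in $(ls "$DIR" | grep @); do
    DUP=$(get_interesting_messages "$i")
    # Count non-empty lines; wc -l would miss the unterminated last one.
    DUP_CNT=$(printf '%s' "$DUP" | grep -c .)
    MSG_CNT=$(cd "$DIR/$i" && ls -U | wc -w)
    C1=$(( C1 + MSG_CNT ))
    C2=$(( C2 + DUP_CNT ))
    if [ "$DUP_CNT" -ne 0 ]; then
        echo
        echo "=== collisions/messages: $C2/$C1 $i"
        echo "$DUP"
    else
        echo -n . 1>&2
    fi
done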
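
The selection rule can be tried without an MH archive. The following is only a sketch: the sample lines and the two function names are invented for illustration, and the awk filter is one possible way to express "Message-ID repeats, but Message-ID plus Date does not". It is not necessarily how the numbers above were produced.

```shell
#!/bin/bash
# Hypothetical "date | message-id" lines, in the format the script prints.
sample() {
    cat <<'EOF'
Mon, 1 Jan 2007 10:00:00 +0000 | <a@example.org>
Mon, 1 Jan 2007 10:00:00 +0000 | <a@example.org>
Tue, 2 Jan 2007 11:00:00 +0000 | <b@example.org>
Wed, 3 Jan 2007 12:00:00 +0000 | <b@example.org>
Thu, 4 Jan 2007 13:00:00 +0000 | <c@example.org>
EOF
}

# Keep lines whose Message-ID occurs more than once
# but whose full "date | message-id" line occurs exactly once.
interesting() {
    awk '{ line[NR] = $0; id[NR] = $0
           sub(/^[^|]*[|] /, "", id[NR])    # strip "date | "
           ids[id[NR]]++; pairs[$0]++ }
         END { for (n = 1; n <= NR; n++)
                   if (ids[id[n]] > 1 && pairs[line[n]] == 1)
                       print line[n] }'
}

sample | interesting
```

Here the two <a@example.org> lines are exact duplicates and drop out, <c@example.org> never collides, and only the two <b@example.org> lines, which share an ID but differ in Date, are reported.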
>
> -Dale