Private archives available via Internet

It has come to our attention that our Mailman archives are available via the Internet, even though they are designated as private, for viewing by mailing list members only.
If you do a search on Google or any other search engine, you can find any message that was posted to the mailing list. This is a problem for our private mailing lists.
Is there a way to ensure that this is not available? Also how do you get read of messages in the archives?
Nancy Montano
-- Nancy M. Montano || 224 Cruz Alta Rd, #F || Taos, NM 87571 Webmaster/Content Coord || nmontano@laplaza.org || http://www.laplaza.org
La Plaza Telecommunity || [V] 505-758-1836 || [F] 505-751-1812
"Aprender es avanzar"

At 08:47 28/03/2002 -0700, you wrote:
What are the URL paths being returned by the search engine?
Do they point to the web server delivering your Mailman web GUI? This is not such a dumb question as it might appear. It is entirely possible for a list subscriber to direct their incoming mail from the list to their own archive, and it may be this source that the search engines are referencing: potentially the same content as your Mailman archive, just at a different location/URL.
If the URLs being returned point to the web server delivering your Mailman web GUI, do they begin with the public archive alias (default /pipermail/) or the private script alias (default /mailman/private/) or some other path?
This could give some clue as to how a search engine's indexer gained access to the private mail archives you are concerned about.
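To make that triage concrete, here is a minimal Python sketch that sorts search-result URLs into the cases described above. The two path prefixes are the defaults mentioned in this message; the function name and the host lists.example.org are hypothetical, and a site with customised aliases would need different prefixes.

```python
from urllib.parse import urlparse

# Default aliases named above; a given site may have configured others.
PUBLIC_ALIAS = "/pipermail/"
PRIVATE_ALIAS = "/mailman/private/"


def classify_archive_url(url, mailman_host):
    """Return 'external', 'public', 'private' or 'other' for a search-result URL."""
    parts = urlparse(url)
    # A different host suggests a subscriber's own archive, not the Mailman server.
    if parts.hostname is None or parts.hostname.lower() != mailman_host.lower():
        return "external"
    if parts.path.startswith(PUBLIC_ALIAS):
        return "public"
    if parts.path.startswith(PRIVATE_ALIAS):
        return "private"
    # Same host but an unexpected path: possibly a misconfigured alias.
    return "other"
```

A result of "other" on the Mailman host is the case most worth investigating, since it suggests the archive files are reachable by some path that bypasses the CGI script.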
On a _properly_ configured system, any list that was created as private and has stayed private ever since can only be accessed over HTTP via Mailman's CGI script in $prefix/Mailman/Cgi/private.py. This script requires a subscribed list member's e-mail address and associated password before it allows access to the list archive.
If the web server concerned was misconfigured, so that it could serve the pages directly from the private archive storage in the file system via some other URL path rather than the proper CGI path of /mailman/private/<listname>/..., that could explain what is causing your problem.
If the lists concerned were public at some time, the indexer could have accessed them then. In that case the URL paths returned by the search engine would be of the form /pipermail/<listname>/..., and following the link should fail now that the list is private.
But can you access the actual archive mail file via the URL returned by the search without having a valid member id and associated password?
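One rough way to run that check is to request the URL with no cookies or credentials and look at what comes back. The Python sketch below is only a heuristic under stated assumptions: it treats any response that is not a 200, or that contains an HTML password field, as "protected". The function names are hypothetical, and the password-field test is an assumption about what a login page looks like, not something Mailman guarantees.

```python
import re
import urllib.error
import urllib.request

# Assumption: a login page contains an <input type="password"> field.
PASSWORD_FIELD = re.compile(r'<input[^>]*type\s*=\s*["\']?password', re.IGNORECASE)


def served_without_login(status, body):
    """Heuristic: True if the response looks like content rather than a login form."""
    if status != 200:
        return False
    return PASSWORD_FIELD.search(body) is None


def fetch_anonymously(url, timeout=10):
    """Fetch a URL with no cookies or credentials; return (status, body text)."""
    request = urllib.request.Request(url, headers={"User-Agent": "archive-check"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 401/403/404 and similar all count as "not served".
        return err.code, ""
```

Calling served_without_login(*fetch_anonymously(url)) on each URL the search engine returned gives a quick first answer; anything it flags should still be confirmed by hand in a browser with cookies cleared.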
The source of your problem will hinge in part on how the search engine indexers are crawling your web site. Is it pure arm's-length HTTP access?
One of the problems with indexing Mailman private list archives to provide legitimate search facilities is the cookie authentication scheme that the $prefix/Mailman/Cgi/private.py script uses to control access. The indexers for some search engines are not programmed to handle this type of authentication. For instance, to set up search of private list archives with the htdig search engine, one has to index them in the file space, i.e. the indexer has to access the archive files through the file system, and one has to give the indexer a rule for mapping the file space paths back to the URLs that are to be returned in subsequent search results.
Is it possible that such an access path has been set up on your system for indexing private archives and that the index information has 'leaked' onto a publicly available search engine?
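To illustrate the kind of mapping rule meant here, a minimal Python sketch follows. It is not htdig configuration; it only shows the path-to-URL translation such a rule has to perform. The function name, the archive root /usr/local/mailman/archives/private and the host lists.example.org are all hypothetical.

```python
from pathlib import PurePosixPath


def file_path_to_url(file_path, archive_root, url_base):
    """Map an archive file path under archive_root to the URL a search result should show."""
    # relative_to() raises ValueError if file_path is not under archive_root.
    relative = PurePosixPath(file_path).relative_to(archive_root)
    return url_base.rstrip("/") + "/" + relative.as_posix()
```

If the URL base handed to such a rule points at the private CGI path, the search results stay behind the login. If it points at any directly served path, the index becomes a route around the authentication.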
I see you have a search facility on your site. How is this implemented? Could this be the source of the leakage from the private mail archives to other search engines? How does your site search facility (it appears to be delivered by http://search.atomz.com/search/) do its indexing?
Also, I see your site makes use of PHP. No criticism intended, but with PHP the tools to drive a coach and horses through Mailman's attempts at archive security are ready to hand.
> Is there a way to ensure that this is not available? Also how do you
Yes:
- Configure your Mailman and associated web server correctly.
- Control the setup of any local archive search facility you set up, to ensure the information it holds does not leak to outside search engines.
- Add a restriction on access for /mailman/ to your site's robots.txt. Yes, I know it is only advisory, but some search engine crawlers honor it.
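For the robots.txt suggestion, the following Python sketch shows one possible rule and uses the standard library's urllib.robotparser to confirm how a compliant crawler would interpret it. The robots.txt content, the helper name, the user agent ExampleBot and the host are illustrative assumptions, not taken from any particular site.

```python
from urllib.robotparser import RobotFileParser

# One possible robots.txt restricting /mailman/ for all crawlers (illustrative).
ROBOTS_TXT = """\
User-agent: *
Disallow: /mailman/
"""


def is_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if a crawler honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

This only checks what a well-behaved crawler would do. It is no substitute for the access controls in the first two points.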
> get read of messages in the archives?
I assume you meant "get rid of messages in the archives". If so, yes:
1. Edit the list's raw mailbox file, $prefix/archives/private/<listname>.mbox, to remove the offending messages.
2. Rebuild the archive using the command $prefix/bin/arch <listname>
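The first step can be done in a text editor, but for more than a handful of messages a script may be less error-prone. Below is a minimal sketch using Python's standard mailbox module. The function name and the predicate are hypothetical, and it assumes you have made a backup copy of the .mbox file first.

```python
import mailbox


def remove_messages(mbox_path, should_remove):
    """Delete every message for which should_remove(message) is true; return the count."""
    box = mailbox.mbox(mbox_path)
    removed = 0
    box.lock()
    try:
        # Materialise the items first so the mailbox is not modified while iterating.
        for key, message in list(box.items()):
            if should_remove(message):
                box.remove(key)
                removed += 1
    finally:
        # close() flushes the changes to disk and releases the lock.
        box.close()
    return removed
```

For example, remove_messages(path, lambda m: m["Message-ID"] == "<some-id@example.org>") would drop a single message by its Message-ID header. The rebuild in the second step is still needed afterwards so that the HTML pages match the pruned mailbox.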

Participants (2): Nancy Montano, Richard Barrett