[Mailman-Users] Archive merge and search

Mark Sapiro mark at msapiro.net
Tue Nov 18 17:31:18 CET 2014


On 11/18/2014 06:35 AM, Hal wrote:

> So for any new messages from now on I want my list to work this way:
> 
> 1) HTML formatted postings should be converted to plain text before
> reaching other members.


In Mailman's Content filtering section you want the following:

filter_content: Yes
filter_mime_types: empty
pass_mime_types:
    multipart
    text/plain
    text/html
filter_filename_extensions: irrelevant, default list OK
pass_filename_extensions: empty
collapse_alternatives: Yes
convert_html_to_plaintext: Yes
filter_action: as desired; this applies only to a message that
contains no text/plain or text/html part.
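
To illustrate the effect of that pass list, here is a minimal sketch using
only Python's standard email package. It is not Mailman's own code; the
names prune, is_passed, build_sample and leaf_types and the sample message
are made up for illustration. It assumes a part is kept when either its
full type or its main type appears in pass_mime_types.

```python
from email.message import EmailMessage

# Pass list from the settings above: a bare main type such as "multipart"
# matches every subtype, while "text/plain" matches only that exact type.
PASS_TYPES = {"multipart", "text/plain", "text/html"}


def is_passed(part):
    """Return True if the part's full type or main type is in the pass list."""
    return (part.get_content_type() in PASS_TYPES
            or part.get_content_maintype() in PASS_TYPES)


def prune(msg):
    """Recursively drop sub-parts whose MIME type is not in the pass list."""
    if not msg.is_multipart():
        return
    kept = [p for p in msg.get_payload() if is_passed(p)]
    for p in kept:
        prune(p)
    msg.set_payload(kept)


def build_sample():
    """Build a plain + HTML post carrying a zip attachment (made-up data)."""
    msg = EmailMessage()
    msg["Subject"] = "demo"
    msg.set_content("Hello list")
    msg.add_alternative("<p>Hello list</p>", subtype="html")
    msg.add_attachment(b"PK\x03\x04", maintype="application",
                       subtype="zip", filename="files.zip")
    return msg


def leaf_types(msg):
    """List the content types of all non-multipart parts."""
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


if __name__ == "__main__":
    sample = build_sample()
    print("before:", leaf_types(sample))
    prune(sample)
    print("after: ", leaf_types(sample))
```

The sketch covers only the removal of parts. The later steps named in the
settings (collapse_alternatives and convert_html_to_plaintext) are not
modeled here.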


> 2) HTML formatted postings can retain their formatting for the archive
> (I believe the archive is in the HTML format anyway?), but if it only
> archives whatever is sent to list members I don't mind. The important
> thing is that members receive plain text messages.


What will be archived is what was delivered to list members.


> 3) Since many people have their email programs set by default to send in
> HTML these days I just want Mailman to do its filtering, then continue
> by sending the posting as plain text without any moderator request or
> alerting the sender.


Settings in 1) do that.


> 4) I'd like to block all attachements (list members should only receive
> plain text files).
> 40kb is already set for Max_message_size (in "General options" within
> the list administration web interface) which seems to have worked fine
> (as far as I know).


'attachments' is an imprecise word, but settings in 1) will do what you
want.


> Furthermore I understand that Filter_filename_extensions (in the
> "Content filtering" section) in addition removes any attachements based
> on specific filename *extensions* regardless of their file size?
> 
> I see exe, bat, cmd and a bunch of other filetypes I've never heard of
> (geared towards Windows/DOS users I suppose -I'm a Mac user) are listed,
> but I suppose I could block .zip and those pesky .vcf/.vcard and
> "winmail.dat" files the same way.


They will all be removed anyway unless they have a MIME Content-Type of
text/plain or text/html, which is unlikely.


> When such extensions are encountered, are they just removed from the
> messages while the message posting itself is passed on to list members,
> or is the whole posting stopped for approval first?


They are just removed.


> I'm thinking out loud here, so feel free to chime in for better ideas,
> but I'm thinking there are two kind of attachement groups which need
> different actions to be taken:
> 
> Deliberate attachements: zip files, gif/jpg images etc. which a poster
> wants to share. The message/attachement should be stopped from reaching
> the list and an email sent to the poster with a "your message has been
> blocked. Please resend your message, this time without an attachement"
> type of message.


Content filtering will just remove them.


> Accidental attachements: winmail.dat, .vcf or .vcard an so on. Many
> users don't know (as with HTML postings) that their email program is set
> up to send this stuff. IMHO those attachements don't have anything to do
> with the actual content of their postings, so Mailman should just remove
> the attachement(s), then pass on the rest of the message to the list.


winmail.dat is really more of a 'deliberate' attachment. It is a message
part with MIME type application/vnd.ms-tnef which is a Microsoft
Outlook/Exchange 'transport neutral encapsulation format' way of
encoding attachments.

.vcf and .vcard 'attachments' have Content-Type text/vcard or possibly
application/vcard+json or application/vcard+xml.

In any case, since these do not have Content-Type text/plain or
text/html, they will be removed.
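
As a rough check of the types mentioned in this thread against the pass
list, the following sketch compares each Content-Type with the list. The
function name passes is hypothetical, and the matching rule (full type or
main type must appear in the list) is stated here as an assumption about
how the pass list is applied.

```python
# Pass list taken from the settings discussed in this thread.
PASS_TYPES = {"multipart", "text/plain", "text/html"}


def passes(content_type, pass_types=PASS_TYPES):
    """True if the full type or its main type appears in the pass list."""
    content_type = content_type.lower()
    maintype = content_type.split("/", 1)[0]
    return content_type in pass_types or maintype in pass_types


if __name__ == "__main__":
    for ctype in ("text/plain", "text/html", "multipart/alternative",
                  "text/vcard", "application/vcard+json",
                  "application/vnd.ms-tnef", "application/zip"):
        print(f"{ctype:28} {'kept' if passes(ctype) else 'removed'}")
```

Note that text/vcard is not kept even though its main type is text,
because the list names text/plain and text/html rather than a bare text.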


> Having said that, have I understood things correctly by setting up my
> "Content filtering" options as follows? (based on what you've said and
> what I've read here:
> http://wiki.list.org/pages/viewpage.action?pageId=4030684):
> 
> Edit_filter_content:    YES
> Filter_mime_types:    (left blank)
> Pass_mime_types:    multipart
>             message/rfc822
>             text/plain
>             text/html
> filter_filename_ext.:    exe
>             bat
>             cmd
>             com
>             pif
>             scr
>             vbs
>             cpl
>             zip
>             dat
>             vcf
>             vcard
> pass_filename_ext.:    (left blank)
> Collapse_alternatives:    YES
> conv_html_to_plaintext:    YES
> Filter_action:        DISCARD


Maybe. The only difference between this and 1) above is message/rfc822.

If I forward a message to your list as an 'attachment', do you want to
remove that forwarded message from my post or do you want to accept the
plain text or possibly HTML converted to plain text parts of that
forwarded message?

If you want to remove it, leave message/rfc822 out of the list. If you
want to accept the result of applying content filtering to it, put
message/rfc822 in the list.


>>> Failing that, is there a way I could have the (currently private)
>>> archive have a filter before HTTP access?
>>
>> You could create your own CGI or other web process to access the
>> archives and present them any way you want.
> 
> Being ignorant on the subject, what kind of pre-written CGI script
> should I try to find (i.e. "search engine to web archive gateway" or
> something like that?).


I doubt very much that you'll find anything pre-written that will meet
your needs.


> You previously suggested htdig (http://www.htdig.org/) with your patches
> for allowing my visitors to search through both the Mailman archives and
> my website.


To be clear, htdig is a search engine that can index and search all or a
portion of your web site. The patches developed by Richard Barrett and
currently supported by me add a search form to the main archive table of
contents page for a list and invoke htdig to do the search. This search
is only of the archive of that list.


> Assuming this is a more ready-to-use solution than the other
> search engines out there,


For a general search of your web site, probably not a good assumption.


> are there features I will be missing out on
> (e.g. the ability to use CSS and Ajax for making its search results
> appear more in line with the rest of my website) and is it still secure?
> I've read that malicious code can sometimes be entered as search phrases
> and damage the database if the search engine isn't using "parametrized
> queries".


I don't think that malicious search phrases are an issue with htdig, but
I don't know for sure that they aren't.

It probably wouldn't be too difficult to incorporate CSS into the search
results pages, but I've never tried it. Ajax might be more problematic.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

