Boilerplate and content filtering [was: Introduction and Project Discussion]
Sreyanth writes:
Also, I would like to hear more about : Boilerplate stripper AND Better content-filtering / handling error messages. Boilerplate stripping is trivial to understand. But, can anyone elaborate on Better content-filtering / handling error messages?
But boilerplate stripping is not necessarily trivial to implement, because it's not always clear what boilerplate is. I think it might be a good idea to save it off and provide a link rather than discard it, which leads to interesting questions of storage, shared links for true boilerplate (storage compression of repeatedly encountered text, yes, but more important the link will turn purple so you don't need to click on it in the next message from that user!), and user interface in general.
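One way to get the "shared link for true boilerplate" effect is to key the saved text by a hash of its content, so that every message carrying the same footer points at the same URL. This is a minimal sketch only; the `BoilerplateStore` class, the in-memory dict, and the `lists.example.org` URL are hypothetical and not part of Mailman.

```python
import hashlib


class BoilerplateStore:
    """Content-addressed store: identical boilerplate maps to one link."""

    def __init__(self, base_url="https://lists.example.org/boilerplate/"):
        self.base_url = base_url  # hypothetical archive URL
        self.blobs = {}           # digest -> normalized text

    def save(self, text):
        # Collapse whitespace so trivially different copies share a key.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        # Store the text once; later copies reuse the existing entry.
        self.blobs.setdefault(digest, normalized)
        return self.base_url + digest


store = BoilerplateStore()
a = store.save("Sent from my phone.\nPlease excuse typos.")
b = store.save("Sent from my phone.  Please excuse typos.")
```

Because both calls return the same URL, a browser would mark the link as visited the second time around, which is the point made above.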
Content filtering is mostly going to be about MIME handling: choice of the appropriate text/* part and things like that, removing images/video/etc where the list prohibits them, converting HTML/word processor attachments to plain text, removing MIME parts whose Content-Type doesn't match filename or perhaps file(1) magic in the content, etc.
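A rough sketch of that kind of MIME pass, assuming the modern Python 3 standard-library `email` package rather than Mailman's own handlers. The `BLOCKED_MAINTYPES` policy and both function names are made up for illustration, and the filename check uses `mimetypes` instead of file(1) magic.

```python
import mimetypes
from email.message import EmailMessage

BLOCKED_MAINTYPES = {"image", "video"}  # assumed per-list policy


def filter_parts(msg):
    """Return (kept, dropped) for the leaf MIME parts of msg."""
    kept, dropped = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers are not content themselves
        ctype = part.get_content_type()
        filename = part.get_filename()
        guessed = mimetypes.guess_type(filename)[0] if filename else None
        if part.get_content_maintype() in BLOCKED_MAINTYPES:
            dropped.append((ctype, "prohibited type"))
        elif guessed is not None and guessed != ctype:
            dropped.append((ctype, "Content-Type does not match filename"))
        else:
            kept.append(part)
    return kept, dropped


def preferred_text(msg):
    """Pick the text/plain body if present, else the text/html one."""
    body = msg.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else None


# Usage with a small hand-built message.
msg = EmailMessage()
msg.set_content("hello")
msg.add_alternative("<p>hello</p>", subtype="html")
msg.add_attachment(b"\x89PNG", maintype="image", subtype="png",
                   filename="a.png")
msg.add_attachment(b"data", maintype="application", subtype="pdf",
                   filename="notes.txt")
kept, dropped = filter_parts(msg)
```

Here the image is dropped by policy and the "PDF" named `notes.txt` is dropped for the type mismatch; a real filter would more likely hold or flag such parts than silently discard them.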
I can also imagine content filtering (or scoring!) based on word choice ("WTF" OK, spelling it out not :-). Also content filtering based on stripping out the quoted material from top-posts and replacing it with links (after checking that the quoted material is indeed available in the archive!). All coming with on/off options, at least for those who remember the IBM 360saurus and other dinosaurs and still prefer mail to web. :-)
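The top-post idea could look roughly like the sketch below. It assumes a plain-text body with ">" quote markers and an "On ... wrote:" attribution line; `archive_lookup` is a hypothetical callable that returns an archive URL for the quoted text, or None when the text is not in the archive.

```python
import re

ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_top_post_quote(body, archive_lookup):
    """Replace a trailing quoted block with a link, if it is archived."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if not ATTRIBUTION.match(line):
            continue
        tail = [l for l in lines[i + 1:] if l.strip()]
        # Only treat it as a top-post if everything after the
        # attribution line is quoted (no interleaved replies).
        if tail and all(l.startswith(">") for l in tail):
            quoted = "\n".join(l.lstrip("> ").rstrip() for l in tail)
            url = archive_lookup(quoted)
            if url is not None:
                reply = "\n".join(lines[:i]).rstrip()
                return reply + "\n[quoted message: " + url + "]"
        return body  # interleaved or unarchived: leave untouched
    return body


body = (
    "Thanks, that works.\n"
    "\n"
    "On Mon, Apr 15, 2013, Alice wrote:\n"
    "> Try the new filter.\n"
    "> It is in trunk.\n"
)
archive = {"Try the new filter.\nIt is in trunk.": "https://lists.example.org/msg/1"}
out = strip_top_post_quote(body, archive.get)
```

The message is left alone whenever the lookup fails, which matches the "after checking that the quoted material is indeed available" condition.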
Error messages (I think this means delivery status notifications (DSN) from mail servers) are a similar kind of problem to text-based filtering, though somewhat more stylized.
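Because DSNs are stylized, a well-formed one can be read with the standard-library `email` parser, which exposes each block of a message/delivery-status part as a small header-only message. The following is a sketch under that assumption; `failed_recipients` and the sample bounce are illustrative, and real-world bounces are often far less regular.

```python
from email import message_from_string, policy


def failed_recipients(raw):
    """Return (address, status) pairs for recipients reported as failed."""
    msg = message_from_string(raw, policy=policy.default)
    failures = []
    for part in msg.walk():
        if part.get_content_type() != "message/delivery-status":
            continue
        # The parser stores each header block as a sub-message.
        for block in part.get_payload():
            action = str(block.get("Action", "")).strip().lower()
            if action != "failed":
                continue
            recipient = str(block.get("Final-Recipient", ""))
            # The field looks like "rfc822; user@example.com".
            address = recipient.split(";", 1)[-1].strip()
            status = str(block.get("Status", "")).strip()
            failures.append((address, status))
    return failures


sample_dsn = (
    "From: MAILER-DAEMON@mx.example.org\n"
    "To: list-bounces@example.org\n"
    "Subject: Undelivered Mail\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/report; report-type=delivery-status;\n"
    ' boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "Delivery failed.\n"
    "--XX\n"
    "Content-Type: message/delivery-status\n"
    "\n"
    "Reporting-MTA: dns; mx.example.org\n"
    "\n"
    "Final-Recipient: rfc822; nobody@example.com\n"
    "Action: failed\n"
    "Status: 5.1.1\n"
    "\n"
    "--XX--\n"
)
failures = failed_recipients(sample_dsn)
```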
Steve
- Stephen J. Turnbull <stephen@xemacs.org>:
Content filtering is mostly going to be about MIME handling: choice of the appropriate text/* part and things like that, removing images/video/etc where the list prohibits them, converting HTML/ wordprocessor attachments to plain text, removing MIME parts whose Content-Type doesn't match filename or perhaps file(1) magic in the content, etc.
Just to mention it: IF we are going to add MILTER functionality, a MILTER would be perfect to do MIME handling.
p@rick
-- [*] sys4 AG
http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München
Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer Aufsichtsratsvorsitzender: Joerg Heidrich
On Mon, Apr 15, 2013 at 11:56 AM, Patrick Ben Koetter <p@sys4.de> wrote:
Just to mention it: IF we are going to add MILTER functionality, a MILTER would be perfect to do MIME handling.
If we are going to add MILTER functionality, then at the very least anti-spam filters could be implemented there too, couldn't they? A few days ago we were discussing MILTERs in the anti-spam context. Some anti-spam AND anti-abuse handling could be implemented at this level! I have implemented a binary Bayesian classifier which classifies an email as either spam or not spam. Using the main keywords in the email as feature vectors, and learning from the emails reported as spam in the logs, we could implement this. After classifying an email as spam, we could display a line such as "This may be spam. Please be careful while clicking on links or replying to this email with sensitive information!". That way the MILTER would be doing more than just the MIME handling. Correct me if I am wrong. :)
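For readers unfamiliar with the technique, a binary naive Bayes classifier over message keywords can be sketched as below. This is not the implementation mentioned above, which is not shown in the thread; the class name, the tokenizer, and the toy training data are all assumptions.

```python
import math
import re
from collections import Counter

WARNING = "This may be spam. Please be careful while clicking on links."


def tokens(text):
    """Lowercase word tokens; a stand-in for real keyword extraction."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokens(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        # Requires at least one training message per label.
        counts = self.word_counts[label]
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(counts.values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokens(text):
            # Laplace (add-one) smoothing avoids log(0) for unseen words.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        return score

    def is_spam(self, text):
        return self.log_score(text, "spam") > self.log_score(text, "ham")


clf = NaiveBayes()
clf.train("win cash now", "spam")
clf.train("cheap pills win prize", "spam")
clf.train("patch for the mime filter", "ham")
clf.train("meeting notes for the list", "ham")

suspect = "win a prize now"
banner = WARNING if clf.is_spam(suspect) else ""
```

In a real deployment the training data would come from messages reported as spam, and the decision threshold would need tuning to keep false positives rare.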
Mailman-Developers mailing list Mailman-Developers@python.org http://mail.python.org/mailman/listinfo/mailman-developers Mailman FAQ: http://wiki.list.org/x/AgA3 Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/ Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/sreyanth%40gmail.c...
Security Policy: http://wiki.list.org/x/QIA9
-- *Yours Sincerely* * * *Mora Sreyantha Chary* *Computer Engineering '14* *National Institute of Technology Karnataka* *Surathkal, India 575 025*
On Mon, Apr 15, 2013 at 10:01 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
But boilerplate stripping is not necessarily trivial to implement, because it's not always clear what boilerplate is. I think it might be a good idea to save it off and provide a link rather than discard it, which leads to interesting questions of storage, shared links for true boilerplate (storage compression of repeatedly encountered text, yes, but more important the link will turn purple so you don't need to click on it in the next message from that user!), and user interface in general.
Yep! But how about this: just hide the boilerplate in the email and give a link to click. When it is clicked, use JavaScript to unhide the boilerplate. This would not require any separate storage. Please tell me if this is a bad idea!
Content filtering is mostly going to be about MIME handling: choice of the appropriate text/* part and things like that, removing images/video/etc where the list prohibits them, converting HTML/ wordprocessor attachments to plain text, removing MIME parts whose Content-Type doesn't match filename or perhaps file(1) magic in the content, etc.
This is cool! I have converted HTML attachments to plain text in one of my projects, but I have never worked on word processor attachments (there should be some Python library for that; I will check!). For this, we will additionally have to provide the admin with various options, such as prohibiting images, videos, etc. Alternatively, the admin could be given an option along the lines of the boilerplate idea you proposed: store the removed parts somewhere and display them as attachments to the email. Correct me if I am on the wrong path!
I can also imagine content filtering (or scoring!) based on word choice ("WTF" OK, spelling it out not :-). Also content filtering based on stripping out the quoted from top-posts and replacing them with links (after checking that the quoted material is indeed available in the archive!) All coming with on/off options, at least for those who remember the IBM 360saurus and other dinosaurs and still prefer mail to web. :-)
So we will have to index the archives properly, so that even if the post is not quoted properly in the email, it can still be linked to the appropriate material. This will be cool, won't it?
Error messages (I think this means delivery status notifications (DSN) from mail servers) are a similar kind of problem to text-based filtering, though somewhat more stylized.
Okay! So these error messages are to be reported to the sender in a more understandable fashion. Is this what you meant?
Steve
-- *Yours Sincerely* * * *Mora Sreyantha Chary* *Computer Engineering '14* *National Institute of Technology Karnataka* *Surathkal, India 575 025*
participants (3)
- Patrick Ben Koetter
- Sreyanth
- Stephen J. Turnbull