Re: [Mailman-Developers] To be more precise on the Web errors...

Dan Mick

Oct. 22, 2001

8:05 p.m.

I still seem OK without that latest MailList.py patch.

Hi,

it seems that accessing via Web the main /mailman/listinfo page I get the error below, but if I access directly one (and only this, it seems) list via /mailman/listinfo/listname all works fine :-) I can perform all the administrative operations! :-) I don't know why only for this list, anyway...

Here is the Traceback for the lists that don't work:

Traceback (most recent call last):
  File "/home/mailman/scripts/driver", line 96, in run_main
    main()
  File "/home/mailman/Mailman/Cgi/admin.py", line 62, in main
    mlist = MailList.MailList(listname, lock=0)
  File "/home/mailman/Mailman/MailList.py", line 98, in __init__
    self.Load()
  File "/home/mailman/Mailman/MailList.py", line 531, in Load
    self.CheckVersion(dict)
  File "/home/mailman/Mailman/MailList.py", line 548, in CheckVersion
    self.Lock()
  File "/home/mailman/Mailman/MailList.py", line 151, in Lock
    self.Load()
  File "/home/mailman/Mailman/MailList.py", line 531, in Load
    self.CheckVersion(dict)
  File "/home/mailman/Mailman/MailList.py", line 548, in CheckVersion
    self.Lock()
  File "/home/mailman/Mailman/MailList.py", line 147, in Lock
    self.__lock.lock(timeout)
  File "/home/mailman/Mailman/LockFile.py", line 268, in lock
    raise AlreadyLockedError
AlreadyLockedError:

--luca

Mailman-Developers mailing list Mailman-Developers@python.org http://mail.python.org/mailman/listinfo/mailman-developers


barry＠zope.com

October 2001

8:28 p.m.

New subject: To be more precise on the Web errors...

I'm a big dummy. MailList.Lock() can't be called on an already locked list. So this should do the trick (apply on top of last patch, or just "cvs up" in a few minutes).

-Barry

-------------------- snip snip --------------------
Index: MailList.py
===================================================================
RCS file: /cvsroot/mailman/mailman/Mailman/MailList.py,v
retrieving revision 2.46
diff -u -r2.46 MailList.py
--- MailList.py 2001/10/22 19:20:37 2.46
+++ MailList.py 2001/10/22 20:27:08
@@ -545,7 +545,8 @@
         self.Load(check_version=0)
         # We must hold the list lock in order to update the schema
         waslocked = self.Locked()
-        self.Lock()
+        if not waslocked:
+            self.Lock()
         try:
             from versions import Update
             Update(self, stored_state)
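The "waslocked" guard in the patch can be tried out in isolation. What follows is only a minimal sketch: ToyList is a hypothetical stand-in with a non-reentrant lock, not Mailman's MailList, and the release step in the finally clause is an assumption (the hunk above only shows the acquire side).

```python
class AlreadyLockedError(Exception):
    """Raised when a non-reentrant lock is acquired a second time."""


class ToyList:
    """Hypothetical stand-in for a list object guarded by a non-reentrant lock."""

    def __init__(self):
        self._locked = False
        self.schema_version = 1

    def Locked(self):
        return self._locked

    def Lock(self):
        if self._locked:
            raise AlreadyLockedError
        self._locked = True

    def Unlock(self):
        self._locked = False

    def update_schema(self):
        # Take the lock only if the caller does not already hold it,
        # and (assumption) release it only if we were the ones who took it.
        waslocked = self.Locked()
        if not waslocked:
            self.Lock()
        try:
            self.schema_version += 1
        finally:
            if not waslocked:
                self.Unlock()
```

Calling update_schema() on an unlocked ToyList leaves it unlocked afterwards, and calling it on one the caller has already locked neither raises AlreadyLockedError nor drops the caller's lock.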

Luca Maranzano

9:01 p.m.

New subject: Ok, it works! :) Re: To be more precise on the Web errors...

Great Barry! :)

It seems that all is OK now, both gate_news and the Web U/I :)

I've still to report this:

after issuing /etc/init.d/mailman start

I got the following to the terminal:

Traceback (most recent call last):
  File "/home/mailman/Mailman/Archiver/Archiver.py", line 183, in ArchiveMail
    h.processUnixMailbox(f, HyperArch.Article)
  File "/home/mailman/Mailman/Archiver/pipermail.py", line 525, in processUnixMailbox
    a = articleClass(m, self.sequence)
  File "/home/mailman/Mailman/Archiver/HyperArch.py", line 150, in __init__
    self.__super_init(message, sequence, keepHeaders)
  File "/home/mailman/Mailman/Archiver/pipermail.py", line 210, in __init__
    s = StringIO(message.get_payload())
TypeError: expected string, list found

Hoping to be useful :)

--luca

barry＠zope.com

11:18 p.m.

New subject: Ok, it works! :) Re: To be more precise on the Web errors...

"LM" == Luca Maranzano <liuk@publinet.it> writes:

LM> It seems that all is OK now, both gate_news and the Web U/I :)

LM> I've still to report this:

LM> after issuing /etc/init.d/mailman start

| "/home/mailman/Mailman/Archiver/pipermail.py", line 210, in
| __init__
|     s = StringIO(message.get_payload())
| TypeError: expected string, list found

Fixing this is a bit more complicated than I thought. Watch CVS tonight (hopefully). I might have to spin an alpha4 in the next couple of days though.
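For readers wondering why the TypeError above appears: for a multipart message, get_payload() returns a list of sub-messages rather than a string. The sketch below shows this with the email package from today's Python standard library, which is an assumption; it is not the 2001 module from the misc/ directory, and body_text is a hypothetical helper, not Pipermail code.

```python
from email import message_from_string

plain = message_from_string("Subject: hi\n\njust text\n")

multi = message_from_string(
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "part one\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>part two</p>\n"
    "--XX--\n"
)


def body_text(msg):
    """Collect body text without assuming get_payload() returns a string."""
    if msg.is_multipart():
        # For multipart messages the payload is a list of sub-messages.
        return "".join(body_text(part) for part in msg.get_payload())
    return msg.get_payload()
```

Passing plain.get_payload() to a string-only consumer works, while multi.get_payload() is a list and would fail in the same way as the StringIO call in the traceback.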

-Barry

barry＠zope.com

7:08 a.m.

New subject: Ok, it works! :) Re: To be more precise on the Web errors...

"LM" == Luca Maranzano <liuk@publinet.it> writes:

LM> Great Barry! :)

LM> It seems that all is OK now, both gate_news and the Web U/I :)

If you're watching the CVS log messages, you might see some checkins to address the problems with Pipermail in 2.1a3. Had an all day meeting today, and I'm beat so I'll email more about it tomorrow, but I think I have a neat solution that will also address Ben's patch to clean attachments out of the archives, and may serve as a basis for a built-in de-mimer.

More tomorrow, er, later today. -Barry

barry＠zope.com

4:48 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

"BAW" == Barry A Warsaw <barry@zope.com> writes:

BAW> If you're watching the CVS log messages, you might see some
BAW> checkins to address the problems with Pipermail in 2.1a3.
BAW> Had an all day meeting today, and I'm beat so I'll email more
BAW> about it tomorrow, but I think I have a neat solution that
BAW> will also address Ben's patch to clean attachments out of the
BAW> archives, and may serve as a basis for a built-in de-mimer.

So here's the scoop. I've been thinking about Ben Gertzfield's code to sanitize the archives, and I've been mulling about the de-mime stuff. It all came to a head when 2.1a3 broke archiving for multipart messages.

Here's what I've now got in cvs and it seems to work fairly well. Only more testing will tell for sure.

There's a new handler module called Scrubber.py, but it's not in the primary pipeline. Only Pipermail is going to call it, and that via the new mm_cfg.py/Defaults.py variable ARCHIVE_SCRUBBER.

This module hardcodes the following de-mime decisions:

- text/plain parts are passed through unchanged

- text/html parts are removed completely. If the outer message is of type text/html then the whole message is discarded (i.e. DiscardMessage is raised).

- For all other non-multipart parts, we treat them as "attachments" by pulling the decoded payload out of the message, storing it in a file inside the list's private archive directory (e.g. archives/private/mylist/attachments) and rewriting the payload of the part to include a description of the attachment.

  Included in this description is a url to the attachment file, which Pipermail will hyperlink. One drawback here is that if archives are switched from public to private, or vice versa, all the attachment urls will break. But you could re-run bin/arch to regenerate the whole thing -- the key being that Scrubber works only on a copy of the message being prepped for the archiver, /not/ on the message being saved in the mbox.

- multiparts are ignored for the first pass, but are recursed to perform the above cleaning.

Then the entire scrubbed message is converted into a flat message, where only the headers are parsed and the body is slurped in one gulp; it isn't parsed recursively. Along the way, we throw out the headers for any internal parts, and we play games with the inter-part boundary strings so they are more useful (yes, this is a kludge).
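The de-mime decisions and the flattening step described above can be approximated as follows. This is only a sketch: it uses the modern standard-library email package rather than the 2001 module, it does not save attachments to disk or generate URLs, and scrub() here is a hypothetical illustration, not the code in Scrubber.py.

```python
from email.message import EmailMessage


class DiscardMessage(Exception):
    """Signals that the whole message should be dropped."""


def scrub(msg):
    """Return a flat text body following the hardcoded decisions above."""
    # An outer text/html message is discarded outright.
    if msg.get_content_type() == "text/html":
        raise DiscardMessage
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            # Containers carry no content of their own; walk() recurses.
            continue
        ctype = part.get_content_type()
        if ctype == "text/plain":
            chunks.append(part.get_content())
        elif ctype == "text/html":
            chunks.append("An HTML attachment was scrubbed and removed\n")
        else:
            payload = part.get_payload(decode=True)
            chunks.append(
                "A non-text attachment was scrubbed...\n"
                "Type: %s\nSize: %d bytes\n" % (ctype, len(payload))
            )
    return "".join(chunks)
```

A message built with set_content() plus add_attachment() calls comes back as one string, with each non-text part replaced by a short description.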

There's even more kludgery involved to get Pipermail to archive scrubbed messages without having to rewrite huge chunks of inscrutable code. But it seems to work.

Now, the interesting thing is that Scrubber.py is written so that it /could/ be used in the main pipeline. E.g. it supports the proper signature and semantics for use in the pipeline. But I'm not adding it there for now primarily because it isn't configurable via the web. All its decisions above are hardcoded because getting the u/i right is more work than I want to do right now.

But if you were interested in mainlining Scrubber.py, here's how you might do it: Add it to GLOBAL_PIPELINE in your mm_cfg.py. I would suggest sticking it after ToArchive so that the mbox gets the original unscrubbed message (this lets you adjust the scrubber's behavior for archive purposes and regenerate from the raw mbox). In fact, what I'd do is move ToArchive to just after the Hold module, and stick Scrubber just between Hold and Tagger. This is untested.
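One reading of that reordering, sketched as plain list manipulation: ToArchive moves to directly after Hold and Scrubber follows it, so the mbox still receives the unscrubbed message. The pipeline shown is an abbreviated, illustrative list (real GLOBAL_PIPELINE contents vary by version), and mainline_scrubber is a hypothetical helper. Like the suggestion itself, it is untested against a real installation.

```python
def mainline_scrubber(pipeline):
    """Return a copy with ToArchive right after Hold, followed by Scrubber."""
    result = [name for name in pipeline if name not in ("ToArchive", "Scrubber")]
    at = result.index("Hold") + 1
    # Archive first (mbox keeps the original), then scrub for later handlers.
    result[at:at] = ["ToArchive", "Scrubber"]
    return result


# Abbreviated, illustrative pipeline; not the full default list.
example_pipeline = [
    "SpamDetect", "Approve", "Hold", "Tagger",
    "CookHeaders", "ToArchive", "ToOutgoing",
]
```

In an mm_cfg.py this would amount to something like GLOBAL_PIPELINE = mainline_scrubber(GLOBAL_PIPELINE), assuming the defaults have already been imported there.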

I think this will give us a foothold into providing a cleaner archive with Pipermail, and to experimenting with Mailman supported de-mime-ification. Probably the best that'll happen for MM2.1.

Enjoy, -Barry

Ben Gertzfield

3:01 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

Barry, thanks lots for the great work! I love how the email module has turned out, and the function names you chose really ended up making sense.

Anyway, I installed the latest mailman CVS and the email module from the misc/ directory, and successfully created a list with the new install.

Here's what happened when I posted a message with a GIF, an HTML part, and a JPEG part (in that order) to the list:

[root@nausicaa:/usr/local/mailman/archives/private/test]# ls -l attachments
total 16
-rw-rw-rw- 1 qmaild qmail 10404 Oct 26 11:27 attachment-0001.gif
-rw-rw-rw- 1 qmaild qmail 29 Oct 26 11:27 attachments.pck

[^^^^^^^^ NOTE NOTE NOTE: These arguably should *not* be mode 666. :) ]

[root@nausicaa:/usr/local/mailman/archives/private/test]# cat index.html
[SNIP]
<P>Currently, there are no archives. </P>
[SNIP]

The offending message is at:

http://nausicaa.interq.or.jp/pipermail/test.mbox/test.mbox

I found the bug; you put in code to ignore all HTML attachments (whether this is a good idea or not is up to you ;) but the code in Scrubber.process() suggests that you started coding in an 'outer' attachment detection, but didn't fully implement it.

Here's a patch that actually throws out all-HTML emails, but just removes HTML parts.

Actually, why don't we just decode HTML attachments like any other, and let the user beware if they want to click on it? There are lots of legitimate reasons to allow HTML attachments. I can't think of any to allow all-HTML messages. *grin* We could treat all-HTML messages in the same way, just provide a link and let the user beware if they click on it.

The patch also adds a filename to the replacement payload, so that users can have an idea of what they're going to see if a description was not provided (VERY common).

Finally (and this part of the patch, I'm not quite sure if it's the right solution), we add http://mlist.host_name to the beginning of the URL returned by Scrubber.save_attachment. Why? Because pipermail sees the string "/pipermail/listname/attachments/attachment-0001.gif" and doesn't (of course) realize it's a URL!

The patch does not address the mode 666 issue. I don't know where it should be set, but you need to make the umask set to make these files not be readable/writable by others..

Anyway, the output should be cleaned up a bit, but Barry, this is a great leap forward for Mailman's MIME handling! I'll be working on this more later. Thanks a lot!

Index: Scrubber.py
===================================================================
RCS file: /cvsroot/mailman/mailman/Mailman/Handlers/Scrubber.py,v
retrieving revision 2.1
diff -u -r2.1 Scrubber.py
--- Scrubber.py 2001/10/25 04:10:23 2.1
+++ Scrubber.py 2001/10/26 03:00:41
@@ -60,7 +60,6 @@
 def process(mlist, msg, msgdata=None):
-    outer = 1
     for part in msg.walk():
         # If the part is text/plain, we leave it alone
         if part.get_type('text/plain') == 'text/plain':
@@ -70,7 +69,7 @@
         # whole message is HTML, just discard the entire thing. Otherwise,
         # just add an indication that the HTML part was removed.
         if part.get_type() == 'text/html':
-            if outer:
+            if not msg.is_multipart():
                 raise DiscardMessage
             part.set_payload(_("An HTML attachment was scrubbed and removed"))
         # If the message isn't a multipart, then we'll strip it out as an
@@ -82,9 +81,11 @@
             size = len(payload)
             url = save_attachment(mlist, part)
             desc = part.get('content-description', _('not available'))
+            filename = part.get_filename(_('not available'))
             part.set_payload(_("""\
 A non-text attachment was scrubbed...
 Type: %(ctype)s
+Name: %(filename)s
 Size: %(size)d bytes
 Desc: %(desc)s
 Url : %(url)s
@@ -155,5 +156,5 @@
     fp.write(decodedpayload)
     fp.close()
     # Now calculate the url
-    url = mlist.GetBaseArchiveURL() + '/attachments/' + file + ext
+    url = 'http://' + mlist.host_name + mlist.GetBaseArchiveURL() + '/attachments/' + file + ext
     return url

-- 
Brought to you by the letters W and B and the number 14.
"It should be illegal to yell 'Y2K' in a crowded economy."
Debian GNU/Linux maintainer of Gimp and GTK+ -- http://www.debian.org/

David Champion

5:34 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

On 2001.10.25, in <87vgh3ayv3.fsf@nausicaa.interq.or.jp>, "Ben Gertzfield" <che@debian.org> wrote:

...

Here's a patch that actually throws out all-HTML emails, but just removes HTML parts.

Actually, why don't we just decode HTML attachments like any other, and let the user beware if they want to click on it? There are lots of legitimate reasons to allow HTML attachments. I can't think of any to allow all-HTML messages. *grin* We could treat all-HTML messages in the same way, just provide a link and let the user beware if they click on it.

Unfortunately, I think there are legitimate reasons for allowing HTML messages (as well as parts) into the record. But I don't think that legitimizes passing the HTML through literally -- this poses a big potential threat to archive viewers.

I don't care to make a full-blown rendering of HTML; I'd argue that it's not Mailman's job -- but it is Mailman's job (or, more precisely, the archiver's job) to provide any text available to the archive viewer. Whether its display is true to the intentions of the poster is subject to endless debate, but HTML is widely expected to be legible even if it's not rendered per specification -- and it almost always is, if you try hard enough -- so I think that the content should be available.

I suggested transliterating the HTML with &lt; and &gt; tokens, to make it harmless but legible, in case there's significant text inside. But, admittedly, that is pretty ugly. What about simply stripping out ALL markup, leaving only bare text -- and perhaps doing some minor interpretation for <br> and <p> tags, just to improve readability? Then throw in a link to the original, as Ben suggests, for good measure.
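Both options mentioned here (escaping the markup, or stripping it while turning <br> and <p> into line breaks) can be sketched with the modern Python standard library. The names transliterate, strip_markup and TextOnly are hypothetical, and none of this is archiver code.

```python
import html
from html.parser import HTMLParser


def transliterate(source):
    """Make markup harmless but still legible by escaping <, > and &."""
    return html.escape(source)


class TextOnly(HTMLParser):
    """Collect bare text, treating <br> and <p> as line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(source):
    """Drop all tags and return the remaining text."""
    parser = TextOnly()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks).strip()
```

One caveat with this sketch: the contents of script and style elements arrive as ordinary data, so a real filter would also have to drop those.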

...

The patch also adds a filename to the replacement payload, so that users can have an idea of what they're going to see if a description was not provided (VERY common).

Ah, filenames. I'd actually like to see the filename stored on the server as requested in the MIME content-disposition. I don't think the archiver needs to guarantee literalism here; a good-faith effort is sufficient. But I think it's significant in many cases, where the transmission filename is really how the file needs to be saved locally. Minimally I'd like the filename to be shown on the archive display, but it'd be nice if I don't need to change the filename in my browser's "save as..." dialog each time I save an attachment.

I'd suggest a very basic sanitizing of the basename of the MIME filename. Something like s!.*[:/\\]!! to remove pathname components for all three major pathname separators, and then (optionally) to either hex-encode the non-alphanumeric symbols, a la HTML, or to replace them with some other token.
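A rough Python rendering of that substitution, as a sketch only: sanitize_filename is a hypothetical helper, and percent-style hex encoding is just one way to read "hex-encode"; the fallback name and the stripping of leading dots are extra assumptions.

```python
import re


def sanitize_filename(name, fallback="attachment"):
    """Keep only the basename and hex-encode anything outside a safe set."""
    # Equivalent of s!.*[:/\\]!! : drop everything up to the last separator.
    base = re.sub(r".*[:/\\]", "", name)
    # Replace each remaining unsafe character with a percent-style hex code.
    safe = re.sub(
        r"[^A-Za-z0-9._-]",
        lambda m: "%%%02X" % ord(m.group()),
        base,
    )
    # Assumption: avoid hidden files and "." / ".." by stripping leading dots.
    safe = safe.lstrip(".")
    return safe or fallback
```

With this, "C:\Docs\map.gif" and "../../map.gif" both reduce to "map.gif", which is the good-faith effort described above rather than a guarantee of literalism.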

-- -D. dgc@uchicago.edu NSIT University of Chicago

Ben Gertzfield

8:17 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

"David" == David Champion <dgc@uchicago.edu> writes:

David> Unfortunately, I think there are legitimate reasons for
David> allowing HTML messages (as well as parts) into the
David> record. But I don't think that legitimizes passing the HTML
David> through literally -- this poses a big potential threat to
David> archive viewers.

Sure. Make it an option, then! I suggest decoding and saving HTML messages and attachments, and making them clickable links, so users are only subjected to their horrors if they click on them.

David> Ah, filenames. I'd actually like to see the filename stored
David> on the server as requested in the MIME
David> content-disposition.

Sure, but duplicates will come in quite quickly; it will be pretty useless as soon as 40 people send in "map.gif", don't you think? We'd have to do filename munging in any case, and that's sticky. What do you suggest, prefix the filename with a number if the original filename is taken?

01-map.gif

Ben

-- Brought to you by the letters M and E and the number 16. "Johnny! Don't go! It's too dangerous!" "I don't care!" Debian GNU/Linux maintainer of Gimp and GTK+ -- http://www.debian.org/

David Champion

8:25 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

On 2001.10.26, in <87hesmbysl.fsf@nausicaa.interq.or.jp>, "Ben Gertzfield" <che@debian.org> wrote:

...

Sure, but duplicates will come in quite quickly; it will be pretty useless as soon as 40 people send in "map.gif", don't you think? We'd have to do filename munging in any case, and that's sticky. What do you suggest, prefix the filename with a number if the original filename is taken?

Oh, I forgot to mention that part: make a subdirectory of the attachments/ directory, whose name is unique and preferably based on some characteristic of the message (message-id, sequence number, etc.) combined with something unlikely to repeat (a hash, a sequence number, etc.).

All that message's attachments would go in there.

That might be desirable anyway -- do you suppose a list might see enough attachments and enough activity that all the attachments make readdirs on the attachments/ directory uncomfortably slow?
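The per-message subdirectory idea, together with the numbering for duplicate filenames raised earlier, might look roughly like this. It is a sketch under assumptions: attachment_dir and unique_name are hypothetical names, and SHA-1 of the Message-ID is just one possible "something unlikely to repeat".

```python
import hashlib
import posixpath


def attachment_dir(archive_root, message_id):
    """Derive a per-message directory name from the Message-ID."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    return posixpath.join(archive_root, "attachments", digest)


def unique_name(existing, filename):
    """Return filename, or a numbered variant if it is already taken."""
    if filename not in existing:
        return filename
    stem, ext = posixpath.splitext(filename)
    n = 1
    while "%s-%04d%s" % (stem, n, ext) in existing:
        n += 1
    return "%s-%04d%s" % (stem, n, ext)
```

Because the directory is keyed on the message, forty different posters can each send "map.gif" without collisions; numbering is only needed when one message carries the same filename twice.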

-- -D. dgc@uchicago.edu NSIT University of Chicago

Ben Gertzfield

8:54 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

"David" == David Champion <dgc@uchicago.edu> writes:

David> Oh, I forgot to mention that part: make a subdirectory of
David> the attachments/ directory, whose name is unique and
David> preferably based on some characteristic of the message
David> (message-id, sequence number, etc.)  combined with
David> something unlikely to repeat (a hash, a sequence number,
David> etc.).

Hm.. It'd need more testing. Either that or *some* kind of hashing based on the message-id will be necessary on any list with a good amount of traffic with attachments. But yes, that's a good idea and we should test it.

Ben

-- Brought to you by the letters N and L and the number 3. "It should be illegal to yell 'Y2K' in a crowded economy." Debian GNU/Linux maintainer of Gimp and GTK+ -- http://www.debian.org/

Dale Newfield

8:32 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

On Fri, 26 Oct 2001, Ben Gertzfield wrote:

...

David> Ah, filenames. I'd actually like to see the filename stored
David> on the server as requested in the MIME
David> content-disposition.
Sure, but duplicates will come in quite quickly; it will be pretty useless as soon as 40 people send in "map.gif", don't you think? We'd have to do filename munging in any case, and that's sticky. What do you suggest, prefix the filename with a number if the original filename is taken?

01-map.gif

How 'bout if there's a directory created for each message that has separate files, and the files are placed inside that directory?

-Dale

Marc MERLIN

1:12 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

On Fri, Oct 26, 2001 at 04:32:51AM -0400, Dale Newfield wrote:

...

How 'bout if there's a directory created for each message that has separate files, and the files are placed inside that directory?

Just for the record, Sourceforge just hit the 16,000 lists limit today and broke because ext2fs doesn't support more than 32,000 links to a directory (archives/private had 32,000 archive dirs). I'm told FreeBSD's UFS has similar problems.

Of course, that can be fixed with Reiserfs/XFS/name your FS here or archives/private/l/li/listname{,.mbox} but my point was that if you create a directory for each attachment, you're going to hit the 32,000 limit very quickly.
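The l/li/listname layout spreads directories over two levels keyed on the first one and two characters of the name. A minimal sketch follows; fanout_path is a hypothetical helper and lower-casing the prefix is an assumption. Presumably the 16,000-list figure corresponds to 32,000 directories because each list has both a listname and a listname.mbox entry.

```python
import posixpath


def fanout_path(root, listname):
    """Map a list name to root/<first char>/<first two chars>/<listname>."""
    key = listname.lower()
    return posixpath.join(root, key[:1], key[:2], listname)
```

For example, fanout_path("archives/private", "listname") gives archives/private/l/li/listname, and the matching mbox directory would be the same call with "listname.mbox".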

As far as I know, most FS don't handle hundreds of thousands of files in the same dir very well, but at least they handle it.

Marc

Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking

Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key

barry＠zope.com

2:43 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

"MM" == Marc MERLIN <marc_news@valinux.com> writes:

MM> Just for the record, Sourceforge just hit the 16,000 lists
MM> limit today and broke because ext2fs doesn't support more than
MM> 32,000 links to a directory (archives/private had 32,000
MM> archive dirs) I'm told freebsd's UFS has similar problems.

Wow, neat! Not for you, but neat. :)

MM> Of course, that can be fixed with Reiserfs/XFS/name your FS
MM> here or archives/private/l/li/listname{,.mbox} but my point
MM> was that if you create a directory for each attachment, you're
MM> going to hit the 32,000 limit very quickly.

Ah, so /that's/ why we have /home/groups/m/ma/mailman... :)

MM> As far as I know, most FS don't handle hundreds of thousands
MM> of files in the same dir very well, but at least they handle
MM> it.

Don't worry, we're not talking about a directory for each attachment, but a directory for each message with an attachment. Hmm, maybe we /should/ worry! 32k messages with attachments sure doesn't seem all that many.

Hmm, have to think about it. -Barry

Nigel Metheringham

10:47 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

On Sat, 2001-10-27 at 03:43, Barry A. Warsaw wrote:

...

Don't worry, we're not talking about a directory for each attachment, but a directory for each message with an attachment. Hmm, maybe we /should/ worry! 32k messages with attachments sure doesn't seem all that many.

Or are the message attachment directories within the normal per-period archive directories? In that case you would have a limit of 16K messages with attachments per archive period and presumably if you have busy lists you decrease the archive period accordingly (ie for the exim lists I rotate the archives weekly).

Nigel.

Marc MERLIN

November 2001

1:34 a.m.

New subject: Having more than 16,000 lists

On Fri, Oct 26, 2001 at 10:43:54PM -0400, Barry A. Warsaw wrote:

...

MM> Just for the record, Sourceforge just hit the 16,000 lists
MM> limit today and broke because ext2fs doesn't support more than
MM> 32,000 links to a directory (archives/private had 32,000
MM> archive dirs) I'm told freebsd's UFS has similar problems.

Wow, neat! Not for you, but neat. :)

BTW, I ended up removing all the HTML archive directories since I've turned off HTML archiving anyway. That gives us a little while (more than six months if we are lucky :-D) before we hit the 32,000 lists mark.

I'm thinking about an optional modification to mailman which only affects list creation that creates the mailman/foo/l/li/listname dirs, and then symlinks all this to mailman/foo/listname. Yeah, symlinks aren't great, but the advantage is that it requires no other changes to the mailman code.

...

MM> Of course, that can be fixed with Reiserfs/XFS/name your FS
MM> here or archives/private/l/li/listname{,.mbox} but my point
MM> was that if you create a directory for each attachment, you're
MM> going to hit the 32,000 limit very quickly.

Ah, so /that's/ why we have /home/groups/m/ma/mailman... :)

Yep :-) (it's also to allow for splitting over more file servers and partitions, and for dealing with amanda's (the backup software) very unfortunate limitation of being unable to back up more than a tape's worth of data per partition)

...

MM> As far as I know, most FS don't handle hundreds of thousands
MM> of files in the same dir very well, but at least they handle
MM> it.
Don't worry, we're not talking about a directory for each attachment, but a directory for each message with an attachment. Hmm, maybe we /should/ worry! 32k messages with attachments sure doesn't seem all that many.

We'd hit that pretty quickly on SF.net :-)

Marc

Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking

Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key

Marc MERLIN

7:06 p.m.

New subject: Having more than 16,000 lists

On Fri, Nov 09, 2001 at 05:34:47PM -0800, Marc MERLIN wrote:

...

On Fri, Oct 26, 2001 at 10:43:54PM -0400, Barry A. Warsaw wrote:

...
MM> Just for the record, Sourceforge just hit the 16,000 lists
MM> limit today and broke because ext2fs doesn't support more than
MM> 32,000 links to a directory (archives/private had 32,000
MM> archive dirs) I'm told freebsd's UFS has similar problems.
Wow, neat! Not for you, but neat. :)
BTW, I ended up removing all the HTML archive directories since I've turned off HTML archiving anyway. That gives us a little while (more than six months if we are lucky :-D) before we hit the 32,000 lists mark.

I'm thinking about an optional modification to mailman which only affects list creation that creates the mailman/foo/l/li/listname dirs, and then symlinks all this to mailman/foo/listname. Yeah, symlinks aren't great, but the advantage is that it requires no other changes to the mailman code.

I haven't yet had the time to work on this, but I wanted to know what your preference was between a hack that optionally creates mailman/foo/l/li/listname dirs at the time the list is created, and then sets symlinks back to where the rest of the mailman code expects to see the files and dirs, or a bigger change to teach all of mailman about the new file and dir location (no more symlinks). Of course, if we go with the no symlink route, making the deep subdir config optional would be a bit more work.
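The symlink variant could be sketched like this. It is an illustration only: create_list_dirs is a hypothetical function, not existing Mailman code, and it rejects one-character list names because such a name would collide with a first-level fan-out directory.

```python
import os


def create_list_dirs(base, listname):
    """Create base/l/li/listname and symlink base/listname to it."""
    if len(listname) < 2:
        # A one-character name would clash with the base/<char> directory.
        raise ValueError("list name too short for this layout")
    deep = os.path.join(base, listname[:1], listname[:2], listname)
    os.makedirs(deep, exist_ok=True)
    link = os.path.join(base, listname)
    if not os.path.islink(link):
        # A symlink is an ordinary entry in base; unlike a subdirectory it
        # adds no ".." hard link back to base.
        os.symlink(os.path.abspath(deep), link)
    return deep, link
```

Code that opens base/listname keeps working through the link, which is the "no other changes" advantage; the cost is that the flat directory still holds one entry per list.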

What's your take on this, and which route would you prefer?

Marc

Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking

Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key

barry＠zope.com

December 2001

4:54 a.m.

New subject: Having more than 16,000 lists

"MM" == Marc MERLIN <marc_news@valinux.com> writes:

MM> I haven't yet had the time to work on this, but I wanted to
MM> know what your preference was between a hack that optionally
MM> creates mailman/foo/l/li/listname dirs at the time the list is
MM> created, and then sets symlinks back to where the rest of the
MM> mailman code expects to see the files and dirs, or a bigger
MM> change to teach all of mailman about the new file and dir
MM> location (no more symlinks) Of course, if we go with the no
MM> symlink route, making the deep subdir config optional would be
MM> a bit more work.

MM> What's your take on this, and which route would you prefer?

At this point, whatever causes the least disruption of the code base the better. ;)

-Barry

Jay R. Ashworth

October 2001

2:56 p.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

On Fri, Oct 26, 2001 at 12:34:53AM -0500, David Champion wrote:

...

I don't care to make a full-blown rendering of HTML; I'd argue that it's not Mailman's job -- but it is Mailman's job (or, more precisely, the archiver's job) to provide any text available to the archive viewer. Whether its display is true to the intentions of the poster is subject to endless debate, but HTML is widely expected to be legible even if it's not rendered per specification -- and it almost always is, if you try hard enough -- so I think that the content should be available.

You might shell out to Lynx; I believe it has enough switches to render the HTML into ASCII and be restrained from doing anything nasty.

You'd then depend on Lynx, of course, but that doesn't seem too hasslacious.

Cheers, -- jra

Jay R. Ashworth jra@baylink.com Member of the Technical Staff Baylink RFC 2100 The Suncoast Freenet The Things I Think Tampa Bay, Florida http://baylink.pitas.com +1 727 804 5015

"Usenet: it's enough to make you loose your mind." -- me

barry＠zope.com

9:22 p.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

Folks,

Thanks for the really great feedback. I'm about to check in a new version of Scrubber.py that addresses the many issues brought up. Apologies for not quoting everything.

- permission problems: fixed

- problems with multipart/mixed containing gif, html, and jpeg parts: fixed.

- text/html decoding: there's now a new global variable ARCHIVE_HTML_SANITIZER which can be 0, 1, or a string.

# This variable defines what happens to text/html subparts. They can be
# stripped completely, escaped, or filtered through an external program. The
# legal values are:
# 0 - Strip out text/html parts completely, leaving a notice of the removal in
#     the message. If the outer part is text/html, the entire message is
#     discarded.
# 1 - Remove any embedded text/html parts, leaving them as HTML-escaped
#     attachments which can be separately viewed. Outer text/html parts are
#     simply HTML-escaped.
#
# The value can also be a string, in which case it is the name of a command to
# filter the HTML page through. The resulting output is left in an attachment
# or as the entirety of the message when the outer part is text/html. The
# format of the string must include a "%(filename)s" which will contain the
# name of the temporary file that the program should operate on. It should
# write the processed message to stdout.
ARCHIVE_HTML_SANITIZER = '/usr/bin/lynx -dump %(filename)s'

  This seems to work pretty well (will provide examples shortly). As with the rest of Scrubber, it's a bit of a kludge, but perhaps not horrible. It could definitely use more testing by you guys.

  It's actually rather difficult to get Pipermail to /not/ HTML-escape attachments, so I'm punting on that for now. Plus, I just feel it's way too dangerous to support.

- storing in get_filename() if available: fixed, and I've also implemented the idea of sticking each message's attachments in a separate subdir off of archives/private/mylist/attachments. The subdir is based on the Message-ID: and files inside there are uniquified if necessary.

- problems with the attachment url: what we really needed was a more elaborate PUBLIC_ARCHIVE_URL format string. It now accepts %(hostname)s as well as %(listname)s, and the former gets interpolated with the list's web host name (as looked up in the inverted VIRTUAL_HOSTS dictionary, and defaulting to DEFAULT_URL_HOST).
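How a 0 / 1 / string setting like ARCHIVE_HTML_SANITIZER might be dispatched can be sketched as below. plan_html_handling is a hypothetical helper, not the Scrubber.py code; it only builds the command line and never runs it, and the shell-style quoting of the filename is an added assumption.

```python
import html
import shlex

REMOVED_NOTICE = "An HTML attachment was scrubbed and removed"


def plan_html_handling(setting, html_text, filename):
    """Return (action, value) for a text/html part under the given setting."""
    if isinstance(setting, str):
        # String value: interpolate the temp file name into the command.
        command = setting % {"filename": shlex.quote(filename)}
        return ("filter", shlex.split(command))
    if setting == 0:
        return ("strip", REMOVED_NOTICE)
    if setting == 1:
        return ("escape", html.escape(html_text))
    raise ValueError("unsupported setting: %r" % (setting,))
```

With the lynx example value and a temp file of /tmp/part.html, the "filter" action yields an argument list that could be handed to a subprocess call, whose stdout would then become the attachment or message body.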

Watch for checkins shortly. -Barry

Ben Gertzfield

1:35 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

"BAW" == Barry A Warsaw <barry@zope.com> writes:

BAW> - text/html decoding: there's now a new global variable
BAW> ARCHIVE_HTML_SANITIZER which can be 0, 1, or a string.

BAW> ARCHIVE_HTML_SANITIZER = '/usr/bin/lynx -dump %(filename)s'

This is fine, but it's not going to be default, right? I think it should definitely be set to 1 by default. That way, no information is lost. Does it save the payload if the entire message is text/html?

Ben

-- Brought to you by the letters Z and A and the number 3. "Hoosh is a kind of soup." Debian GNU/Linux maintainer of Gimp and GTK+ -- http://www.debian.org/

barry＠zope.com

3:18 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

"BG" == Ben Gertzfield <che@debian.org> writes:

BG> This is fine, but it's not going to be default, right?  I
BG> think it should definitely be set to 1 by default.  That way,
BG> no information is lost.  Does it save the payload if the
BG> entire message is text/html?

I realize we need 3 default values:

0 - discard
1 - an html-escaped attachment
2 - an html-escaped inline

Unfortunately, #1 and #2 won't escape them the same way. Pipermail's doing #2 and Scrubber's doing #1. Pipermail actually does a better job here, but I've improved Scrubber's default escaping. Still looks ugly as sin to me, so I'd probably just set it to 0 for my lists, but there you have it.

1 will be the default.

-Barry

Marc MERLIN

12:59 a.m.

New subject: New Pipermail hacks (was Re: Ok, it works! ...)

On Thu, Oct 25, 2001 at 12:48:34AM -0400, Barry A. Warsaw wrote:

...

Included in this description is a url to the attachment file, which Pipermail will hyperlink. One drawback here is that if archives are switched from public to private, or vice versa, all the attachment urls will break. But you could re-run bin/arch to regenerate the whole thing -- the key being that Scrubber works only on a copy of the message being prepped for the archiver, /not/ on the message being saved in the mbox.

I have a solution to this: Reference the private URL. Have the private cgi issue a redirect to the public archive URL if it detects that the list has public archives (after parsing config.db)
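The redirect idea can be illustrated with a small decision function. Everything here is an assumption made for the sake of the sketch: the PUBLIC_ARCHIVE_URL value, the /mailman/private prefix, and the resolve_private_request name are illustrative, and the real CGI would read the public/private flag from the list's configuration.

```python
# Assumed values, for illustration only.
PUBLIC_ARCHIVE_URL = "http://%(hostname)s/pipermail/%(listname)s"
PRIVATE_PREFIX = "/mailman/private"


def resolve_private_request(listname, hostname, relpath, archive_private):
    """Return (status, location) for a request that reached the private CGI."""
    if archive_private:
        # Serve from the private archive as usual.
        return (200, "%s/%s/%s" % (PRIVATE_PREFIX, listname, relpath))
    # Public archives: send the browser to the public URL instead.
    base = PUBLIC_ARCHIVE_URL % {"hostname": hostname, "listname": listname}
    return (302, "%s/%s" % (base, relpath))
```

Under this scheme the attachment links written into the archive always point at the private URL, so switching a list between public and private would not break them.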

Your ideas about de-miming and referencing attachments stored on the server are perfect. This will solve many problems I've had and seen.

Thanks Marc

Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking

Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key
