Mailman 3 How to wrap text in archived messages - Mailman-Users

How to wrap text in archived messages

Mark Dale

May 23, 2022

1:20 a.m.

Hi,

I'm looking for a way to wrap lines in archived messages.

Messages from some mail clients (eg. Gmail) have their lines wrapped to 72 chars in the archived version, while archived messages from others (eg. Thunderbird, Outlook) display unwrapped lines forcing the reader to scroll horizontally.

Looking at the HTML page source -- in both cases (wrapped and unwrapped) I see the message content is enclosed by PRE tags.

<HR>  <PRE> Lorem ipsum dolor sit amet, consectetur ...

Ut enim ad minim veniam, quis nostrud ... </PRE>  <HR>

The template (article.html) contains the following: ... <HR>  %(body)s  <HR>
...

From what I can figure out, the PRE tags come from .../mailman/Mailman/Archiver/HyperArch.py in a block of code lines 1290 to 1314 ...

///////////////////////////////////////////

def format_article(self, article):
    # called from add_article
    # TBD: Why do the HTML formatting here and keep it in the
    # pipermail database?  It makes more sense to do the html
    # formatting as the article is being written as html and toss
    # the data after it has been written to the archive file.
    lines = filter(None, article.body)
    # Handle <HTML> </HTML> directives
    if self.ALLOWHTML:
        self.__processbody_HTML(lines)
    self.__processbody_URLquote(lines)
    if not self.SHOWHTML and lines:
        lines.insert(0, '<PRE>')
        lines.append('</PRE>')
    else:
        # Do fancy formatting here
        if self.SHOWBR:
            lines = map(lambda x:x + "<BR>", lines)
        else:
            for i in range(0, len(lines)):
                s = lines[i]
                if s[0:1] in ' \t\n':
                    lines[i] = '<P>' + s
    article.html_body = lines
    return article

////////////////////////////

And the lines in that block that seem responsible for the PRE tags are ...

        lines.insert(0, '<PRE>')
        lines.append('</PRE>')

My question is: Can those PRE tags be removed and replaced with something equivalent to PHP's "nl2br" (which inserts a line break BR in place of new line entries)?

A Google search for such an equivalent gives me ...

def nl2br(s): return '<br />\n'.join(s.split('\n'))

With zero understanding of Python my attempts to implement this have failed so far and I may well be barking up the wrong tree completely. Any clues or pointers gratefully received.

Thanks.


Stephen J. Turnbull

May 2022

7:34 a.m.

Mark Dale via Mailman-Users writes:

...

I'm looking for a way to wrap lines in archived messages.

Executive summary: There's not really a good way to do this. It's extremely complicated, *especially* in email (as opposed to most "normal" text) because of quoting conventions in email.

...

With zero understanding of Python my attempts to implement this have failed so far and I may well be barking up the wrong tree completely. Any clues or pointers gratefully received.

It's not your lack of Python, it's that reliably reformatting email for different formats of email is a *very* hard problem in natural language processing, and requires some knowledge of message user agent internals. And that's why Pipermail punts by just wrapping the whole thing in a PRE element. Works for Mutt users (= Unix email elders).

Gory details follow (because I think it's an interesting problem!)

...

Looking at the HTML page source -- in both cases (wrapped and unwrapped) I see the message content is enclosed by PRE tags.

Right. PRE is not very pretty as HTML goes, but it works OK for all RFC-conforming text/plain email. I assume that in fact this comes from text/plain parts created by the author's MUA, because the agents that we use to transform a text/html part to text/plain will format to a reasonable width such as 72 characters.

...

And the lines in that block that seem responsible for the PRE tags are ...
        lines.insert(0, '<PRE>')
        lines.append('</PRE>')
        
My question is: Can those PRE tags be removed and replaced with something equivalent to PHP's "nl2br" (which inserts a line break BR in place of new line entries)?

No, because there *are no* newlines to break those very long lines. These MUAs use newline to mean "paragraph break", not "line break".

You might get a better result in these messages by removing the "PRE" tags, and wrapping each line with "<P>...</P>", but that's a real hack, and almost certain to make RFC-conforming email look quite ugly, because every line becomes a paragraph, and you'll lose all indentation. Eg, in the code blocks you posted, all the lines will end up flush left. If your members are posting code or poetry, or using indented block quotations etc, they're likely to be extremely unhappy with the result.

Python's standard library does have a textwrap module, but I'm not at all sure it's suitable for this. If you know that the long lines of a message are actually paragraphs, you can use something like

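from textwrap import wrap
# work backward because wrapping changes indices of later lines
for i in range(len(lines) - 1, -1, -1):
    NDT = detect_prefix(lines[i])
    # wrap the text after the prefix; wrap() re-adds NDT to every output line
    wrapped = wrap(lines[i][len(NDT):], initial_indent=NDT, subsequent_indent=NDT)
    # wrap() returns [] for a blank line, so leave such lines as they are
    lines[i:i+1] = wrapped or [lines[i]]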

If a line is indented or has a quoting prefix, you have to detect that for yourself and set NDT to that prefix. Something like

import re
prefix_re = re.compile('[ >]*')
def detect_prefix(line):
    m = prefix_re.match(line)
    return m.group(0)

should capture most indentation and quoting prefixes, but there are other conventions.
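Putting the two fragments above together, a minimal self-contained sketch might look like the following. The function name `rewrap_lines` and the default width of 72 are assumptions for illustration, not part of Pipermail, and the pattern here also accepts tabs. The sketch strips the detected prefix before wrapping so that it is not doubled, and it keeps blank lines, since `textwrap.wrap` returns an empty list for them.

```python
import re
from textwrap import wrap

# Leading run of spaces, tabs and '>' quote markers (an assumed convention).
PREFIX_RE = re.compile(r'[ \t>]*')

def detect_prefix(line):
    """Return the indentation/quoting prefix of a line (possibly empty)."""
    return PREFIX_RE.match(line).group(0)

def rewrap_lines(lines, width=72):
    """Wrap each long line to `width`, repeating its prefix on every piece."""
    out = []
    for line in lines:
        text = line.rstrip('\n')
        prefix = detect_prefix(text)
        pieces = wrap(text[len(prefix):], width=width,
                      initial_indent=prefix, subsequent_indent=prefix)
        # wrap() returns [] for blank lines; keep them as paragraph breaks.
        out.extend(pieces or [text])
    return out
```

For example, `rewrap_lines(['> ' + 'word ' * 30], width=40)` returns several lines, each beginning with `> `. Other quoting conventions (initials before the `>`, for instance) would still slip through.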

Whether you use P elements or the textwrap module, it's probably a good idea to find out how long the long lines are, and what percentage of the message they are, and avoid trying to wrap a message that looks like it "mostly" has lines of reasonable length. If you don't, and your target is the old "typewriter standard" width of 66, and somebody using an RFC-conforming MUA just prefers 72, you'll reformat their mail into alternating lines of about 60 characters and 10 characters. Yuck ...
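One way to sketch that check is shown below. The function name `looks_unwrapped`, the 80-character limit and the 20% threshold are all assumptions chosen for illustration; suitable values would depend on the list's traffic.

```python
def looks_unwrapped(lines, limit=80, min_fraction=0.2):
    """Guess whether a message uses one long line per paragraph.

    Returns True when at least `min_fraction` of the non-blank lines
    are longer than `limit` characters.
    """
    nonblank = [line.rstrip('\n') for line in lines if line.strip()]
    if not nonblank:
        return False
    long_count = sum(1 for line in nonblank if len(line) > limit)
    return long_count / float(len(nonblank)) >= min_fraction
```

A message whose lines are all 72 characters or shorter is left alone by this test, which avoids the alternating long and short lines described above.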

Which of the above would work better for you depends a lot on the typical content of your list. But issues with quoting and indentation are likely to have you tearing your hair out.

Steve

Mark Dale

12:13 p.m.

...

...
I'm looking for a way to wrap lines in archived messages....

...

...
And the lines in that block that seem responsible for the PRE tags are ...
        lines.insert(0, '<PRE>')
        lines.append('</PRE>')
        
My question is: Can those PRE tags be removed and replaced with something equivalent to PHP's "nl2br" (which inserts a line break BR in place of new line entries)?

...

No, because there *are no* newlines to break those very long lines. These MUAs use newline to mean "paragraph break", not "line break".

But there are "newlines" and there isn't any need to insert linebreaks into those long lines -- they just need to wrap.

Looking at, for example, a message that originally has 3 paragraphs of text:

%(body)s

<PRE> First paragraph that is a really long line of text.

Second paragraph that is a really long line of text.

Third paragraph that is a really long line of text. </PRE>

If those PRE tags are removed, then all 3 lines get joined up and displayed as one continuous line. It solves the non-wrap problem but it loses the "paragraphs". So that's a no go.

As said, if this was PHP we could use the "nl2br" function - which inserts line breaks before all newlines.

<?= nl2br($body) ?> -- would give us ...

///////////

First paragraph of body that is a really long line of text.<br />
<br />
Second paragraph of body that is a really long line of text.<br />
<br />
Third paragraph of body that is a really long line of text.

////////

The BR tags would preserve the space between the lines and give the appearance of paragraphs in the HTML Pipermail archive page. Granted the HTML would not be strictly kosher, but then neither are the PRE tags strictly kosher as they are not being used as they should be. The main thing is that the lines would wrap according to the width of the window and eliminate the need for horizontal scrolling.

...

You might get a better result in these messages by removing the "PRE" tags, and wrapping each line with "<P>...</P>", but that's a real hack, and almost certain to make RFC-conforming email look quite ugly, because every line becomes a paragraph, and you'll lose all indentation. Eg, in the code blocks you posted, all the lines will end up flush left. If your members are posting code or poetry, or using indented block quotations etc, they're likely to be extremely unhappy with the result.

Agreed. Too horrible to even think about.

...

Python's standard library does have a textwrap module, but I'm not at all sure it's suitable for this. If you know that the long lines of a message are actually paragraphs, you can use something like

from textwrap import wrap
# work backward because wrapping changes indices of later lines
for i in range(len(lines) - 1, -1, -1):
    NDT = detect_prefix(lines[i])
    # wrap the text after the prefix; wrap() re-adds NDT to every output line
    wrapped = wrap(lines[i][len(NDT):], initial_indent=NDT, subsequent_indent=NDT)
    # wrap() returns [] for a blank line, so leave such lines as they are
    lines[i:i+1] = wrapped or [lines[i]]

This is sort of where I was looking to go, but as you've pointed out, there's no telling if the text will be in paragraphs, code blocks etc.

Does the Python code snippet that I mentioned ...

def nl2br(s): return '<br />\n'.join(s.split('\n'))

... make any sense as a Python equivalent of PHP's "nl2br" function (to accomplish the insertion of the BR line break tags)?
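For reference, a minimal Python sketch of an nl2br-style helper might look like this. It assumes the body is a single string that has already been HTML-escaped, and the `<br />` spelling mirrors what PHP's function emits by default. How it would plug into format_article, which works on a list of lines, is not covered here.

```python
def nl2br(s):
    """Insert an HTML line break before every newline in s."""
    return '<br />\n'.join(s.split('\n'))

body = "First paragraph.\n\nSecond paragraph."
print(nl2br(body))
```

A long paragraph with no internal newline passes through unchanged, so any wrapping would be done by the browser once the PRE tags are gone, not by this function.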

Cheers, Mark

Mark Dale

12:26 p.m.

...

...
You might get a better result in these messages by removing the "PRE" tags, and wrapping each line with "<P>...</P>", but that's a real hack, and almost certain to make RFC-conforming email look quite ugly, because every line becomes a paragraph, and you'll lose all indentation. Eg, in the code blocks you posted, all the lines will end up flush left. If your members are posting code or poetry, or using indented block quotations etc, they're likely to be extremely unhappy with the result.

D'oh! I just saw the error in my whole way of thinking.

And of course, any such "nl2br" equivalent will do exactly the same as wrapping with P tags -- with everything left aligned.

But thank you Steve, for taking the time and trouble to explain. It is indeed a whole can of worms. Much to learn.

/Mark


Stephen J. Turnbull

12:25 p.m.

Mark Dale via Mailman-Users writes:

...

And of course, any such "nl2br" equivalent will do exactly the same as wrapping with P tags -- with everything left aligned.

Right.

I'm not sure that we couldn't do better nowadays with libraries that will handle the same DOM that browsers do, but it certainly wasn't possible in 1994. And even with those libraries it would require a complete rearchitecture of the archiver.

Steve

Mark Dale

2:32 a.m.

...

Mark Dale via Mailman-Users writes:

...
And of course, any such "nl2br" equivalent will do exactly the same as wrapping with P tags -- with everything left aligned.

...

From: Stephen J. Turnbull [mailto:stephenjturnbull@gmail.com]

Right.

I'm not sure that we couldn't do better nowadays with libraries that will handle the same DOM that browsers do, but it certainly wasn't possible in 1994. And even with those libraries it would require a complete rearchitecture of the archiver.

Steve

I did some tests with Sympa (v6.2.88) which uses MHonArc (v2.6.19): it does a pretty good job of rendering archived HTML messages with things like lists, code, etc -- and eliminating the need for the horizontal scrolling.

That got me Googling for How-to's on integrating MHonArc with Mailman. There's a fair bit of conversation around this from days long ago, and a patch for using MHonArc written by Mark S. back in 2014.

Before I go down this rabbit hole: was there any particular reason (back in the day) that Pipermail was favoured (and implemented) over MHonArc?

Mark Sapiro

4:12 a.m.

On 5/25/22 19:32, Mark Dale via Mailman-Users wrote:

...

That got me Googling for How-to's on integrating MHonArc with Mailman. There's a fair bit of conversation around this from days long ago, and a patch for using MHonArc written by Mark S. back in 2014.

I didn't write that patch. It's from Richard Barrett who also created a patch for HtDig integration for archive searches. There are three branches at https://code.launchpad.net/~msapiro/mailman/mhonarc, https://code.launchpad.net/~msapiro/mailman/htdig and https://code.launchpad.net/~msapiro/mailman/htdig_mhonarc which are up to date with https://code.launchpad.net/~mailman-coders/mailman/2.1 with the mhonarc, htdig and both patches applied respectively. I never used the mhonarc or htdig_mhonarc branches, but I did use the htdig branch for a production Mailman 2.1 installation.

...

Before I go down this rabbit hole: was there any particular reason (back in the day) that Pipermail was favoured (and implemented) over MHonArc?

Mailman was initially implemented by John Viega in the mid 1990s to manage a mailing list for fans of the Dave Matthews Band. I don't know why pipermail was chosen, but MHonArc was fairly new at that time and pipermail was probably more mature.

-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Stephen J. Turnbull

5:37 a.m.

Mark Sapiro writes:

...

On 5/25/22 19:32, Mark Dale via Mailman-Users wrote:

...

...
Before I go down this rabbit hole: was there any particular reason (back in the day) that Pipermail was favoured (and implemented) over MHonArc?

Mailman was initially implemented by John Viega in the mid 1990s to manage a mailing list for fans of the Dave Matthews Band. I don't know why pipermail was chosen, but MHonArc was fairly new at that time and pipermail was probably more mature.

I doubt that Hypermail (the original code) was more mature than MHonArc, but it was written in Python and so considered more appropriate for bundling with Mailman. "Batteries included" has always been a goal for Python, and Mailman inherited it. MHonArc is a Perl script, written in an idiosyncratic style (so I was told by a Perlmonger), and I myself would have to punt on most maintenance questions for MHonArc despite admining lists using it for a decade.

I can't speak for other members of the team, they may be more capable with Perl than I am. But to me, this would be the main issue with using MHonArc -- you'd likely be dependent on somebody other than us if you need help with it. (Probably you could get some help on this list, though.)

As far as I can recall, Pipermail has never been especially recommended over 3rd party solutions like MHonArc or external services like mailarchive.com, it's just easy to use because it's guaranteed to be there, we provide support for it, and it turned out to be "good enough" for an awful lot of lists.

Regards, Steve

Mark Dale

1:09 a.m.

...

...
...
Before I go down this rabbit hole: was there any particular reason (back in the day) that Pipermail was favoured (and implemented) over MHonArc?

Mailman was initially implemented by John Viega in the mid 1990s to manage a mailing list for fans of the Dave Matthews Band. I don't know why pipermail was chosen, but MHonArc was fairly new at that time and pipermail was probably more mature.

...

I doubt that Hypermail (the original code) was more mature than MHonArc, but it was written in Python and so considered more appropriate for bundling with Mailman. ... As far as I can recall, Pipermail has never been especially recommended over 3rd party solutions like MHonArc or external services like mailarchive.com, it's just easy to use because it's guaranteed to be there, we provide support for it, and it turned out to be "good enough" for an awful lot of lists.

Many thanks Steve and Mark.

David Andrews

2:54 a.m.

At 11:12 PM 5/25/2022, Mark Sapiro wrote:

...

On 5/25/22 19:32, Mark Dale via Mailman-Users wrote:

...
That got me Googling for How-to's on integrating MHonArc with Mailman. There's a fair bit of conversation around this from days long ago, and a patch for using MHonArc written by Mark S. back in 2014.

I didn't write that patch. It's from Richard Barrett who also created a patch for HtDig integration for archive searches. There are three branches at https://code.launchpad.net/~msapiro/mailman/mhonarc, https://code.launchpad.net/~msapiro/mailman/htdig and https://code.launchpad.net/~msapiro/mailman/htdig_mhonarc which are up to date with https://code.launchpad.net/~mailman-coders/mailman/2.1 with the mhonarc, htdig and both patches applied respectively. I never used the mhonarc or htdig_mhonarc branches, but I did use the htdig branch for a production Mailman 2.1 installation.

For what it is worth, a number of years ago I installed the HtDig patch, and it worked. I must say, also, that I have a cPanel installation. Search worked, and users liked it, but I had another problem with cPanel, don't even remember what. The cPanel folks said they would not give me technical support as long as I used HTDIG, so now I have no search, a problem with a system with over 300 lists.

Dave

Stephen J. Turnbull

June 2022

8:25 a.m.

David Andrews writes:

...

Search worked, and users liked it, but I had another problem with cPanel, don't even remember what. The cPanel folks said they would not give me technical support as long as I used HTDIG,

Condolences. It's not clear to me that that is any of their business (unless you need tech support for htdig itself, in which case I can't really get upset with them, YMMV), though.

...

so now I have no search, a problem with a system with over 300 lists.

I don't know what your situation is (eg, private lists), but for your public or private-but-not-sensitive lists, mail-archive.com[1] might be an option. Somewhat inconvenient to have to switch back and forth for full archives (they strip images and most other attachments) vs. search capability, but better than nothing.

Unfortunately, it's a hobby site, so I doubt they provide any service at all for private lists. At least their FAQ is of extremely high quality, I think you can probably trust them to be quite up to the task for you, if you have lists with minimal requirements for privacy, and don't need attachments stored and archived there.

Steve

Footnotes: [1] mailarchive.com seems to live in Poland, and otherwise I couldn't read their page. :^)
