[Bug 1643210] [NEW] 'from_is_list' does not RFC2047 encode correctly when translation contains non-ascii char

Public bug reported:
If the from_is_list feature is used, the From: header's `realname' field is composed from the original realname and the translation of '%(realname)s via %(lrn)s', which may contain non-ASCII characters.
The realname field is encoded before composition if necessary, but the translation part is not. So the From: header may contain raw non-ASCII characters.
To fix this, do the RFC 2047 encoding after composition.
(There is another bug: if the server's language setting and the mailing list's preferred language differ, the translation is taken from the server's language, not the list's. The attached patch also fixes this.)
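
The difference between encoding before and after composition can be illustrated with a minimal sketch. It uses Python 3's standard email.header module rather than Mailman 2.1's actual CookHeaders.py code (which runs on Python 2). The function names and the translated template are hypothetical and are not taken from any Mailman catalog.

```python
from email.header import Header, decode_header, make_header

# Hypothetical translation of '%(realname)s via %(lrn)s' that contains a
# non-ASCII character; not taken from any real Mailman catalog.
TEMPLATE = "%(realname)s v\u00eda %(lrn)s"


def encode_before_compose(realname, lrn):
    """Problematic order: only the realname is RFC 2047 encoded."""
    encoded_name = Header(realname, "utf-8").encode()
    return TEMPLATE % {"realname": encoded_name, "lrn": lrn}


def encode_after_compose(realname, lrn):
    """Proposed order: compose in unicode first, then encode everything."""
    composed = TEMPLATE % {"realname": realname, "lrn": lrn}
    return Header(composed, "utf-8").encode()


def is_ascii(text):
    """Return True if every character of text is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in text)


def decode(value):
    """Decode an RFC 2047 encoded header value back to unicode."""
    return str(make_header(decode_header(value)))
```

With the first function the translated 'via' text stays in the header as raw non-ASCII; with the second the whole display name is a single encoded value.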
** Affects: mailman Importance: Undecided Status: New
** Attachment added: "CookHeaders.py.diff.txt" https://bugs.launchpad.net/bugs/1643210/+attachment/4780159/+files/CookHeade...
** Branch linked: lp:mailman/2.1

My patch in #1 tried to adjust to the charset/encoding of the list's preferred language. But I realized that was a misunderstanding; it is better to adjust to the sender's preference. In any case, the charset of the original sender's `realname', the charset of the translation of '%(realname)s via %(lrn)s', and the charset of the list's preferred language can all differ from each other.
** Attachment added: "CookHeaders.py.diff.txt" https://bugs.launchpad.net/mailman/+bug/1643210/+attachment/4780389/+files/C...

I'm having trouble understanding what the problem is. There are 3 possible sources of realname. In order of preference, the display name in the message's From: header if any; if not and the From: address is a list member, the list member's username if any, and if not, the local part of the sender's email address.
In the first case, the real name should already be RFC 2047 encoded in the incoming message and that will be the resultant value in the munged From: header. In the other cases if the name contains non-ascii, it will be RFC 2047 encoded in the character set of the list's preferred language (or maybe utf-8 if the real name is a unicode).
It seems all this should be OK.
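
The order of preference described above can be sketched roughly as follows. This is an illustration only, not Mailman's code: the function name and the member_names dictionary are hypothetical stand-ins for the list's membership data.

```python
from email.utils import parseaddr


def pick_realname(from_header, member_names):
    """Return the display name to use, following the order of preference.

    member_names is a hypothetical dict mapping member addresses to the
    names registered with the list (possibly empty strings).
    """
    display_name, address = parseaddr(from_header)
    if display_name:
        # 1. Display name from the message's From: header (it may still be
        #    RFC 2047 encoded at this point).
        return display_name
    member_name = member_names.get(address.lower())
    if member_name:
        # 2. The list member's name, if the sender is a member and set one.
        return member_name
    # 3. Otherwise, the local part of the sender's email address.
    return address.split("@", 1)[0]
```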
Please provide an actual From: header and possibly relevant list settings that illustrate the problem.
Or is the issue that the list's real_name (lrn in the code) is not RFC 2047 encoded? I see that, but I think the fix is simply to replace the one line
lrn = mlist.real_name
with
lrn = str(uheader(mlist, mlist.real_name))
** Changed in: mailman Status: New => Incomplete

I see some issues with the simple 'fix' I suggested above. Namely the translation of 'via' is not RFC 2047 encoded and there would probably be missing whitespace issues due to mixing of text and RFC 2047 encoded words, but it still seems to me that something like the attached should do.
** Attachment added: "Suggested patch." https://bugs.launchpad.net/mailman/+bug/1643210/+attachment/4780663/+files/C...

Further testing of my suggested patch shows it doesn't work in all cases. I'll continue to look at this.

Suppose the sender is a member whose preferred language is ja, the list's language is French, and the original From: is 'From: =?UTF-8?B?5LqM5pyoIOmdluS7gQ==?= futatuki@poem.co.jp'. The translation of '%(realname)s via %(lrn)s' is taken from the sender's language context, ja, and its charset is euc-jp. The display name of the original From: cannot be encoded entirely in iso-8859-1, and the translation of the 'via' part is misinterpreted as iso-8859-1. For the latter problem, we should encode it in the charset of the sender's preferred language (at least the translation of the 'via' part).
I don't want to break the realname even if it cannot be encoded in the charset of his/her preferred language, so for a simple fix I chose to give up translating the 'via' part. This will happen for language settings whose charset/encoding is not UTF-8.

The lrn part is always us-ascii, so it is a valid string in all encodings currently supported by Mailman.

It seems my patch in #2 also doesn't work if the sender is not a list member.

I need to look at this further, and I won't have time for a day or two, but I will. Note however that the sender's display name in the message is already RFC 2047 encoded in some character set which may not be either of the character sets of the list's preferred language or the sender's preferred language if the sender is even a list member. I will be looking at this further when I have time. I appreciate your input and we will get it right.

I have attached the results of my latest effort. For the From:, this gets the i18n translation of the '%(realname)s via %(lrn)s' string with dummy substitutions, converts it to unicode and substitutes unicode values for the substitutions and encodes it all as utf-8. I have tested this to some extent and I think it's good, but I would appreciate additional testing. Please try this and see if it works in your environment or report any issues.
As far as the encoding of _('(no subject)') is concerned, my change ensures this is translated in the list's language, not the poster's.
Again, thanks for your help with this.
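
The From: handling described in this comment can be sketched as below, using Python 3's email.header instead of the Python 2 code in the attached patch. The CATALOG dictionary is a hypothetical stand-in for Mailman's i18n lookup, and the Japanese wording is inferred from the encoded result quoted later in this thread, not copied from Mailman's message catalog.

```python
from email.header import Header, decode_header, make_header

# Hypothetical stand-in for the i18n catalog: byte-string translations of
# '%(realname)s via %(lrn)s' plus the charset each one is stored in.
CATALOG = {
    "en": (b"%(realname)s via %(lrn)s", "us-ascii"),
    "ja": ("%(realname)s (%(lrn)s \u7d4c\u7531)".encode("euc-jp"), "euc-jp"),
}


def decode_rfc2047(value):
    """Decode an RFC 2047 encoded header value to unicode."""
    return str(make_header(decode_header(value)))


def munged_display_name(encoded_realname, lrn, lang):
    """Build the 'realname via list' display name, encoded as UTF-8."""
    raw_template, charset = CATALOG[lang]
    # Convert the translated template to unicode before substituting, so
    # values in other charsets are never mixed into a byte string.
    utemplate = raw_template.decode(charset)
    # The sender's display name arrives RFC 2047 encoded in some charset.
    urealname = decode_rfc2047(encoded_realname)
    uvia = utemplate % {"realname": urealname, "lrn": lrn}
    # Encode the composed unicode string as a whole.
    return Header(uvia, "utf-8").encode()
```

Because everything is unicode before the final step, the sender's charset, the translation's charset, and the output charset no longer need to agree.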
** Attachment added: "Another attempt at a fix." https://bugs.launchpad.net/mailman/+bug/1643210/+attachment/4782228/+files/C...
** Changed in: mailman Status: Incomplete => In Progress
** Changed in: mailman Assignee: (unassigned) => Mark Sapiro (msapiro)

Thank you for the better fix. The fix in #10 also works fine in my environment, except for one small issue: it always encodes non-ASCII text as UTF-8, even if the sender's preferred language is the same as the list's and its encoding is not UTF-8.
A test case.
  list's language : fr (iso-8859-1)
  sender's language : fr (iso-8859-1)
  sender's display name : =?iso-8859-1?q?G=E9n=E9rales?=
  (results) From: =?utf-8?q?G=C3=A9n=C3=A9rales_via_Mailman-test?= <...>
Another case.
  list's language : ja (euc-jp, outgoing messages are encoded to iso-2022-jp)
  sender's language : ja (euc-jp, outgoing messages are encoded to iso-2022-jp)
  sender's display name : =?ISO-2022-JP?B?GyRCRnNMWkx3P04bKEI=?=
  (results) From: =?utf-8?b?5LqM5pyo6Z2W5LuBIChNYWlsbWFuLXRlc3Qg57WM55SxKQ==?= <...>
This seems to be no problem for almost all MUAs nowadays, except some l10n MUAs (and those MUAs will treat such encoded strings as raw ASCII strings, as described in the RFC, so I think the problem is small).

How about using "dn = str(Header(uvia, lcs))" instead of "dn = str(Header(uvia, 'utf-8'))"? As the variable uvia is always unicode, there is no risk of mistaking its encoding. Header() treats the charset parameter only as a hint, so it uses 'utf-8' as the fallback if it fails to encode to lcs.
test case 1.
  list's language : fr (iso-8859-1)
  sender's language : fr (iso-8859-1)
  sender's display name : =?iso-8859-1?q?G=E9n=E9rales?=
  (results) From: =?iso-8859-1?q?G=E9n=E9rales_via_Mailman-test?= <...>
test case 2.
  list's language : ja (euc-jp, outgoing messages are encoded to iso-2022-jp)
  sender's language : ja (euc-jp, outgoing messages are encoded to iso-2022-jp)
  sender's display name : =?ISO-2022-JP?B?GyRCRnNMWkx3P04bKEI=?=
  (results) From: =?iso-2022-jp?b?GyRCRnNMWkx3P04bKEIgKE1haWxtYW4tdGVzdCAbJEI3UE0zGyhCKQ==?= <...>
test case 3.
  list's language : en (us-ascii)
  sender's language : en (us-ascii)
  sender's display name : Yasuhito FUTATSUKI
  (results) From: Yasuhito FUTATSUKI via Mailman-test <...>
test case 4.
  list's language : fr (iso-8859-1)
  sender's language : ja (euc-jp, outgoing messages are encoded to iso-2022-jp)
  sender's display name : =?UTF-8?B?5LqM5pyoIOmdluS7gQ==?=
  (results) From: =?utf-8?b?5LqM5pyoIOmdluS7gSB2aWEgTWFpbG1hbi10ZXN0?= <...>
In all of the above, it looks fine.
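
The fallback behaviour relied on above (try the list's charset, fall back to utf-8) belongs to the Python 2 email package that Mailman 2.1 uses. Python 3's Header appears to fall back to utf-8 only when the requested charset is us-ascii and to raise UnicodeEncodeError otherwise, so the following sketch makes the fallback explicit. The function name is hypothetical; uvia and lcs follow the names used in this thread.

```python
from email.charset import Charset
from email.header import Header, decode_header, make_header


def encode_display_name(uvia, lcs):
    """RFC 2047 encode uvia, trying us-ascii, then lcs, then utf-8."""
    for name in ("us-ascii", lcs, "utf-8"):
        charset = Charset(name)
        # The email package may map a charset to a different output codec
        # (for example euc-jp is sent as iso-2022-jp), so test that one.
        codec = charset.output_codec or "us-ascii"
        try:
            uvia.encode(codec)
        except UnicodeEncodeError:
            continue  # Not representable in this charset; try the next.
        return Header(uvia, charset).encode()
    raise AssertionError("utf-8 can encode any unicode string")


def decode_rfc2047(value):
    """Decode an RFC 2047 encoded header value to unicode."""
    return str(make_header(decode_header(value)))
```

Pure ASCII names come back unencoded, names that fit the list's charset stay in that charset, and everything else ends up as utf-8, which matches the pattern of test cases 1, 3 and 4.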

This is an area where there is no one right answer. I have the display name as a unicode, so how do I encode it for the header? I don't think it should ever be encoded in the character set of the poster's language because this is for a message to be sent to all list members, plus there is no guarantee that the poster's display name as encoded by the sending MUA was even encoded in Mailman's charset for the poster's language if the poster is even a member, and there is no guarantee that the translation of the 'via' can even be properly encoded in the charset of the poster's language.
Further, there is no guarantee that the poster's display name can be properly encoded in the charset of the list's preferred language either.
The most reasonable encoding of unicode that guarantees no loss of information is utf-8, and any MUA that recognizes RFC 2047 encodings at all should be able to handle utf-8 encodings.
Even if there are MUAs that can properly decode RFC 2047 encodings in, e.g., iso-2022-jp but not utf-8, I think there are as many problems with trying to encode the original display name in the list's charset as there are with utf-8 encoding. I recognize that what I've done is a compromise, but I think it's as good as any.

I wrote https://bugs.launchpad.net/mailman/+bug/1643210/comments/13 before I saw https://bugs.launchpad.net/mailman/+bug/1643210/comments/12. It looks like your suggestion is good. I'll investigate that.

test case 5.
  list's language : fr (iso-8859-1)
  sender's language : ja (euc-jp, outgoing messages are encoded to iso-2022-jp)
  sender's display name : =?ISO-2022-JP?B?GyRCRnNMWkx3P04bKEI=?=
  (results) From: =?utf-8?b?5LqM5pyo6Z2W5LuBIHZpYSBNYWlsbWFuLXRlc3Q=?= <...>
This is a reasonable result, I think.

I have committed a fix which is essentially https://bugs.launchpad.net/mailman/+bug/1643210/+attachment/4782228/+files/C... with "dn = str(Header(uvia, lcs))" as suggested at https://bugs.launchpad.net/mailman/+bug/1643210/comments/12.
Thanks very much to Yasuhito FUTATSUKI for the report and all the helpful suggestions.
** Changed in: mailman Importance: Undecided => High
** Changed in: mailman Status: In Progress => Fix Committed
** Changed in: mailman Milestone: None => 2.1.24

** Changed in: mailman Status: Fix Committed => Fix Released
Participants (2): Mark Sapiro, Yasuhito FUTATSUKI@POEM