Non-ascii characters missing from Pipermail archive txt and gz downloads

Apologies for double posting. I sent this to the MM3 list by mistake earlier.
Mailman 2.1.34, Debian 10, Postfix
Hi
I'm hoping someone can shed light on a character encoding issue I've encountered.
A plain-text email with non-ascii characters in the body gets posted to the list.
As per Mark Sapiro's guide I've captured the incoming message to file.
The message is received by Mailman with the non-ascii characters displaying correctly.
The header of that message has:
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: en-AU
Content-Transfer-Encoding: 8bit
In the list's mbox file and archive webpage, the message displays the non-ascii characters correctly.
In the archive's downloaded .txt (and also .gz) file, the non-ascii characters are missing and displayed as "?".
I've copied the message text below, both the correct version from the email and the erroneous version from the .txt file. Hopefully they won't get scrambled when I send this.
Any advice on getting the non-ascii characters written into the archive .txt file would be gratefully received.
Thanks, Mark
=== Message text as okay in mbox and as shown on the archive webpage ===
If one goes by the definition of veḷippaṭai as given in the Tamil Lexicon that the meaning of an ambiguous word should be disambiguated by a qualifying word, then aruvi āmpal does not conform to that definition since in the case of aruvi āmpal in Patiṟṟuppattu 63, aruvi is really made up of aru+vi, a compound. Moreover, the expression aṭai aṭuppu aṟiyā is already there to clarify that āmpal is a number and not a flower. Thus, aruvi simply provides information in addition to aṭai aṭuppu aṟiyā that āmpal is not a flower. The modern commentator Aruḷampalavaṉār also does not call it veḷippaṭai.
===
=== Message text with missing characters in the archive's txt and gz downloads ===
If one goes by the definition of ve?ippa?ai as given in the Tamil Lexicon that the meaning of an ambiguous word should be disambiguated by a qualifying word, then aruvi ?mpal does not conform to that definition since in the case of aruvi ?mpal in Pati??uppattu 63, aruvi is really made up of aru+vi, a compound. Moreover, the expression a?ai a?uppu a?iy? is already there to clarify that ?mpal is a number and not a flower. Thus, aruvi simply provides information in addition to a?ai a?uppu a?iy? that ?mpal is not a flower. The modern commentator Aru?ampalava??r also does not call it ve?ippa?ai.
===

On 4/9/21 5:55 AM, Mark Dale via Mailman-Users wrote:
In the archive's downloaded .txt (and also .gz) file, the non-ascii characters are missing and displayed as "?".
...
Any advice on getting the non-ascii characters written into the archive .txt file would be gratefully received.
The message is prepared for the .txt file by the Article.as_text()
method in HyperArch.py
<https://bazaar.launchpad.net/~mailman-coders/mailman/2.1/view/head:/Mailman/...>.
In order to do the email address obfuscation in the message body,
whether or not ARCHIVER_OBSCURES_EMAILADDRS is True, the method first
converts the body to unicode using the charset of the list's language
and then after possible obfuscation, converts it back, again using the
charset of the list's language. Both these conversions use errors=replace,
which replaces any characters not in the charset with, in the case of
ascii, "?".
One way to avoid this replacement would be to change the charset for English from ascii to utf-8. See <https://wiki.list.org/x/15958250>.
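The replacement behaviour described above can be reproduced in a few lines of Python 3. This is only an illustrative sketch, not the actual HyperArch.py code (Mailman 2.1 runs on Python 2). It assumes the body is already unicode text when it is converted back to the list's charset, and the helper name to_list_charset is hypothetical.

```python
# Illustrative sketch only; not the Mailman source.
# The word is written with escapes: U+1E37 is "l with dot below",
# U+1E6D is "t with dot below".
word = "ve\u1e37ippa\u1e6dai"


def to_list_charset(text, charset):
    """Encode text to the list charset, replacing unencodable characters."""
    return text.encode(charset, errors="replace")


as_ascii = to_list_charset(word, "ascii")   # non-ascii characters become "?"
as_utf8 = to_list_charset(word, "utf-8")    # every character survives

print(as_ascii)                          # b've?ippa?ai'
print(as_utf8.decode("utf-8") == word)   # True
```

With an ascii list charset each non-ascii character collapses to a single "?", which matches the damaged sample in the original post. With utf-8 nothing needs replacing.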
This isn't a complete solution in the case where the non-ascii
characters are encoded other than utf-8, e.g., iso-8859-1, in the
original message, but will probably handle most cases.
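The caveat about other encodings can be sketched the same way. This assumes (the thread does not confirm the detail) that the body reaches the first conversion as undecoded bytes. In that case iso-8859-1 bytes are not valid utf-8, so decoding them with a utf-8 list charset and errors=replace yields the Unicode replacement character rather than the original letter.

```python
# Illustrative Python 3 sketch; the byte string is a made-up example.
latin1_body = "Fran\u00e7ois".encode("iso-8859-1")   # b'Fran\xe7ois'

# Decoding with the wrong charset: 0xE7 is not a complete utf-8 sequence,
# so it is replaced with U+FFFD (the replacement character).
decoded = latin1_body.decode("utf-8", errors="replace")

# Decoding with the charset the message was really written in works.
correct = latin1_body.decode("iso-8859-1")

print(decoded == "Fran\ufffdois")   # True
print(correct == "Fran\u00e7ois")   # True
```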
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-------- Original Message -------- From: Mark Sapiro [mailto:mark@msapiro.net] Sent: Friday, April 9, 2021, 19:07 UTC
Hi Mark,
Thank you for the comprehensive explanation of the process.
I haven't made any headway with the suggested solution of modifying the mm_cfg.py file.
The author says: "The one known downside of doing this is that Python's email library which is used by Mailman will base64 encode charset=utf-8 message bodies which makes the raw message body impossible to read by eye or search with simple tools like grep." -- which, on reading, had me thinking I will be jumping from the frying pan into the fire.
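The base64 behaviour the wiki author mentions can be seen with the standard library. A minimal sketch, assuming Python 3's email package; Mailman 2.1 uses the Python 2 version, which the quoted text says behaves the same way for utf-8 bodies. The message text is a made-up example.

```python
import base64
from email.mime.text import MIMEText

# Build a utf-8 text message with the standard library (Python 3).
msg = MIMEText("Fran\u00e7ois posted to the list", "plain", "utf-8")

print(msg["Content-Transfer-Encoding"])   # base64

# The raw payload is base64 text, so a grep for "Fran" would not match it.
raw_payload = msg.get_payload()

# Decoding recovers the original text unchanged.
recovered = base64.b64decode(raw_payload).decode("utf-8")
```

The content is intact; it is only the raw stored form that stops being readable by eye.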
However, in the spirit of things, I made the addition to the mm_cfg.py and ...
As an example, using a subscriber's name that appears in the archive:
François -- as seen in the mbox and Pipermail web archive: the cedilla is displayed correctly.
Fran?ois -- as seen in the normal downloaded txt: the cedilla is replaced by a question mark (as expected).
François -- as seen in the mm_cfg modified download txt: the cedilla is replaced by odd characters.
In short, no joy.
So I'm thinking that if the part of HyperArch.py that does the email address obfuscation (and back again) is removed, would that be a step in the direction I want to go?
My Python foo is way less than zero but I'm looking at lines 563 -- 600. Or is my thinking completely bonkers?
Regards, Mark

On 4/19/21 10:43 PM, Mark Dale via Mailman-Users wrote:
François -- as seen in the mm_cfg modified download txt: the cedilla is replaced by odd characters.
How are you viewing the .txt file? The two bytes C3 A7 are the utf-8 representation of the c-cedilla character. If you view that file as iso-8859-1 (latin-1 or western) encoding, you will see those two bytes as ç, but if you view it as utf-8 encoding, you will see the c-cedilla.
In short, the file contains just what it should, but there is a Content-Transfer-Encoding issue. If you are viewing it in a browser, the issue is the default content character set in your web server. For example with Apache something like
AddCharset utf-8 .txt
will do what you want, or perhaps your browser has a selection. E.g., Firefox has a text encoding selection in the View menu and you want Unicode, not Western.
If you are actually downloading the file and viewing it with something else, the issue is with whatever you are viewing it with.
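The C3 A7 point is easy to check in Python 3. A minimal sketch, using the subscriber's name from the thread as sample data: the same bytes are decoded once as utf-8 and once as iso-8859-1.

```python
name = "Fran\u00e7ois"            # c-cedilla is U+00E7

raw = name.encode("utf-8")        # what the .txt file contains
print(raw)                        # b'Fran\xc3\xa7ois'

# Same bytes, two interpretations.
viewed_as_utf8 = raw.decode("utf-8")          # the c-cedilla, as intended
viewed_as_latin1 = raw.decode("iso-8859-1")   # two characters: U+00C3 U+00A7

print(viewed_as_utf8 == name)                      # True
print(viewed_as_latin1 == "Fran\u00c3\u00a7ois")   # True
```

The bytes never change; only the viewer's choice of encoding does.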
In short, no joy.
So I'm thinking that if the part of HyperArch.py that does the email address obfuscation (and back again) is removed, would that be a step in the direction I want to go?
My Python foo is way less than zero but I'm looking at lines 563 -- 600. Or is my thinking completely bonkers?
That won't help. As I said, the file is now correct and no unrecognized characters have been replaced, so modifying that code by, say, deleting lines 587-599 won't change anything.
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-------- Original Message -------- From: Mark Sapiro [mailto:mark@msapiro.net] Sent: Tuesday, April 20, 2021, 18:55 UTC
... Uh-oh ... <sheepish grin> ... you're right that the issue is what I'm viewing it with.
Just to clarify: there are two .txt files ...
(A) an archive .txt.gz file before I made the change to the mm_cfg.py file; and (B) an archive .txt.gz file after I made the change.
I followed the txt.gz link on the Pipermail page and got the options to "Download" or "Open file".
FAIL -- On choosing the "Open file" with ArchiveManager/JEdit, File-A showed the c-cedilla replaced by the question-mark; and File-B showed it replaced by the ç characters.
SUCCESS -- However, choosing "Download", then gunzip and then open with JEdit I get a better result: File A showed the c-cedilla replaced by the question-mark as expected; but File-B shows the c-cedilla (happy days!).
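One way to take the viewing tool out of the picture is to read the downloaded file with an explicit encoding. A minimal Python 3 sketch, assuming a hypothetical file name and using a small stand-in file rather than a real archive:

```python
import gzip
import os
import tempfile

# Create a small stand-in for a downloaded archive file (hypothetical name).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "2021-April.txt.gz")
with gzip.open(path, "wb") as f:
    f.write("From: Fran\u00e7ois\n".encode("utf-8"))

# Read it back, stating the encoding explicitly instead of relying on
# whatever default the viewing tool or platform would pick.
with gzip.open(path, "rt", encoding="utf-8") as f:
    as_utf8 = f.read()

# The same file read with the wrong encoding shows the "odd characters".
with gzip.open(path, "rt", encoding="iso-8859-1") as f:
    as_latin1 = f.read()
```

If the utf-8 read shows the expected characters, the file is fine and any garbling comes from the viewer.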
So in short, Mark Sapiro's recommended fix -- https://wiki.list.org/x/15958250 -- has cracked this little chestnut.
Many thanks once again Mark.

On 4/20/21 5:20 PM, Mark Dale via Mailman-Users wrote:
Just to clarify: there are two .txt files ...
(A) an archive .txt.gz file before I made the change to the mm_cfg.py file; and (B) an archive .txt.gz file after I made the change.
Slightly off topic, but after the cron/nightly_gzip job runs, the .txt.gz file will be updated with the contents from the .txt file.
However, the point of this post is to point out that the .txt.gz files are an anachronism from the days when the bit of bandwidth saved by delivering a compressed version was important to more than a few ancient curmudgeons like me.
These days, the bandwidth savings is unimportant and is probably offset by the redundant storage and processing for the .txt.gz files.
If you want to get rid of these files, see <https://wiki.list.org/x/17892086>.
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-------- Original Message -------- From: Mark Sapiro [mailto:mark@msapiro.net] Sent: Wednesday, April 21, 2021, 01:53 UTC
Thank you Mark, that information is appreciated and I've made the change.

A bit OT, I'm glossing Mark Sapiro's explanation of compressed file handling in Mailman archive downloads.
Mark Dale via Mailman-Users writes:
Thank you Mark, that information is appreciated and I've made the change.
I'm glad you find it useful. Note that the story is a little more subtle than Mark Sapiro makes it here:
However, the point of this post is to point out that the .txt.gz files are an anachronism from the days when the bit of bandwidth saved by delivering a compressed version was important to more than a few ancient curmudgeons like me.
These days, the bandwidth savings is unimportant and is probably offset by the redundant storage and processing for the .txt.gz files.
In fact, most modern systems will negotiate compressed streams, so if you provide a .txt to your webserver, the client will tell the server "hey, I know how to gunzip", the server will automatically gzip, the client gunzip, and you know nothing about it except that you have text onscreen.
It's unclear what the system will do if offered a .txt.gz file. If the server is smart, it might say
Content-Type: text/plain; name=whatever.txt <-- note: no .gz
Content-Transfer-Encoding: gzip
and the end result is as above. But it's not obviously a good idea for the server to second-guess the admin that way.
It's plausible that if the server just sends it as a binary, the client will say, "oh, they gzipped it on purpose, I should treat it as a file and save it", or it might say, "I know what a .txt is", and go ahead and transparently ungzip it. Clients are reliably unreliable as a class -- some users want Do What I Mean, some want Do What I Say, and different clients will cater to different users.
Bottom line: if you're sure you want your .txt files treated as plain text and displayed as conveniently as possible, ungzip them! Very likely you won't use any more bandwidth (and by the way, modern servers tend to cache that gzipped blob in case somebody asks for it again, so on-the-fly compression doesn't necessarily waste hours of CPU).
If for some reason you'd prefer that they be gzipped at both ends, that's probably more work to guarantee.
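The negotiation described above can be modelled in a few lines. This is a toy Python 3 sketch, not how any particular web server is implemented; the function name respond is hypothetical. In HTTP the client advertises support in the Accept-Encoding request header and the server labels a compressed response with Content-Encoding.

```python
import gzip


def respond(body, accept_encoding):
    """Return (headers, payload) for a plain-text body.

    Toy model of on-the-fly compression; real servers also handle
    q-values, caching, and many other details.
    """
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    offered = [token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")]
    if "gzip" in offered:
        headers["Content-Encoding"] = "gzip"
        return headers, gzip.compress(body)
    return headers, body


text = ("archive text\n" * 100).encode("utf-8")

# A client that says it can gunzip gets a compressed stream.
headers, payload = respond(text, "gzip, deflate")

# A client that says nothing gets the bytes as they are.
plain_headers, plain_payload = respond(text, "")
```

Either way the client ends up with the same text, which is why serving the plain .txt usually costs little extra bandwidth.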
Steve

-------- Original Message -------- From: Stephen J. Turnbull [mailto:turnbull.stephen.fw@u.tsukuba.ac.jp] Sent: Wednesday, April 21, 2021, 18:28 UTC
Thanks very much for this Steve. The learning from you guys never stops :-)
And "unzip" was the pick of the day.
Best, Mark

Mark Sapiro writes:
In short, the file contains just what it should, but there is a Content-Transfer-Encoding issue.
Technical niggle, probably not relevant to the issue itself:
The charset parameter is an attribute of Content-Type. Content-Transfer-Encoding should be transparent to this problem.
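The distinction can be seen by parsing headers like the ones in the original post with Python 3's standard email package. A minimal sketch; the message body is a made-up placeholder.

```python
from email import message_from_string

raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "body text\n"
)
msg = message_from_string(raw)

# The charset is a parameter of the Content-Type header ...
charset = msg.get_content_charset()

# ... while Content-Transfer-Encoding is a separate header that only says
# how the body bytes are wrapped for transport (7bit, 8bit, base64, ...).
cte = msg["Content-Transfer-Encoding"]

print(charset, cte)   # utf-8 8bit
```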
Steve
participants (3): Mark Dale, Mark Sapiro, Stephen J. Turnbull