txt.gz character encoding
I am trying to download Pipermail archives from http://lists.xwiki.org/pipermail/users/. They are offered in txt.gz files. I now understand that even though it is not immediately obvious, I can download the uncompressed .txt versions by modifying the URL, and the resulting files are fine. But if I download one of the txt.gz files and unzip it to create a .txt file the results are undecipherable. It looks like a different character encoding was used. The beginning of the unzipped file has the host server's path in clear text at the top (information that is not in the .txt file downloaded directly, BTW), but the rest is gibberish. Is there something special about the process that Pipermail uses to produce the .gz files, or is this something xwiki.org might have changed?
--Gary
Gary Kopp wrote:
Mailman/pipermail creates the .txt.gz files in one of two ways depending on configuration, but both use the same underlying process. In either case, the message being archived is appended to the .txt text file.
In the default case, that's all that's done, but Mailman's cron/nightly_gzip is run overnight to (re)create the .txt.gz file from the .txt file.
If the installation has set GZIP_ARCHIVE_TXT_FILES to a true value in mm_cfg.py, when the message is added to the .txt file, the .txt.gz is (re)created from the .txt file at that time. This involves more overhead than the default but avoids the issue of messages added during a day not being in the .txt.gz file until the next day.
In my case, I avoid both the overhead and the delay issue by just not running cron/nightly_gzip. Then the files served from the archive TOC page are the .txt files as there are no .txt.gz files.
None of the above addresses your question, however. To answer it: whether the gzipping is done on the fly by pipermail, nightly by cron/nightly_gzip, or both, it is done via the Python gzip module, which in turn relies on the Python zlib module to do the actual compression.
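As a rough illustration of what (re)creating a .txt.gz from a .txt with the Python gzip module looks like, here is a minimal sketch. It is not Mailman's actual code; the function name and its arguments are hypothetical and chosen only for this example.

```python
import gzip
import shutil


def gzip_text_file(txt_path, gz_path=None):
    """Write a gzip-compressed copy of txt_path and return the new path.

    Hypothetical helper for illustration only; not taken from Mailman.
    """
    if gz_path is None:
        gz_path = txt_path + ".gz"
    # gzip.open wraps the output file and compresses via zlib as data
    # is written; any existing .gz file is overwritten (recreated).
    with open(txt_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```

Running this once on a plain .txt file gives a normal single-layer .txt.gz. Running the same kind of step on a file that is already gzipped would give the double-compressed result described below.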
It appears that there is something in this process in the xwiki.org installation that actually gzips the file twice.
[msapiro@MSAPIRO ~/Desktop]$ file 2012-July.txt.gz
2012-July.txt.gz: gzip compressed data, from Unix
[msapiro@MSAPIRO ~/Desktop]$ gunzip 2012-July.txt.gz
[msapiro@MSAPIRO ~/Desktop]$ file 2012-July.txt
2012-July.txt: gzip compressed data, was "/var/lib/mailman/archives/private/users/2012-July.txt", last modified: Fri Jul 27 20:27:03 2012, max compression
I.e., it appears that /var/lib/mailman/archives/private/users/2012-July.txt was compressed by gzip with its (default) --name option and then the result was gzipped again.
You can recover the original .txt file from the .txt.gz file in this case by, e.g.
[msapiro@MSAPIRO ~/Desktop]$ gunzip 2012-July.txt.gz
[msapiro@MSAPIRO ~/Desktop]$ mv 2012-July.txt 2012-July.txt.gz
[msapiro@MSAPIRO ~/Desktop]$ gunzip --no-name 2012-July.txt.gz
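The same recovery could also be scripted in Python. The following is only a sketch: the function name and the max_layers safety limit are assumptions made for this example. It treats data as gzipped whenever it starts with the two-byte gzip magic number (0x1f 0x8b).

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def gunzip_all_layers(data, max_layers=5):
    """Decompress data repeatedly while it still looks gzipped.

    Returns (decompressed_bytes, number_of_layers_removed).
    Hypothetical helper; max_layers is an arbitrary safety limit.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers
```

To use it, read the downloaded 2012-July.txt.gz in binary mode, pass the bytes to gunzip_all_layers, and write the returned bytes to 2012-July.txt. For a file from the xwiki.org archive as described above, the layer count would be expected to be 2.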
This situation is specific to the xwiki.org installation.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
participants (2)
- Gary Kopp
- Mark Sapiro