Re: [Mailman-Developers] [Mailman-checkins] SF.net SVN: mailman: [7858] trunk/mailman

Hi Barry,
I spent a few hours playing with the new svn trunk. (I have been a little busy because our academic year begins in April.)
bwarsaw@users.sourceforge.net wrote:
I get this on a fresh install of svn trunk. If you haven't seen it, you may have files left over from an old install.
% bin/mailmanctl start
Traceback (most recent call last):
  File "bin/mailmanctl", line 112, in ?
    from Mailman.Logging.Syslog import syslog
ImportError: No module named Logging.Syslog
Also, if you send SIGHUP to reopen the logs, only the last reopen message is recorded, because each runner tries to reopen the log file. We may have to restart the qrunners when mailmanctl receives SIGHUP and has started new log files. We could also use the backupCount feature for log rotation (introducing LOG_BACKUP_COUNT in Defaults.py).
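For illustration, a minimal sketch of the backupCount idea, written against present-day Python 3 rather than the Python of this thread. LOG_BACKUP_COUNT is the setting proposed above; the logger name, file name, and the tiny max_bytes value are assumptions chosen only to force a few rollovers quickly.

```python
import logging
import logging.handlers
import os
import tempfile

# Hypothetical setting, named after the LOG_BACKUP_COUNT proposed above.
LOG_BACKUP_COUNT = 3


def make_rotating_logger(path, max_bytes=200):
    """Return (logger, handler) writing to path with size-based rotation."""
    logger = logging.getLogger("mailman.sketch.rotating")
    logger.setLevel(logging.INFO)
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=LOG_BACKUP_COUNT)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger, handler


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "qrunner.log")
logger, handler = make_rotating_logger(log_path)

# Write enough lines to trigger more rollovers than we keep backups.
for i in range(100):
    logger.info("message number %d", i)
handler.close()

# Only the live file plus LOG_BACKUP_COUNT backups remain on disk.
rotated = sorted(os.listdir(log_dir))
```

Note that this rotates by size inside one process; it does not by itself address several runners writing to the same file.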
-- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/

On Sun, 2006-04-23 at 08:30 +0900, Tokio Kikuchi wrote:
Try r7871. I think I've fixed this now.
I decided not to use the RotatingFileHandler and leave file rotation to external tools like logrotate. Instead I implemented a subclass of FileHandler that allows for reopening the log files (I wonder why this isn't part of the base FileHandler).
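A minimal sketch of what such a reopenable handler could look like, in present-day Python 3. This is not the code committed in r7871; the class name ReopenableFileHandler, the reopen() method, and the file names are assumptions for illustration, and the rename stands in for what logrotate would do on a POSIX system.

```python
import logging
import os
import tempfile


class ReopenableFileHandler(logging.FileHandler):
    """FileHandler that can reopen its file after an external tool moved it."""

    def reopen(self):
        # Hold the handler lock so no record is emitted mid-switch.
        self.acquire()
        try:
            if self.stream is not None:
                self.stream.close()
            # _open() reopens baseFilename with the original mode/encoding.
            self.stream = self._open()
        finally:
            self.release()


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "mailman.log")

handler = ReopenableFileHandler(log_path)
logger = logging.getLogger("mailman.sketch.reopen")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("before rotation")
os.rename(log_path, log_path + ".1")  # stand-in for logrotate
handler.reopen()                       # e.g. called from a SIGHUP handler
logger.info("after rotation")
handler.close()
```

Without the reopen() call, the second message would keep going to the renamed file, because the handler still holds the old file descriptor.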
One thing we may have to do though is set the log file encoding. What do you think about that?
-Barry

At 11:59 PM -0400 2006-04-23, Barry Warsaw wrote:
One thing we may have to do though is set the log file encoding. What do you think about that?
Log file encoding? I'm not sure I understand what you mean. I can think of a few different ways that could be interpreted, and I don't know for sure that any of them are the meaning you intended to convey.
Could you clarify and/or elaborate?
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
LOPSA member since December 2005. See <http://www.lopsa.org/>.

Brad Knowles wrote:
Well, it should be a mess. :-(
Consider mailman get a spam from a foreign country and caused an error. Mailman may complain UnicodeDecodeError and spew an excerpt containing unknown charset string. This is certainly not printable if no encoding is set, which means only us-ascii is accepted for the log file. Even if you choose the charset for your language (e.g. euc-jp for Japanese), you will still get an error for a Chinese spam.
It may be useful for the log output to use the 'replace' error handler of the encode() method.
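A short sketch of the 'replace' behaviour, in present-day Python 3. The helper name to_log_bytes and the sample string are assumptions for illustration; the point is only that encoding never raises, at the cost of losing the unencodable characters.

```python
def to_log_bytes(text, encoding="us-ascii"):
    """Encode text for the log file, never raising on unencodable input."""
    # 'replace' substitutes '?' for each character the codec cannot encode.
    return text.encode(encoding, "replace")


japanese = "件名: テスト"
ascii_safe = to_log_bytes(japanese)   # each non-ASCII character becomes '?'
plain = to_log_bytes("plain ascii")   # ASCII input passes through unchanged
```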
-- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/

"Tokio" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
Tokio> Consider mailman get a spam from a foreign country and
Tokio> caused an error. Mailman may complain UnicodeDecodeError
Tokio> and spew an excerpt containing unknown charset string.
This really should not happen. Mailman should trap *all* UnicodeDecodeErrors at a very low level. (You simply cannot count on malformed message == SPAM in all contexts yet. E.g., just last week the Mac users here started flaming the Windows-using administration for distributing mojibake.)
Then it should wash the message to make it safe. RFC 2047-encode any 8-bit headers, and use a base64 Content-Transfer-Encoding for any 8-bit message bodies or body parts that don't have a known, approved charset specified. Bonus points for checking that 8-bit body parts with a specified charset actually conform to it.
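A rough sketch of the two washing steps using the standard email package in present-day Python 3. The helper names wash_header and wash_body are hypothetical, and two choices here are assumptions rather than anything settled in this thread: the header is assumed to have already been decoded to Unicode, and a body of unknown charset is relabelled as application/octet-stream.

```python
from email.header import Header, decode_header
from email.mime.application import MIMEApplication


def wash_header(value, charset="utf-8"):
    """Return an ASCII-only, RFC 2047-encoded form of a header value."""
    return Header(value, charset).encode()


def wash_body(raw_bytes):
    """Wrap bytes of unknown charset in a base64-encoded MIME part."""
    # MIMEApplication defaults to application/octet-stream and
    # base64-encodes its payload.
    return MIMEApplication(raw_bytes)


original_subject = "件名: テスト"
subject = wash_header(original_subject)

# Reverse the header encoding to show nothing was lost.
round_trip = "".join(
    chunk.decode(cs) if isinstance(chunk, bytes) else chunk
    for chunk, cs in decode_header(subject))

raw_body = b"\xa4\xb3\xa4\xf3\xa4\xcb"   # 8-bit bytes, charset not declared
part = wash_body(raw_body)
```

Both results are 7-bit clean, so they can be logged or passed on without risk of a further decoding error, and the original bytes remain recoverable.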
Finally, reraise some kind of exception that can be handled at the filtering policy level.
-- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.

On Mon, 2006-04-24 at 17:12 +0900, Stephen J. Turnbull wrote:
The general approach should be that /everything/ gets converted to Unicode at the boundaries of the system. In Mailman 2.1, all the Unicode and i18n stuff was bolted on afterward, which is why we've had so much pain throughout, dealing with Unicode conversions. Ideally, we'd get rid of all that for 2.2 and deal only with Unicode internally.
We may have to make modifications to the email package though, but I'm not sure. It should probably always return Unicode for everything.
That sounds about right. Probably the email package should convert everything to Unicode internally and place Defects on the message objects that have illegal encodings.
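As a sketch of the "convert at the boundary, record defects" idea, in present-day Python 3: the function decode_at_boundary is hypothetical, and its defects list holds plain strings, not the Defect classes of the email package.

```python
def decode_at_boundary(raw_bytes, declared_charset):
    """Return (unicode_text, defects) for bytes entering the system."""
    defects = []
    try:
        text = raw_bytes.decode(declared_charset)
    except (UnicodeDecodeError, LookupError) as exc:
        # Illegal bytes or an unknown charset name: note the problem and
        # fall back to a lossy but safe ASCII decode.
        defects.append("%s: %s" % (type(exc).__name__, exc))
        text = raw_bytes.decode("us-ascii", "replace")
    return text, defects


good = decode_at_boundary("テスト".encode("euc-jp"), "euc-jp")
bad_bytes = decode_at_boundary(b"\xff\xfe", "utf-8")
bad_name = decode_at_boundary(b"abc", "no-such-charset")
```

Everything past this point sees only Unicode text, and the policy layer can inspect the defects to decide what to do with the message.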
-Barry

"BAW" == Barry Warsaw <barry@python.org> writes:
BAW> Ideally, we'd get rid of all that for 2.2 and deal only with
BAW> Unicode internally.
The original encoded stuff should be squirreled away somewhere for debugging and maybe spam detection, though.
BAW> We may have to make modifications to the email package
BAW> though, but I'm not sure. It should probably always return
BAW> Unicode for everything.
That would be my recommendation (modulo preserving the original headers at least, and probably the original body too, for debugging etc).

On Tue, 2006-04-25 at 23:23 +0900, Stephen J. Turnbull wrote:
"BAW" == Barry Warsaw <barry@python.org> writes:
We still have some time to do this in the email package for Python 2.5, but not much. PEP 356 says that Python 2.5 beta 1 is scheduled for June 24th, and after that we'll be feature frozen.
Can we discuss any necessary changes on the email-sig, please?
-Barry

On Mon, 2006-04-24 at 15:19 +0900, Tokio Kikuchi wrote:
Well, it should be a mess. :-(
I'm hoping we can make it less so!
That's probably a good idea. Also, I'm wondering if we should allow users to set the log file encoding in Defaults.py, or whether we should force utf-8, or try to interrogate the system for the encoding value.
Basically, the logger should be as liberal as possible, just in case we let encoding problems slip through (more on that in the next follow up).
-Barry

At 2:25 PM -0400 2006-04-24, Barry Warsaw wrote:
Personally, I think we should default to US-ASCII in the log files, but I can see where some people might want to select a different encoding in mm_cfg.py.
-- Brad Knowles, <brad@stop.mail-abuse.org>

"Brad" == Brad Knowles <brad@stop.mail-abuse.org> writes:
Brad> Personally, I think we should default to US-ASCII in
Brad> the log files, but I can see where some people might want to
Brad> select a different encoding in mm_cfg.py.
I really think the log files should be UTF-8. The point is to make them as ASCII as possible, but if you've got readable garbage that you want to log, it should be readable. People who lack the fonts or whatever wouldn't be able to read it anyway; people who can will be able to convert the UTF-8 to something they can use.
If the garbage doesn't seem to be readable (eg, naked 8-bit crap in the headers) then it should be BASE64'd in the logs.
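A small sketch of that rule in present-day Python 3: log the text if it decodes, otherwise log a base64 rendering of the raw bytes. The function name loggable, the "base64:" prefix, and the use of a successful decode as the test for "readable" are all assumptions for illustration.

```python
import base64


def loggable(raw_bytes, charset="utf-8"):
    """Return text that is safe to write to a UTF-8 log file."""
    try:
        # Readable: the bytes decode cleanly in the given charset.
        return raw_bytes.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Not readable: keep the exact bytes, but in 7-bit form.
        return "base64:" + base64.b64encode(raw_bytes).decode("ascii")


readable = loggable("テスト".encode("utf-8"))
garbage = loggable(b"\xff\xfe")
```

The base64 form keeps the log file valid UTF-8 while still letting someone recover the original bytes for debugging.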

participants (4): Barry Warsaw, Brad Knowles, Stephen J. Turnbull, Tokio Kikuchi