Mailman3: decorating non-ascii message templates
Hey all,
I first thought about opening a bug for this, but I think it needs a small discussion first. In Mailman3 the message templates are stored on disk (welcome.txt, footer-generic.txt, ...). However, unless I missed something, there is no hint about the encoding of those files. As a result, when I try the decorate() function (from mailman.handlers.decorate) on a non-ASCII file, it crashes with a classic UnicodeDecodeError: 'ascii' codec can't decode byte [...]. Note: if the file contains no string to replace (like $fqdn_listname), it is passed through unchanged and it works.
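For readers who want to reproduce the symptom outside Mailman, here is a minimal sketch in Python 3. It assumes a UTF-8 template and uses the standard library's string.Template as a stand-in for the substitution step; the template text and the helper names (decode_as_ascii, expand_utf8) are made up for illustration and are not Mailman code.

```python
from string import Template

# Hypothetical template contents, as they might sit on disk in UTF-8.
raw = "Bienvenue sur la liste $fqdn_listname, à bientôt !".encode("utf-8")


def decode_as_ascii(data):
    """Decode the bytes as ASCII; return the error message, or None on success."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as error:
        return str(error)
    return None


def expand_utf8(data, **substitutions):
    """Decode explicitly as UTF-8 first, then substitute the placeholders."""
    return Template(data.decode("utf-8")).safe_substitute(substitutions)


ascii_error = decode_as_ascii(raw)
expanded = expand_utf8(raw, fqdn_listname="test@example.com")
print(ascii_error)
print(expanded)
```

The first call fails the same way the report describes, while decoding explicitly before substitution goes through.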
So how should we deal with this? I think that the TemplateLoader in mailman.app.templates should return unicode strings, because that's the closest to the moment when files are read, and unicode conversion should happen on the "external borders" of the application. Thus the TemplateLoader's get() method seems to be the right place. I see two options:
1. We require that all template files are stored in either ascii or utf-8. That's the easiest way to go, and we just decode the text after getting the file.
2. We use the fact that our Language entities contain encoding values. When the template is loaded from an internal URL containing the language, we add the corresponding encoding to the result metadata (to be retrieved with the info() method) and use that to decode the contents. This means that templates in non-localized directories still have to be ascii-only, and the same goes for templates retrieved from non-internal URLs (not starting with mailman://). It's more complex but it may seem more natural to the administrator, since he won't have to force UTF-8 encoding when editing a file. We must also think of not making it too hard for Postorius, which will probably only get UTF-8 posted from the webpage (since it's always displayed in UTF-8 IIRC).
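The two options can be sketched roughly as follows. This is only an illustration under stated assumptions: LANGUAGE_CHARSETS is a hypothetical stand-in for the charset stored on the Language entities, the language code is passed in explicitly rather than parsed from the URL, and neither function is the real TemplateLoader API.

```python
# Hypothetical stand-in for the charset values held by the Language entities.
LANGUAGE_CHARSETS = {"en": "us-ascii", "fr": "iso-8859-1"}


def decode_option_1(raw):
    """Option 1: every template must be ASCII or UTF-8, so always decode as UTF-8."""
    return raw.decode("utf-8")


def decode_option_2(raw, url, language=None):
    """Option 2: use the language's charset for internal URLs, ASCII otherwise.

    Returns the decoded text plus a metadata dict, loosely mirroring the idea
    of exposing the encoding through info().
    """
    if url.startswith("mailman://") and language in LANGUAGE_CHARSETS:
        charset = LANGUAGE_CHARSETS[language]
    else:
        # Non-localized or external templates have to stay ASCII-only.
        charset = "ascii"
    return raw.decode(charset), {"charset": charset}


text = "Bienvenue à tous"
option_1_text = decode_option_1(text.encode("utf-8"))
# The URL below is a made-up example of an internal URL.
option_2_text, option_2_info = decode_option_2(
    text.encode("iso-8859-1"), "mailman:///list/fr/welcome.txt", "fr")
```

With option 2, the same non-ASCII bytes fetched from an external URL would raise UnicodeDecodeError, which is the restriction described above.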
I would vote for the easy way for just requiring UTF-8 encoding, but I'd like to hear your thoughts on this.
Thanks, Aurélien
http://aurelien.bompard.org ~~~~~~ xmpp:aurelien@bompard.org Tell me and I will forget. Show me and I will remember. Involve me and I will understand. -- Chinese proverb
On Oct 14, 2013, at 03:42 PM, Aurélien Bompard wrote:
> I first thought about opening a bug for this, but I think it needs a small discussion first.
Thanks for the thorough investigation!
Completely agreed. We need to convert to unicode at the edges and treat strings internally as unicode. I want MM3 to eventually be a Python 3 application, so we'll have to do this anyway. I'm sure it'll be painful, but let's start now in any way we can.
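As a rough illustration of converting at the edges, the sketch below decodes a file exactly once, when it is read, so everything downstream only sees text. It assumes UTF-8 on disk; read_template and write_demo_template are hypothetical helpers, not Mailman functions. io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def read_template(path, encoding="utf-8"):
    """Decode once, at the boundary; callers only ever receive text."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_demo_template(text):
    """Write a throwaway UTF-8 template file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with io.open(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


path = write_demo_template(u"Grüße von $fqdn_listname\n")
try:
    footer = read_template(path)
finally:
    os.remove(path)
```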
I would vote for UTF-8 for all files, internal or external, but maybe there are some languages for which this will cause problems. We do have a charset variable in the config file for languages, but I wonder if we'll actually use anything other than UTF-8. Note that some of the charsets in MM2.1 are not UTF-8, but I'm not sure if any of them are UTF-8 incompatible. MM3 only defines a setting for USA English by default, and that's currently us-ascii, but maybe even that should be UTF-8.
So I guess unless we can actually identify languages that would be harmed by UTF-8, we should just require that. Maybe Steve can weigh in on the issue.
-Barry
Barry Warsaw writes:
+1
> I would vote for UTF-8 for all files, internal or external, but maybe there are some languages for which this will cause problems.
I don't think there are, any more. There are some programmers and translators for whom it will be annoying, and maybe the Chinese government will refuse to use Mailman (they have some requirement that software use GB18030, which is Red Unicode).
(BTW, I've always been in favor of making UTF the default for programs that use text files, all the way back to PEP 263.)
> We do have a charset variable in the config file for languages,
Which is useless for the languages where we would actually care because there are multiple charsets in common use in those languages, and any setting will surely screw up some users.
> Note that some of the charsets in MM2.1 are not UTF-8, but I'm not sure if any of them are UTF-8 incompatible.
I'll take a look but it will have to be after I get back from the Mentor Summit. Wish you were here!
Steve
participants (3): Aurélien Bompard, Barry Warsaw, Stephen J. Turnbull