Ben Gertzfield firstname.lastname@example.org writes:
> Does anyone have any comments?
I agree that the message catalogs should use the preferred encoding of the language, and not HTML entity or character references. There are a few issues to double-check before going forward with that, though:
- For this to work, Mailman needs to properly declare the encoding of each generated HTML page, and the declaration needs to match the actual content. For Latin-1, this is not strictly necessary, since that is the default encoding of HTML anyway. However, there may be plans to move to XHTML some day, at which time even this assumption breaks.
- Problems will arise if Mailman inserts strings from various sources into the same template, especially if these use different encodings. If that can ever happen, you need to recode all strings to the same encoding. If that fails (e.g. because the encoding is unknown, or because the string cannot be represented in the encoding), HTML entities may be your only option. Please have a look at
This document is encoded in ISO-8859-9 (for Turkish), but it still contains French accents. Using entities is the only choice here, short of using UTF-8 for the entire page.
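To make the two points above concrete, here is a rough sketch (not Mailman code) of recoding strings from different sources into one page encoding, declaring that encoding in the page, and falling back to numeric character references for anything the page encoding cannot represent. The function names `recode_fragment` and `build_page` are hypothetical, the syntax is Python 3, and the fallback relies on the standard `xmlcharrefreplace` error handler.

```python
def recode_fragment(raw, source_encoding, page_charset):
    """Decode bytes from their source encoding, then re-encode for the page.

    Characters the page charset cannot represent become numeric character
    references (e.g. &#339;) instead of raising UnicodeEncodeError.
    Note: this does not escape markup characters such as & or <.
    """
    text = raw.decode(source_encoding)
    return text.encode(page_charset, "xmlcharrefreplace")


def build_page(fragments, page_charset):
    """Join (bytes, encoding) fragments into one page in a single charset."""
    # The declaration must name the same charset the body is encoded in.
    head = (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=%s"></head><body>' % page_charset
    ).encode("ascii")
    body = b" ".join(
        recode_fragment(raw, enc, page_charset) for raw, enc in fragments
    )
    return head + body + b"</body></html>"


# Hypothetical inputs: a Turkish string stored as ISO-8859-9 and a French
# string stored as UTF-8, both inserted into an ISO-8859-9 page.
turkish = "Hoş geldiniz".encode("iso-8859-9")
french = "cœur".encode("utf-8")
page = build_page([(turkish, "iso-8859-9"), (french, "utf-8")], "iso-8859-9")
```

The ligature in the French string has no ISO-8859-9 code point, so it comes out as `&#339;`, while the Turkish text is emitted directly in the page encoding. If the source encoding of a fragment is unknown, `decode` has nothing reliable to work with, which is exactly the case where tracking the encoding through the whole chain matters.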
In short, using the language's preferred encoding requires Mailman to carefully track the encoding of the message through its entire processing chain. If the encoding is supported by the codecs library, an alternative would be to use ugettext (so that the encoding is implied by the string being a Unicode object).
Unfortunately, not all encodings used in Mailman are supported (the East Asian ones are missing). In general, I'd encourage usage of Unicode throughout Mailman, even if this means that additional codecs must be bundled with the distribution.
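As an illustration of what "Unicode throughout" could look like, here is a minimal sketch, assuming Python 3 syntax (where `str` is the Unicode type). It checks codec availability with `codecs.lookup`, decodes at the input boundary, and encodes only once at output. The helpers `codec_available` and `to_text` are hypothetical names, not part of Mailman.

```python
import codecs


def codec_available(name):
    """Return True if this Python installation can look up the named codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def to_text(raw, encoding):
    """Decode at the input boundary; everything downstream handles text only."""
    if not codec_available(encoding):
        # This is the case where a codec would have to be bundled.
        raise LookupError("no codec available for %r" % encoding)
    return raw.decode(encoding)


# Hypothetical inputs arriving in two different encodings.
subject = to_text("Liste üyeliği".encode("iso-8859-9"), "iso-8859-9")
greeting = to_text("Bienvenue, cœur".encode("utf-8"), "utf-8")

# Internally, strings combine freely because they carry no byte encoding.
page_text = subject + " / " + greeting

# Encode exactly once, when the page is written out.
page_bytes = page_text.encode("utf-8")
```

With this arrangement the encoding only needs to be known at the two boundaries, rather than tracked alongside every byte string in between.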