HTML entities (&eacute;) in es, it, no translations
Working on sending MIME emails through Mailman, I noticed that some of the translations are inconsistent in how they use HTML entity escapes. This becomes a problem when sending email. An example from the Spanish translation:

    #: Mailman/Cgi/create.py:221 bin/newlist:204
    msgid "Your new mailing list: %(listname)s"
    msgstr "Su nuebva lista de distribuci&oacute;n: %(listname)s"

This is a real problem, because this string is sent literally -- with the string "&oacute;" -- as the subject of the new email message.

I looked in the HTML 4.01 standard and found that HTML entities are actually only intended to be used when the document's character set does not support that particular character. http://www.w3.org/TR/html401/charset.html has more information on this.

Since Mailman's CGI interface (in almost all cases) sends the correct charset in the Content-Type header, I think it's not necessary to use HTML entity escapes in the gettext catalog files. In fact, when we do use escapes, it makes text emails generated by Mailman illegible.

Does anyone have any comments? I would like to go through the catalogs and change the HTML escapes back into the original characters, so that emails Mailman generates are correct again. The CGI interface will still work as before.

Here is a first guess at which translations include HTML escapes besides &lt; &gt; and &nbsp;:

    [ben@nausicaa:~/src/mailman/mailman/messages]% egrep '&[^;]+;' **/*.po | egrep -v '&nbsp;|&lt;|&gt;' | cut -d : -f 1 | uniq
    es/LC_MESSAGES/mailman.po
    it/LC_MESSAGES/mailman.po
    no/LC_MESSAGES/mailman.po

So, the changes would only actually apply to the Spanish, Italian, and Norwegian translations. The rest of the translations are correctly in their original character sets.

Ben

--
Brought to you by the letters H and G and the number 18.
"To Perl, or not to Perl, that is the kvetching."
Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/
Ben Gertzfield wrote:
...Does anyone have any comments? I would like to go through the catalogs and change the HTML escapes back into the original characters, so that emails Mailman generates are correct again...
Please, don't change any entity. I will do that. My changes to the Spanish catalog will be finished next week, and then I will send them to Barry.

Cheers

--
Juan Carlos Rey Anaya (jcrey@uma.es)
Servicio Central de informática
Universidad de Málaga - España
"Juan" == Juan Carlos Rey Anaya <jcrey@uma.es> writes:
    Ben> ...Does anyone have any comments? I would like to go through
    Ben> the catalogs and change the HTML escapes back into the
    Ben> original characters, so that emails Mailman generates are
    Ben> correct again...

    Juan> Please, don't change any entity. I will do that. My changes
    Juan> on spanish catalog will be finished next week and then I
    Juan> will send them to Barry.

Great! Thanks very much, Juan.

Ben
Ben Gertzfield <che@debian.org> writes:
Does anyone have any comments?
I agree that the message catalogs should use the preferred encoding of the language, and not HTML entity or character references. There are a few issues to double-check before going forward with that, though:

- For this to work, Mailman needs to properly declare the encoding of each generated HTML page, and the declaration needs to match the actual content. For Latin-1, this is not strictly necessary, since that is the default encoding of HTML anyway, but there may be plans to move to XHTML some day, at which time even this assumption breaks.

- Problems will arise if Mailman inserts strings from various sources into the same template, especially if these use different encodings. If that can ever happen, you need to recode all strings to the same encoding. If that fails (e.g. because the encoding is unknown, or because the string cannot be represented in the encoding), HTML entities may be your only option. Please have a look at

  http://www2.iro.umontreal.ca/~pinard/po/registry.cgi?team=tr

  This document is encoded in ISO-8859-9 (for Turkish), but it still contains French accents. Using entities is the only choice here, short of using UTF-8 for the entire page.

In short, using the language's preferred encoding requires Mailman to carefully track the encoding of the message through its entire processing chain.

If the encoding is supported by the codecs library, an alternative would be to use ugettext (so that the encoding is implied by the string being a Unicode object). Unfortunately, not all encodings in Mailman are supported (the East Asian ones are missing). In general, I'd encourage usage of Unicode throughout Mailman, even if this means that additional codecs must be bundled with the distribution.

Regards,
Martin
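The fallback described above, where character references are used only for what the page encoding cannot represent, can be illustrated with Python's standard `xmlcharrefreplace` error handler. This is a sketch assuming the text is already held as Unicode; `encode_for_page` is a hypothetical helper name, not a Mailman function.

```python
def encode_for_page(text, page_charset):
    """Encode text for an HTML page in page_charset.

    Characters the charset cannot represent become numeric character
    references such as &#339; instead of raising UnicodeEncodeError.
    """
    return text.encode(page_charset, errors="xmlcharrefreplace")


# A Turkish letter that ISO-8859-9 has, and a French ligature it lacks.
turkish_page = encode_for_page("ş and œ", "iso-8859-9")
```

The result is only valid inside HTML, so the same string could not be reused as-is for a plain-text email.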
"Martin" == Martin von Loewis <loewis@informatik.hu-berlin.de> writes:
    Martin> - for this to work, Mailman needs to properly declare the
    Martin> encoding of each generated HTML page, and the declaration
    Martin> needs to match the actual content. For Latin-1, this is
    Martin> not strictly necessary, since that is the default encoding
    Martin> of HTML, anyway, but there may be plans to move to XHTML
    Martin> some day, at which time even this assumption breaks.

Actually, to be precise, HTML 4.01's native encoding is Unicode, which Latin-1 happens to be a (very small) subset of.

    Martin> - Problems will arise if Mailman inserts strings from
    Martin> various sources into the same template, especially if
    Martin> these use different encodings. If that can ever happen,
    Martin> you need to recode all strings to the same encoding. If
    Martin> that fails (e.g. because the encoding is unknown, or
    Martin> because the string cannot be represented in the encoding),

Right now, I don't think Mailman does that anywhere. If it does, I think the best thing to do is to convert to Unicode. Unfortunately, as much as I'd like, we can't make *everything* Unicode, because a lot of older browsers still don't support it.

    Martin> This document is encoded in ISO-8859-9 (for Turkish);
    Martin> but it still contains French accents. Using entities is
    Martin> the only choice here, short of using UTF-8 for the entire
    Martin> page.

Yes. This kind of issue will come up only in two places in Mailman:

1) on the admin request page (for bounce handling, etc)
2) in the archives (a pipermail issue)

    Martin> Unfortunately, not all encodings in mailman are supported
    Martin> (the East Asian ones are missing). In general, I'd
    Martin> encourage usage of Unicode throughout in mailman, even if
    Martin> this means that additional codecs must be bundled with the
    Martin> distribution.

Which East Asian ones are missing? Mailman CVS works beautifully for me with Japanese, and the screenshot I sent earlier today shows Chinese (both simplified and traditional) working in email.

Barry and I have talked a lot about bundling codecs with Mailman, and he's agreed with me that we need to do it. The Japanese codec is in a good state and will be easy enough to ship; the Chinese ones are only available in CVS that I know of, so we will need to make a proper distribution.

Ben
Ben Gertzfield <che@debian.org> writes:
Actually, to be precise, HTML 4.01's native encoding is Unicode, which Latin-1 happens to be a (very small) subset of.
To be really precise, HTML 4.01's "document character set" is the "Universal Character Set" (as defined in ISO 10646), see http://www.w3.org/TR/html4/charset.html What the character encoding is is a different matter (Unicode is not a character encoding); that is transmitted as part of the HTTP response. As the document above points out, the default encoding, if none is specified, is Latin-1 (they also point out that it is bad to rely on that).
Unfortunately, as much as I'd like, we can't make *everything* Unicode, because a lot of older browsers still don't support it.
That is completely irrelevant; Unicode is *not* a character encoding. In this context, it is a Python internal datatype. When producing an HTML document, strings of that type need to be encoded in the target document encoding (which definitely will *not* be Unicode, but perhaps a Unicode encoding, such as UTF-8, or some other encoding).
Which East Asian ones are missing? Mailman CVS works beautifully for me with Japanese, and the screenshot I sent earlier today shows Chinese (both simplified and traditional) working in email.
Python does not currently include codecs for iso-2022-jp, gb2312, big5, euc-jp, shift-jis. Since Mailman leaves all strings as-is, and never mixes encodings, it can let them pass through unmodified. There are a number of pitfalls, though:

- On mailing lists, people may use different encodings; some of the common combinations might be:

    European languages: ISO-8859-1, ISO-8859-15 (for the Euro), UTF-8
    Japanese: ISO-2022-JP, eucJP, shift-jis, UTF-8
    Chinese: gb2312, big5

  This is probably an archive problem only; however, if Mailman adds a footer, it will produce garbage if the footer encoding differs from the message body encoding.

- To analyse the subject, Mailman needs to strip off the subject_prefix from the incoming message. If the message uses a MIME-encoded header, it may be that the subject prefix is base64 encoded. Currently, Mailman fails to strip the prefix in this case. There is a patch on SF that tries to decode the subject. If the encoding is not known to Python, this will still fail.

- To produce HTML pages, Mailman needs to quote markup characters. For some encodings (e.g. iso-2022-jp), HTML markup characters such as '<' may also occur as part of the multi-byte encoding. For these encodings, Mailman currently performs no quoting at all. This is incorrect if an iso-2022-jp message contains a true '<' character, which would need to be converted to '&lt;'.
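The subject pitfall above can be sketched with the standard `email.header` module of a modern Python. This is not the SF patch mentioned in the message; `strip_subject_prefix` and the `[demo]` prefix are hypothetical names chosen for illustration.

```python
from email.header import Header, decode_header, make_header


def strip_subject_prefix(raw_subject, prefix):
    """Decode a possibly MIME-encoded Subject and remove the list prefix."""
    # decode_header() splits the header into (bytes, charset) chunks and
    # make_header() joins them again as Unicode text. Decoding fails with
    # an exception if a chunk names a charset Python has no codec for.
    subject = str(make_header(decode_header(raw_subject)))
    # Drop the first occurrence of the prefix and tidy up the whitespace.
    return " ".join(subject.replace(prefix, "", 1).split())


# A MIME-encoded subject whose encoded word hides the prefix.
encoded = Header("[demo] Grüße", "utf-8").encode()
plain = strip_subject_prefix(encoded, "[demo]")
```

Because the prefix is matched on the decoded text, it does not matter whether the sender's mailer used base64 or quoted-printable for the header.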
The Japanese codec is in a good state and will be easy enough to ship; the Chinese ones are only available in CVS that I know of, so we will need to make a proper distribution.
I'd encourage you to have a look at the iconv codec also. If the system iconv is powerful enough (e.g. on Linux glibc), all encodings of the world would be supported with that single codec. Regards, Martin
"Martin" == Martin von Loewis <loewis@informatik.hu-berlin.de> writes:
    Martin> To be really precise, HTML 4.01's "document character set"
    Martin> is the "Universal Character Set" (as defined in ISO
    Martin> 10646), see

Yes, many thanks.

    Martin> Python does not currently include codecs for iso-2022-jp,
    Martin> gb2312, big5, euc-jp, shift-jis. Since mailman leaves all
    Martin> strings as-is, and never mixes encodings, it can let them
    Martin> pass through unmodified. There are a number of pitfalls,
    Martin> though:

I have been working actively on these problems. Hopefully we can ship these codecs with Mailman 2.1.

    Martin> This is probably an archive problem only; however, if
    Martin> mailman adds a footer, it will produce garbage if the
    Martin> footer encoding differs from the message body encoding.

    Martin> - To analyse the subject, Mailman needs to strip off the
    Martin> subject_prefix from the incoming message.

The subject and footer issue is a good one, and needs some work. We basically need a map of charset -> localized "Re:" prefixes; with the new email module's i18n support, it's trivial to decode headers and make sure we don't add a [PREFIX] to a message with Re: [PREFIX] in the local language.

I know German uses AW: -- does anyone have a list of commonly-used response prefixes in other languages? Japanese uses Re:, as far as I know.

As for the footer.. hm. Needs more thought. I doubt anyone wants to add an attachment for the footer; I think the best thing to do would be to look up the body's charset in a table and attach a properly localized footer if it's found. If it's not found, no footer is attached. If the charset is not specified, assume us-ascii.

What do you think?

    Martin> - To produce HTML pages, mailman needs to quote markup
    Martin> characters. For some encodings (e.g. iso-2022-jp), HTML
    Martin> markup character such as '<' may also occur as part of the
    Martin> multi-byte encoding. For these encodings, mailman
    Martin> currently performs no quoting at all. This is incorrect if
    Martin> an iso-2022-jp message contains a true '<' character,
    Martin> which would need to be converted to '&lt;'.

I have written a Python module that deals with this problem directly for iso-2022-jp; it would also be possible by converting to Unicode, doing HTML escaping, then converting to the output format.

http://nausicaa.interq.or.jp/mailman/JisEscape.py

    Martin> I'd encourage you to have a look at the iconv codec
    Martin> also. If the system iconv is powerful enough (e.g. on
    Martin> Linux glibc), all encodings of the world would be
    Martin> supported with that single codec.

Ah, if only all systems had such an iconv codec. I'm surprised iconv is so powerful on Linux glibc, yet gettext does not support iso-2022-jp directly.

Ben
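The second approach mentioned above (convert to Unicode, escape, convert back) might look like the sketch below. It is not the JisEscape.py module, and it assumes a Python whose standard library already ships an iso-2022-jp codec; `escape_for_html` is a hypothetical name.

```python
import html


def escape_for_html(raw, charset):
    """HTML-escape bytes held in a stateful or multi-byte encoding.

    Escaping happens on the decoded text, so a byte that merely looks
    like '<' or '&' inside a multi-byte character is never touched.
    """
    text = raw.decode(charset)
    return html.escape(text, quote=False).encode(charset)


# One hiragana letter followed by a genuine '<' character.
message = "ぜ < 1".encode("iso2022_jp")
escaped = escape_for_html(message, "iso2022_jp")
```

Only the genuine '<' is rewritten here; the price of this approach is that it needs a codec for every charset that can reach the archiver.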
Ben Gertzfield <che@debian.org> writes:
As for the footer.. hm. Needs more thought. I doubt anyone wants to add an attachment for the footer; I think the best thing to do would be to look up the body's charset in a table and attach a properly localized footer if it's found. If it's not found, no footer is attached. If the charset is not specified, assume us-ascii.
What do you think?
Sounds good. If the necessary codecs are available, one might try to recode the footer in the charset of the message, if no properly encoded footer is available.
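A rough sketch of the recoding variant suggested above: try to encode the footer in the body's charset and attach nothing when that is impossible. It assumes the footer is available as Unicode text; `footer_for` is a hypothetical helper, and the us-ascii default follows the proposal quoted earlier in this message.

```python
def footer_for(body_charset, footer_text):
    """Return the footer encoded in the body's charset, or None to skip it."""
    # No charset declared on the body: assume us-ascii.
    charset = body_charset or "us-ascii"
    try:
        return footer_text.encode(charset)
    except (LookupError, UnicodeEncodeError):
        # Either Python has no codec for this charset, or the footer uses
        # characters the charset cannot represent: attach no footer.
        return None


latin1_footer = footer_for("iso-8859-1", "Grüße von der Liste")
```

A lookup table of pre-translated footers per charset, as in the original proposal, could be consulted first, with this recoding step as the fallback.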
Martin> I'd encourage you to have a look at the iconv codec Martin> also. If the system iconv is powerful enough (e.g. on Martin> Linux glibc), all encodings of the world would be Martin> supported with that single codec.
Ah, if only all systems had such an iconv codec.
Should that stop you from using iconv where available? On a Debian system, you'll know it is present :-)
I'm surprised iconv is so powerful on Linux glibc, yet gettext does not support iso-2022-jp directly.
I'm not surprised. The traditional encoding on Unix is eucJP; and gettext/libc can transparently recode the message to any target format. So the catalog could be in eucJP, or utf-8, and you still could produce messages in iso-2022-jp. I just tried it on a catalog; msgfmt complained about two problems: for one thing, it complains that iso-2022-jp is not a portable character set name, i.e. that many systems apparently don't recognize this character set name. The other problem is that it complains about illegal escape sequences. I don't know where this comes from; it might be a gettext mbcs bug, or a glibc bug, or an error in my data. If you have real data which are not properly processed by msgfmt, please report that bug to Bruno Haible. I can't see any reason why gettext(3) would have any difficulties with iso-2022-jp. Regards Martin
"Martin" == Martin von Loewis <loewis@informatik.hu-berlin.de> writes:
    Martin> I can't see any reason why gettext(3) would have any
    Martin> difficulties with iso-2022-jp.

It's possible that " could exist within an iso-2022-jp string, I believe; this would probably throw off gettext's parsing when you get a string like:

    "\eB(stuff stuff stuff " more stuff stuff\eb"

It's too bad that gettext doesn't seem to have an alternate quoting mechanism.

Ben
Ben Gertzfield <che@debian.org> writes:
Martin> I can't see any reason why gettext(3) would have any Martin> difficulties with iso-2022-jp.
It's possible that " could exist within a iso-2022-jp string, I believe; this would probably throw off gettext's parsing when you get a string like:
That would be a msgfmt(1) problem, right? I doubt gettext(3) has any problems if you manage to compile the catalog. As for the msgfmt problem, I think msgfmt uses the mbcs routines of the C library, which should be capable of distinguishing between a byte that is part of a multi-byte character, and an individual ASCII byte. So I think this should be implementable without any change to the syntax; if it still fails, you might want to submit a bug report. Regards, Martin
Hi,
I'm not surprised. The traditional encoding on Unix is eucJP; and gettext/libc can transparently recode the message to any target format. So the catalog could be in eucJP, or utf-8, and you still could produce messages in iso-2022-jp.
I just tried it on a catalog; msgfmt complained about two problems: for one thing, it complains that iso-2022-jp is not a portable character set name, i.e. that many systems apparently don't recognize this character set name. The other problem is that it complains about illegal escape sequences. I don't know where this comes from; it might be a gettext mbcs bug, or a glibc bug, or an error in my data. If you have real data which are not properly processed by msgfmt, please report that bug to Bruno Haible.
I can't see any reason why gettext(3) would have any difficulties with iso-2022-jp.
iso-2022-jp contains " and \ in the second byte. Big5 also.

I think iso-2022-jp is only for message transportation and not good for text manipulation like within Mailman.

I still recommend that all the messages and templates be encoded in EUC-JP within Mailman and the Web interfaces, and converted to iso-2022-jp when mail comes in and goes out.

--
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/
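The recommendation above (EUC-JP internally, iso-2022-jp on the wire) could be sketched as follows, assuming a Python that ships both codecs in its standard library. The function names `mail_to_internal` and `internal_to_mail` are hypothetical.

```python
def mail_to_internal(raw):
    """Incoming mail body: iso-2022-jp bytes -> EUC-JP bytes."""
    return raw.decode("iso2022_jp").encode("euc_jp")


def internal_to_mail(raw):
    """Outgoing mail body: EUC-JP bytes -> iso-2022-jp bytes."""
    return raw.decode("euc_jp").encode("iso2022_jp")


comma = "、"  # IDEOGRAPHIC COMMA, a very common Japanese character
wire = comma.encode("iso2022_jp")   # 7-bit form used in mail
internal = mail_to_internal(wire)   # 8-bit form used for processing
```

In the iso-2022-jp form this character includes a byte equal to an ASCII double quote, whereas in EUC-JP every byte of a Japanese character has the high bit set, which is what makes EUC-JP safer for byte-oriented parsing and quoting.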
"Tokio" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
    Tokio> I still recommend all the messages and templates should be
    Tokio> encoded in EUC-JP within Mailman and Web Interfaces, and
    Tokio> converting them into iso-2022-jp when mail is in and out.

This is exactly what the patch I sent to the list yesterday does.

Ben
Ben Gertzfield wrote:
"Tokio" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
Tokio> I still recommend all the messages and templates should be Tokio> encoded in EUC-JP within Mailman and Web Interfaces, and Tokio> converting them into iso-2022-jp when mail is in and out.
This is exactly what the patch I sent to the list yesterday does.
Well then, why do you need iso-2022-jp in gettext messages ? -- Tokio Kikuchi
"Tokio" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
    Tokio> Well then, why do you need iso-2022-jp in gettext messages
    Tokio> ?

I don't.. The discussion was just about how it was interesting that gettext does not/can not support iso-2022-jp. We have no need of using iso-2022-jp gettext strings in Mailman.

Sorry if I confused you.

Ben
Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
iso-2022-jp contains " and \ in the second byte. Big5 also.
That should not be a problem: the MBCS routines of the C library (or, rather, iconv) should be able to take care of that (if the encoding is known); so the problem can be solved. I know it isn't, but anybody interested in solving it could either report a bug or maybe even design a patch. Regards, Martin
Quoting Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp>:
iso-2022-jp contains " and \ in the second byte. Big5 also.
I think iso-2022-jp is only for message transportation and not good for text manipulation like within Mailman.
I still recommend all the messages and templates should be encoded in EUC-JP within Mailman and Web Interfaces, and converting them into iso-2022-jp when mail is in and out.
The GNU gettext utilities should be able to handle these charsets without problem (provided you have gettext >= 0.10.36). There were a number of changes between 0.10.35 and 36, so that it takes the encoding into account when processing the file (so that the second byte of a multi-byte character doesn't need to be escaped; in fact, if you did escape bytes like that, the file won't work with newer gettext). Whether the python tools do the same when processing a message catalog is another matter. James. -- James Henstridge <james@daa.com.au>
participants (5)
- Ben Gertzfield
- James Henstridge
- Juan Carlos Rey Anaya
- Martin von Loewis
- Tokio Kikuchi