Small issues in gettext support

Hello folks,

I've been working with gettext support in Python, and found some issues I'd like to discuss with you.

First, I've noticed that there's a difference in the internal implementation of gettext and GNU gettext regarding the returned encoding on non-unicode strings. Notice the difference in the result of this code:

    import gettext
    import locale
    locale.setlocale(locale.LC_ALL, "")
    locale.textdomain("apt-cdrom-registry")
    gettext.textdomain("apt-cdrom-registry")
    print locale.gettext("Choose the available CDROMs from the list below")
    print gettext.gettext("Choose the available CDROMs from the list below")

This has shown the following:

    Escolha os CDROMs disponíves na lista abaixo
    Escolha os CDROMs disponÃves na lista abaixo

The reason for this difference is clear: GNU gettext defaults to the current locale when returning encoded strings, while gettext.py returns strings in the encoding used in the .mo file. The fix is simply changing the following code

    # Encode the Unicode tmsg back to an 8-bit string, if possible
    if self._charset:
        return tmsg.encode(self._charset)

to use the system encoding (sys.getdefaultencoding()) instead of self._charset.

Regarding a similar issue, I've also noticed that we're currently missing bind_textdomain_codeset() support. This function changes the codeset used to return the translated strings.

So, I'd like to implement the following changes:

- Change the default codeset used by gettext.py in functions returning an encoded string to match the system encoding.
- Introduce bind_textdomain_codeset() in locale.
- Introduce bind_textdomain_codeset() in gettext.py implementing an equivalent functionality.

Comments?

-- Gustavo Niemeyer http://niemeyer.net
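
The garbled second line is consistent with UTF-8 bytes from the .mo file being shown on a terminal that expects Latin-1. A minimal sketch of that effect follows, written for modern Python 3 rather than the Python 2 used in the thread. The catalog charset (UTF-8) and the terminal charset (ISO-8859-1) are assumptions; the message does not state either.

```python
# Assumption: the catalog stores UTF-8 and the terminal expects ISO-8859-1.
translated = "dispon\u00edves"  # "disponíves", spelled as printed in the thread

catalog_bytes = translated.encode("utf-8")      # bytes in the catalog's charset
locale_bytes = translated.encode("iso-8859-1")  # bytes in the locale's charset

# A Latin-1 terminal interprets every single byte as one character.
shown_by_gettext_py = catalog_bytes.decode("iso-8859-1")
shown_by_gnu_gettext = locale_bytes.decode("iso-8859-1")

print(repr(shown_by_gnu_gettext))
print(repr(shown_by_gettext_py))
```

Under these assumptions the two UTF-8 bytes of "í" become "Ã" followed by a soft hyphen (U+00AD), which terminals typically do not draw, so only the "Ã" is visible.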

On Sun, 2004-04-25 at 19:40, Gustavo Niemeyer wrote:
Hello folks,
I've been working with gettext support in Python, and found some issues I'd like to discuss with you.
First, I've noticed that there's a difference in the internal implementation of gettext and GNU gettext regarding the returned encoding on non-unicode strings. Notice the difference in the result of this code:
    import gettext
    import locale
    locale.setlocale(locale.LC_ALL, "")
    locale.textdomain("apt-cdrom-registry")
    gettext.textdomain("apt-cdrom-registry")
    print locale.gettext("Choose the available CDROMs from the list below")
    print gettext.gettext("Choose the available CDROMs from the list below")
This has shown the following:
    Escolha os CDROMs disponíves na lista abaixo
    Escolha os CDROMs disponÃves na lista abaixo
The reason for this difference is clear: GNU gettext defaults to the current locale when returning encoded strings, while gettext.py returns strings in the encoding used in the .mo file. The fix is simply changing the following code
    # Encode the Unicode tmsg back to an 8-bit string, if possible
    if self._charset:
        return tmsg.encode(self._charset)
to use the system encoding (sys.getdefaultencoding()) instead of self._charset.
I'd be worried most about backwards compatibility, since the module has worked this way since its early days. Also, wouldn't this be an opportunity for getting lots of UnicodeErrors? E.g. my system encoding is 'ascii' so gettext() would fail for catalogs containing non-ascii characters. I shouldn't have to change my system encoding just to avoid errors, but with your suggestion, wouldn't that make many catalogs basically unusable for me?
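
The failure mode described here can be illustrated with a small sketch, again in modern Python 3 rather than the thread's Python 2. The helper name `can_encode` is hypothetical; the sample text is the translated string shown earlier in the thread.

```python
message = "Escolha os CDROMs dispon\u00edves na lista abaixo"

def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# An 'ascii' target cannot represent the accented character,
# while Latin-1 and UTF-8 both can.
for encoding in ("ascii", "iso-8859-1", "utf-8"):
    print(encoding, can_encode(message, encoding))
```

With an 'ascii' system encoding, a strict encode of any catalog entry containing such characters would raise instead of returning a string.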
Regarding a similar issue, I've also noticed that we're currently missing bind_textdomain_codeset() support. This function changes the codeset used to return the translated strings.
So, I'd like to implement the following changes:
- Change the default codeset used by gettext.py in functions returning an encoded string to match the system encoding.
- Introduce bind_textdomain_codeset() in locale.
- Introduce bind_textdomain_codeset() in gettext.py implementing an equivalent functionality.
Would adding bind_textdomain_codeset() provide a way for the application to change the default encoding? If so, I'd be in favor of adding bind_textdomain_codeset() but not changing the default encoding for returned strings. Then update the documentation to describe current behavior and how to change it via that function call.

-Barry
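
One way to picture this compromise is a per-domain registry: the catalog's charset stays the default, and an application opts in to a different output codeset. The sketch below is in modern Python 3, and the names `bind_codeset` and `encode_translation` are hypothetical, not the gettext module's API.

```python
# Hypothetical registry of domain -> output codeset (illustrative only).
_bound_codesets = {}

def bind_codeset(domain, codeset=None):
    """Bind an output codeset to a domain and return the current binding."""
    if codeset is not None:
        _bound_codesets[domain] = codeset
    return _bound_codesets.get(domain)

def encode_translation(domain, text, catalog_charset):
    """Encode in the bound codeset, or in the catalog charset if unbound."""
    return text.encode(_bound_codesets.get(domain, catalog_charset))

text = "dispon\u00edves"
before = encode_translation("apt-cdrom-registry", text, "utf-8")
bind_codeset("apt-cdrom-registry", "iso-8859-1")
after = encode_translation("apt-cdrom-registry", text, "utf-8")
print(before, after)
```

Existing callers that never bind a codeset keep getting the catalog's bytes, which is the backwards-compatible behavior asked for here.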

I'd be worried most about backwards compatibility, since the module has worked this way since its early days. Also, wouldn't this be an
IMO, the current behavior is wrong, so breaking backwards compatibility in that case would be fixing something important.
opportunity for getting lots of UnicodeErrors? E.g. my system encoding is 'ascii' so gettext() would fail for catalogs containing non-ascii characters. I shouldn't have to change my system encoding just to avoid errors, but with your suggestion, wouldn't that make many catalogs basically unusable for me?
There are a few extra points to notice here:

- Different .mo files may have different encodings.
- The translation system is made in a way that the programmer should not have to worry about the encoding used by the translators.
- The current scheme may introduce a wrong practice: forcing translators to use some specific encoding to avoid breaking the program.
- We already have support for getting the "unicode" version of the string. This is currently the right way to get the translation in some specific encoding, since it uncouples the translation encoding from the expected encoding.
- In cases where you'd get the "UnicodeError", you'd see a mangled string which would be unreadable. To avoid the UnicodeError, we may also return the original string in cases where the UnicodeError is raised.
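
The fallback suggested in the last point could look roughly like this in modern Python 3. The function name `encode_or_original` is hypothetical, and the sketch assumes msgids are plain ASCII, as is conventional for gettext catalogs.

```python
def encode_or_original(msgid, translated, encoding):
    """Return the translation encoded in `encoding`, else the untranslated msgid."""
    try:
        return translated.encode(encoding)
    except UnicodeEncodeError:
        # Assumption: the msgid is ASCII, so it is representable everywhere.
        return msgid.encode("ascii", "replace")

print(encode_or_original("available", "dispon\u00edves", "iso-8859-1"))
print(encode_or_original("available", "dispon\u00edves", "ascii"))
```

The user then sees the untranslated text instead of an exception when the target encoding cannot represent the translation.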
Would adding bind_textdomain_codeset() provide a way for the application to change the default encoding?
Yes, it changes the default encoding.
If so, I'd be in favor of adding bind_textdomain_codeset() but not changing the default encoding for returned strings. Then update the documentation to describe current behavior and how to change it via that function call.
Thanks for your suggestion! -- Gustavo Niemeyer http://niemeyer.net

opportunity for getting lots of UnicodeErrors? E.g. my system encoding is 'ascii' so gettext() would fail for catalogs containing non-ascii characters. I shouldn't have to change my system encoding just to avoid errors, but with your suggestion, wouldn't that make many catalogs basically unusable for me?
Another interesting point, from the gettext documentation:

"""
The output character set is, by default, the value of `nl_langinfo (CODESET)', which depends on the `LC_CTYPE' part of the current locale. But programs which store strings in a locale independent way (e.g. UTF-8) can request that `gettext' and related functions return the translation in that encoding, by use of the `bind_textdomain_codeset' function.
"""

So, we should use some variant of locale.nl_langinfo(locale.CODESET) to get the default encoding. Since this should be changed together with the language used when 'setlocale' is issued, the UnicodeError problems you're afraid of would also go away.

-- Gustavo Niemeyer http://niemeyer.net
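
A minimal sketch of querying that codeset from Python follows. The helper name `locale_codeset` is hypothetical, and falling back to locale.getpreferredencoding() on platforms without nl_langinfo is an assumption made for this sketch, not something proposed in the message.

```python
import locale

def locale_codeset():
    """Return the codeset of the current LC_CTYPE locale."""
    try:
        # Adopt the user's locale from the environment, as the thread's example does.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # unsupported locale settings: keep whatever locale is active
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET)
    # Assumption: getpreferredencoding() is an acceptable substitute here.
    return locale.getpreferredencoding(False)

codeset = locale_codeset()
print(codeset)
```

The value depends on the environment the program runs in, so the same catalog would be encoded differently for users with different locales.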

- The current scheme may introduce a wrong practice: forcing translators to use some specific encoding to avoid breaking the program.
That's not true. Applications should just always use ugettext.
As I said before, I agree that ugettext is the best way to get translations in a generic way. OTOH, that's not what I'm discussing here, as you may observe by the general discussion. -- Gustavo Niemeyer http://niemeyer.net

Gustavo Niemeyer wrote:
    # Encode the Unicode tmsg back to an 8-bit string, if possible
    if self._charset:
        return tmsg.encode(self._charset)
to use the system encoding (sys.getdefaultencoding()) instead of self._charset.
That shouldn't be sys.getdefaultencoding(), but locale.getpreferredencoding(). However, I agree with Barry that the current behaviour should not be changed. People may already rely on gettext returning byte strings as-is.
- Change the default codeset used by gettext.py in functions returning an encoded string to match the system encoding.
No. Explicit is better than implicit; users desiring that feature should write

    _charset = locale.getpreferredencoding()
    def _(msg):
        return dgettext("domain", msg).encode(_charset)

I advocate never to use gettext.install, in which case you have a custom _ implementation *anyway*, which would then also include the textual domain. It should not be too much effort for that function to transcode if desired.
- Introduce bind_textdomain_codeset() in locale.
- Introduce bind_textdomain_codeset() in gettext.py implementing an equivalent functionality.
That is ok. You could also try to provide that feature consistently, e.g. inside .install. Regards, Martin

That shouldn't be sys.getdefaultencoding(), but locale.getpreferredencoding().
Agreed.
However, I agree with Barry that the current behaviour should not be changed. People may already rely on gettext returning byte strings as-is.
Barry also said: """ Any sane app is going to use the class-based API and the ugettext() method anyway, so maybe it does make sense to simply be API compatible with GNU gettext for the old-style interface. """
- Change the default codeset used by gettext.py in functions returning an encoded string to match the system encoding.
No. Explicit is better than implicit; users desiring that feature should write [...]
You believe that returning a string in some unpredictable encoding used by the translator is explicit? I have to disagree. We have something named 'gettext' with a different behavior than the original project, in a way that breaks the concepts used to decide the behavior of the classical implementation.
I advocate never to use gettext.install, in which case you have a custom _ implementation *anyway*, which would then [...]
"You shouldn't be using it" is not a point I'm taking into account.
- Introduce bind_textdomain_codeset() in locale.
- Introduce bind_textdomain_codeset() in gettext.py implementing an equivalent functionality.
That is ok. You could also try to provide that feature consistently, e.g. inside .install.
Agreed. -- Gustavo Niemeyer http://niemeyer.net

Gustavo Niemeyer wrote:
You believe that returning a string in some unpredictable encoding used by the translator is explicit?
No. What is explicit here is that we return the bytes in the .mo file; and conversion of these bytes should be requested explicitly.

As for compatibility with GNU gettext: Older versions of GNU gettext did not perform any conversion, either. I think changing the behaviour of an existing function was a bad thing to do for GNU gettext, as well, and I don't think we should repeat that mistake.

It would be good to find out what users of Python gettext think about a possible change.

Temporarily warning about this change of behaviour is also unacceptable. Adding another function, e.g. lgettext (for local charset), along with lngettext, ldgettext, might be another solution.

Regards, Martin
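
The additive approach could be pictured as below: new functions return locale-encoded bytes while the existing ones are left untouched. This is a sketch in modern Python 3 with hypothetical names (`lgettext_sketch`, `lngettext_sketch`); it is not the implementation of any function in the gettext module.

```python
import gettext
import locale

def lgettext_sketch(translations, message, encoding=None):
    """Translate message and encode it for the local charset."""
    encoding = encoding or locale.getpreferredencoding(False)
    return translations.gettext(message).encode(encoding, "replace")

def lngettext_sketch(translations, singular, plural, n, encoding=None):
    """Plural-aware counterpart of lgettext_sketch."""
    encoding = encoding or locale.getpreferredencoding(False)
    return translations.ngettext(singular, plural, n).encode(encoding, "replace")

catalog = gettext.NullTranslations()  # stand-in for a real catalog
print(lgettext_sketch(catalog, "file", "iso-8859-1"))
print(lngettext_sketch(catalog, "file", "files", 2, "iso-8859-1"))
```

Because nothing existing changes, code that relies on the current return values keeps working, and callers who want locale-encoded output ask for it by name.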

You believe that returning a string in some unpredictable encoding used by the translator is explicit?
No. What is explicit here is that we return the bytes in the .mo file; and conversion of these bytes should be requested explicitly.
As for compatibility with GNU gettext: Older versions of GNU gettext did not perform any conversion, either. I think changing the behaviour of an existing function was a bad thing to do for GNU gettext, as well, and I don't think we should repeat that mistake.
I wasn't aware of this.
It would be good to find out what users of Python gettext think about a possible change.
I'd be glad to hear from them as well. Anyone?
Temporarily warning about this change of behaviour is also unacceptable. Adding another function, e.g. lgettext (for local charset), along with lngettext, ldgettext, might be another solution.
Ack. Thanks for discussing. -- Gustavo Niemeyer http://niemeyer.net

On Wed, 2004-04-28 at 14:15, "Martin v. Löwis" wrote:
Temporarily warning about this change of behaviour is also unacceptable. Adding another function, e.g. lgettext (for local charset), along with lngettext, ldgettext, might be another solution.
That would work for me. I'd much rather see a different interface to support what Gustavo wants than to change an existing interface. -Barry

However, I agree with Barry that the current behaviour should not be changed. People may already rely on gettext returning byte strings as-is.
Btw, about people relying on the current implementation, we could introduce this as a warning in 2.4, which could be disabled by using "bind_textdomain_codeset()" for the expected translation, and then turn the behavior into the default one on 2.4+1. -- Gustavo Niemeyer http://niemeyer.net
participants (3)
- "Martin v. Löwis"
- Barry Warsaw
- Gustavo Niemeyer