Python 3.3, gettext and Unicode problems
Terry Reedy
tjreedy at udel.edu
Sun Dec 30 21:26:22 EST 2012
On 12/30/2012 8:48 PM, Terry Reedy wrote:
> On 12/30/2012 7:39 PM, Marcel Rodrigues wrote:
>> I'm using Python 3.3 (CPython) and am having trouble getting the
>> standard gettext module to handle Unicode messages.
Addition to previous response.
>> import gettext
>>
>> t = gettext.translation("greeting", "locale", ["pt"])
Reading further, I see that this returns a GNUTranslations instance
>> _ = t.lgettext
So this calls its method:
'''
GNUTranslations.gettext(message)
Look up the message id in the catalog and return the corresponding
message string, as a Unicode string. If there is no entry in the catalog
for the message id, and a fallback has been set, the look up is
forwarded to the fallback’s gettext() method. Otherwise, the message id
is returned.
GNUTranslations.lgettext(message)
Equivalent to gettext(), but the translation is returned as a bytestring
encoded in the selected output charset, or in the preferred system
encoding if no encoding was explicitly set with set_output_charset().
'''
So if you want the unicode translation to be utf-8 encoded, either use
.gettext and encode it yourself, or use "t.set_output_charset('utf-8')"
to have it done automatically.
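
A minimal sketch of the first option, calling .gettext and encoding the
result yourself. To stay self-contained it does not read the poster's
"greeting" catalog; instead a hypothetical helper, build_mo (not part of the
gettext module), packs a one-entry catalog into .mo bytes in memory, and
"olá" is an assumed translation. The .lgettext / set_output_charset route is
left out because those methods were removed from later Python 3 releases.

```python
import gettext
import io
import struct


def build_mo(catalog):
    """Hypothetical helper: pack a {msgid: msgstr} dict as little-endian .mo bytes."""
    # The empty msgid carries the metadata, including the catalog's charset.
    header = "Content-Type: text/plain; charset=UTF-8\n"
    entries = [("", header)] + sorted(catalog.items())
    count = len(entries)
    ids_table = 28                       # the file header is seven 32-bit words
    strs_table = ids_table + 8 * count   # each table entry is (length, offset)
    offset = strs_table + 8 * count      # string data starts after both tables
    index = []
    data = b""
    for column in (0, 1):                # all msgids first, then all msgstrs
        for entry in entries:
            raw = entry[column].encode("utf-8")
            index.append(struct.pack("<II", len(raw), offset + len(data)))
            data += raw + b"\0"
    head = struct.pack("<7I", 0x950412DE, 0, count, ids_table, strs_table, 0, 0)
    return head + b"".join(index) + data


t = gettext.GNUTranslations(io.BytesIO(build_mo({"hello": "olá"})))
text = t.gettext("hello")        # str, decoded using the catalog's charset
encoded = text.encode("utf-8")   # bytes, encoded explicitly by the caller
print(ascii(text), encoded)      # ascii() keeps this printable on any console
```

With a real catalog, the gettext.translation(...) call from the original
post would take the place of the GNUTranslations(...) line.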
>> print("_charset = {0}\n".format(t._charset))
>> print(_("hello"))
But since you are printing to screen, I suggest using .gettext and letting
print do the encoding to the screen encoding. If that still raises an
encoding error, then the problem is the console emulator. On Windows,
for instance, IDLE windows handle the entire BMP charset while the
stupid Windows Command Prompt window does not (certainly not by default,
and not yet, as far as I know).
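
A rough illustration of why print can fail on a limited console. Here
"ascii" is only an assumed stand-in for a console encoding that lacks the
character; the real value of sys.stdout.encoding depends on the terminal.

```python
import sys

text = "olá"

# The encoding print() will use for this process; it varies by console.
console_encoding = sys.stdout.encoding

# Encoding to a charset that lacks "á" fails, as print would on such a console.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# An explicit error handler trades correctness for not crashing.
fallback = text.encode("ascii", errors="replace")
```

The "replace" handler substitutes a question mark for each character the
target encoding cannot represent, so the output is lossy but does not raise.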
The encoding of the translations file on disk determines how the bytes
of the translation table should be *decoded* when read, to create unicode
strings. It does not determine how those strings should be *encoded*
when sent to a particular destination. That may depend on the
destination. Multilingual international sites used to encode pages in
different limited national encodings, according to the language and
destination. Now many encode and send *everything* as utf-8. I think
this is the proper policy now. .lgettext seems oriented to the older,
pre-utf-8, national locale encoding way of doing things.
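
A small sketch of that separation, assuming a catalog stored as Latin-1 (the
byte string below is made up for illustration): the file's charset is used
once, for decoding, and each destination then gets its own encoding.

```python
# Bytes as they might sit in a Latin-1 encoded catalog (assumed example).
raw_on_disk = b"ol\xe1"

# Decoding uses the file's declared charset and yields a str.
text = raw_on_disk.decode("latin-1")

# Encoding is chosen per destination, independently of the file's charset.
as_utf8 = text.encode("utf-8")      # e.g. for a web page
as_cp1252 = text.encode("cp1252")   # e.g. for a legacy Windows file
```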
--
Terry Jan Reedy