I'm using Python 3.3 (CPython) and am having trouble getting the standard gettext module to handle Unicode messages.<div>My problem can be isolated as follows:</div><div><br></div><div>I have 3 files in a folder: greeting.py, greeting.po and msgfmt.py.</div>
<div><br></div><div>-- greeting.py --</div><div><font face="courier new, monospace">import gettext</font></div><div><font face="courier new, monospace"><br></font></div><div><font face="courier new, monospace">t = gettext.translation("greeting", "locale", ["pt"])</font></div>
<div><font face="courier new, monospace">_ = t.lgettext</font></div><div><font face="courier new, monospace"><br></font></div><div><font face="courier new, monospace">print("_charset = {0}\n".format(t._charset))</font></div>
<div><font face="courier new, monospace">print(_("hello"))</font></div><div>-- EOF --</div><div><br></div><div>-- greeting.po --</div><div><font face="courier new, monospace">msgid ""</font></div><div>
<font face="courier new, monospace">msgstr ""</font></div><div><font face="courier new, monospace">"Project-Id-Version: 1.0\n"</font></div><div><font face="courier new, monospace">"MIME-Version: 1.0\n"</font></div>
<div><font face="courier new, monospace">"Content-Type: text/plain; charset=UTF-8\n"</font></div><div><font face="courier new, monospace">"Content-Transfer-Encoding: 8bit\n"</font></div><div><font face="courier new, monospace"><br>
</font></div><div><font face="courier new, monospace">msgid "hello"</font></div><div><font face="courier new, monospace">msgstr "olá"</font></div><div>-- EOF --</div><div><br></div><div>msgfmt.py was downloaded from <a href="http://hg.python.org/cpython/file/9e6ead98762e/Tools/i18n/msgfmt.py">http://hg.python.org/cpython/file/9e6ead98762e/Tools/i18n/msgfmt.py</a>, since this tool apparently isn't included in the python3 package available in the official Arch Linux repositories.</div>
<div><br></div><div>It's probably also worth noting that the greeting.po file is itself encoded as UTF-8.</div><div><br></div><div>From that folder, I run the following commands:</div><div><br></div><div><font face="courier new, monospace">$ mkdir -p locale/pt/LC_MESSAGES</font></div>
<div><font face="courier new, monospace">$ python msgfmt.py -o !$/greeting.mo greeting.po</font></div><div><font face="courier new, monospace">$ python greeting.py</font></div><div><br></div><div>The output is:</div><div>
<div><font face="courier new, monospace">_charset = UTF-8</font></div><div><font face="courier new, monospace"><br></font></div><div><font face="courier new, monospace">Traceback (most recent call last):</font></div><div>
<font face="courier new, monospace"> File "greeting.py", line 7, in &lt;module&gt;</font></div><div><font face="courier new, monospace"> print(_("hello"))</font></div><div><font face="courier new, monospace"> File "/usr/lib/python3.3/gettext.py", line 314, in lgettext</font></div>
<div><font face="courier new, monospace"> return tmsg.encode(locale.getpreferredencoding())</font></div><div><font face="courier new, monospace">UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 2: ordinal not in range(128)</font></div>
</div><div><br></div><div>My interpretation of this output is that even though gettext correctly detects the MO file charset as UTF-8, it tries to encode the translated message with the system's "preferred encoding", which happens to be ASCII.</div>
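<div><br></div><div>To illustrate that interpretation, here is a minimal sketch that reproduces only the encode step shown in the traceback, outside of gettext. The helper name encode_like_lgettext is hypothetical, and I'm assuming lgettext does nothing more than encode the already-decoded message with the encoding it is given:</div>

```python
import locale

# The translated message as a Python 3 str ("ol\u00e1" is "olá").
tmsg = "ol\u00e1"


def encode_like_lgettext(message, encoding):
    """Hypothetical helper: mimics the encode call seen in the traceback."""
    return message.encode(encoding)


# Whatever the system reports as its preferred encoding (ASCII in my case).
print("preferred encoding:", locale.getpreferredencoding())

# Encoding with the catalog's charset (UTF-8) succeeds and yields bytes.
utf8_bytes = encode_like_lgettext(tmsg, "UTF-8")
print(utf8_bytes)

# Encoding with ASCII fails, because U+00E1 is outside the ASCII range.
try:
    encode_like_lgettext(tmsg, "ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ascii failed:", exc)
```

<div>This only shows that the error comes from the final encode with an ASCII target, not from reading the MO file. It does not reproduce anything else the gettext module does.</div>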
<div><br></div><div>Does anyone know why this happens? Is this a bug in my code? Maybe I have misunderstood gettext...</div><div><br></div><div>Thanks,</div><div><br></div><div> Marcel</div>