[Tutor] Printing Chinese characters?

Thu Oct 16 02:47:31 EDT 2003

On Wed, 15 Oct 2003, Neal McBurnett wrote:

> Ahh - and the final step - that would yield this utf-8 encoding (of the
> original string minus the troublesome characters) rendered as a python
> string:
>
> print ('\xe7\xaa\xaa\xe6\xb4\x89\xe9\x83\xbd\xe7\x8d\x97\xe8\x85\x94\xe3'
>        '\x82\x81\xe8\xa1\xa7\xe7\xaa\xaa\xe8\x9d\xa5\xe7\xba\x97\xe5\xa5'
>        '\xb4\x0a')

Ah, so it is UTF-8 after all?  I must have introduced some weird
characters when I copied and pasted.  You're right!  Oh, cool!

###
>>> s = ('\xe7\xaa\xaa\xe6\xb4\x89\xe9\x83\xbd\xe7\x8d\x97\xe8'
...    + '\x85\x94\xe3\x82\x81\xe8\xa1\xa7\xe7\xaa\xaa\xe8\x9d'
...    + '\xa5\xe7\xba\x97\xe5\xa5\xb4').decode('utf8')
>>> s
u'\u7aaa\u6d09\u90fd\u7357\u8154\u3081\u8867\u7aaa\u8765\u7e97\u5974'
###

There, now it's decoding properly.  Yes, it matches what Neal decoded:

> > U+7AAA kDefinition hollow; pit; depression; swamp
> > U+90FD kDefinition metropolis, capital; all, the whole; elegant,
> > refined
> > U+7357 kDefinition unruly, wild, violent, lawless
> > U+8154 kDefinition chest cavity; hollow in body
> > U+7AAA kDefinition hollow; pit; depression; swamp
> > U+8765 kDefinition a fly which is used similarly to cantharides
> > U+5974 kDefinition slave, servant

Wow, that sounds rather... um... grim.  *grin*
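
For anyone following along on a newer interpreter, here is a minimal
sketch of the same check, assuming Python 3, where byte strings need a
b prefix and decode() returns an ordinary str.  The names raw and
code_points are just for illustration; the U+XXXX formatting is only
there so the output can be compared against Neal's list by eye.

```python
# Assumes Python 3: bytes literals take a b prefix, and decode()
# returns a normal (Unicode) str.
raw = (b'\xe7\xaa\xaa\xe6\xb4\x89\xe9\x83\xbd\xe7\x8d\x97\xe8'
       b'\x85\x94\xe3\x82\x81\xe8\xa1\xa7\xe7\xaa\xaa\xe8\x9d'
       b'\xa5\xe7\xba\x97\xe5\xa5\xb4')

s = raw.decode('utf-8')

# One entry per character, written in the U+XXXX style used above.
code_points = ['U+%04X' % ord(ch) for ch in s]
print(len(s))
print(' '.join(code_points))
```

If the bytes were not valid UTF-8, decode() would raise a
UnicodeDecodeError instead of returning a string, so a clean decode is
itself a decent hint that the guess about the encoding is right.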

Most web browsers have native support for UTF-8-encoded files, so, in a
pinch, you might be able to see the message this way:

###
msg = ('\xe7\xaa\xaa\xe6\xb4\x89\xe9\x83\xbd\xe7\x8d\x97\xe8'
       + '\x85\x94\xe3\x82\x81\xe8\xa1\xa7\xe7\xaa\xaa\xe8\x9d'
       + '\xa5\xe7\xba\x97\xe5\xa5\xb4')
print """<!doctype HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>%s</p>
</body>
</html>""" % msg
###

Redirect the output of this program to an HTML file, and then try
opening that file in your browser.
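
If shell redirection is awkward on your system, the program can write
the file itself.  This is only a sketch, assuming Python 3; the helper
name write_html_page and the output file name are made up for
illustration.  The important part is opening the file with
encoding='utf-8', so that the bytes on disk agree with the charset
declared in the meta tag.

```python
import html
import os
import tempfile

PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>%s</p>
</body>
</html>
"""


def write_html_page(text, path):
    """Write text into a small UTF-8 HTML page (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as f:
        # html.escape() guards against <, > and & in the message.
        f.write(PAGE % html.escape(text))


msg = (b'\xe7\xaa\xaa\xe6\xb4\x89\xe9\x83\xbd\xe7\x8d\x97\xe8'
       b'\x85\x94\xe3\x82\x81\xe8\xa1\xa7\xe7\xaa\xaa\xe8\x9d'
       b'\xa5\xe7\xba\x97\xe5\xa5\xb4').decode('utf-8')

# Hypothetical location: a file in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), 'weird_chinese_msg.html')
write_html_page(msg, out_path)
print(out_path)
```

Open the printed path in a browser to view the page.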

If you're still having trouble seeing it, visit:

    http://hkn.eecs.berkeley.edu/~dyoo/weird_chinese_msg.pdf

I've printed it out as a PDF as a stopgap measure if you're really
desperate to see the Chinese characters.  *grin*

But does anyone know if ReportLab's happy with UTF-8 characters?

> > > There is an interesting comment under CJK encodings (Chinese, Japanese,
> > > Korean) as follows:
> > >     # The codecs for these encodings are not distributed with the
> > >     # Python core, but are included here for reference, since the
> > >     # locale module relies on having these aliases available.
> > >
> > > Do you (or anyone else) know where I can get the Chinese encodings,
> > > including BIG-5?

Here you go:

    http://cjkpython.i18n.org/

It looks like we won't need them this time, but if we run across
BIG5-encoded files, we'll now know how to transform them to UTF-8.
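
As a rough sketch of that transformation, assuming an interpreter that
has a 'big5' codec available (Python 3 ships one in its standard
library; otherwise it has to come from a package like the one above):
decode the BIG5 bytes to Unicode, then encode the result as UTF-8.  The
function name big5_to_utf8 and the two-character sample are made up for
illustration.

```python
def big5_to_utf8(big5_bytes):
    """Re-encode BIG5 bytes as UTF-8 bytes (hypothetical helper)."""
    return big5_bytes.decode('big5').encode('utf-8')


# Build a small BIG5 sample from two common characters (U+4E2D, U+6587)
# rather than hard-coding the BIG5 bytes.
sample = '\u4e2d\u6587'.encode('big5')
converted = big5_to_utf8(sample)
print(repr(converted))
```

Going through Unicode in the middle is the point of the design: the two
byte encodings never need to know about each other.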

Good luck!