How to get the ascii code of Chinese characters?

Sat Aug 19 17:02:15 EDT 2006

On 2006-08-19 16:54:36, Peter Maas wrote:

> Gerhard Fiedler wrote:
>> Well, ASCII can represent the Unicode numerically -- if that is what the OP
>> wants.
> 
> No. The ASCII character range is 0..127 while the Unicode character
> range is at least 0..65535.

Actually, Unicode goes beyond 65535 (code points run up to U+10FFFF). But
right in this sentence, you represented the number 65535 with ASCII
characters, so it doesn't seem to be impossible.
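
A minimal sketch of that point, assuming Python 3 (where str holds Unicode
text); the variable names are only for illustration. ord() gives the code
point as a number, and the decimal digits of that number are plain ASCII:

```python
# Assumes Python 3, where str holds Unicode text.
ch = "\u81ec"                         # the Hanzi character U+81EC
code_point = ord(ch)                  # numeric code point, 33260 (0x81EC)
digits = str(code_point)              # the number written as decimal digits
ascii_bytes = digits.encode("ascii")  # would raise if anything were non-ASCII
print(code_point, ascii_bytes)        # 33260 b'33260'
```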

>> For example, "U+81EC" (all ASCII) is one possible -- not very
>> readable though <g> -- representation of a Hanzi character (see
>> http://www.cojak.org/index.php?function=code_lookup&term=81EC).
> 
> U+81EC means a Unicode character which is represented by the number
> 0x81EC. 

Exactly. Both versions represented in ASCII right in your message :)

> UTF-8 maps Unicode strings to sequences of bytes in the range 0..255,
> UTF-7 maps Unicode strings to sequences of bytes in the range 0..127.
> You *could* read the latter as ASCII sequences but this is not correct.

Of course not "correct". I guess the only "correct" representation is the
original Chinese character. But the OP doesn't seem to want this... so a
non-"correct" representation is necessary anyway.

> How to do it in Python? Let chinesePhrase be a Unicode string with
> Chinese content. Then
> 
> chinesePhrase_7bit = chinesePhrase.encode('utf-7')
> 
> will produce a sequence of bytes in the range 0..127 representing
> chinesePhrase and *looking like* a (meaningless) ASCII sequence.

Actually, no. There are quite a few code positions in the range 0..127 that
don't "look like" anything (non-printable). And, as you say, this is rather
meaningless.
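
For what it's worth, here is a small sketch (assuming Python 3; the names
are mine) of what the UTF-7 route gives you. The bytes do stay in the
range 0..127 and decode back to the original, but the result is an opaque
token, not something a person would read as the character's code:

```python
# Assumes Python 3, where str holds Unicode text.
chinese_phrase = "\u81ec"                    # the Hanzi character U+81EC
utf7_bytes = chinese_phrase.encode("utf-7")  # bytes object, 7-bit values only
print(utf7_bytes)                            # an opaque base64-style token
print(max(utf7_bytes) < 128)                 # every byte fits in 7 bits
print(utf7_bytes.decode("utf-7") == chinese_phrase)  # round-trips losslessly
```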

> chinesePhrase_16bit = chinesePhrase.encode('utf-16be')
> 
> will produce a sequence with Unicode numbers packed in a byte
> string in big endian order. This is probably closest to what
> the OP wants.

That's what you think... but it's not really ASCII. If you want this in
ASCII, and readable, I still suggest transforming this sequence of 2-byte
values (for Chinese characters in the Basic Multilingual Plane it will be
2 bytes per character) into a sequence of something like U+81EC (or 0x81EC
if you are a C fan or 81EC if you can imply the rest)... that's where we
come back to my original suggestion :)
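
A rough sketch of that suggestion, assuming Python 3 (where str holds
Unicode text). The helper name to_u_notation is made up for this example,
and the second character (U+4E2D) is just an arbitrary extra code point:

```python
# Assumes Python 3. to_u_notation is a hypothetical helper, not a library call.
def to_u_notation(text):
    """Return space-separated U+XXXX labels, one per code point in text."""
    return " ".join("U+%04X" % ord(ch) for ch in text)

phrase = "\u81ec\u4e2d"
label = to_u_notation(phrase)
print(label)           # U+81EC U+4E2D
label.encode("ascii")  # succeeds, so the result is pure ASCII
```

Using ord() on each character works on code points directly, so it also
handles characters beyond 65535 without going through UTF-16 first.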

Gerhard