replacing Chinese chars with their spellings

John Machin sjmachin at lexicon.net
Thu Apr 25 06:12:21 EDT 2002


Boudewijn Rempt <boud at valdyas.org> wrote in message news:<3cc7967a$0$37911$e4fe514c at dreader3.news.xs4all.nl>...
> John Machin wrote:
> > 
> > Presumably the point of having a multi-character pronunciation table
> > is that it is possible that pronounce("xy") can be != pronounce("x") +
> > pronounce("y"). With careful thought, you may be able to remove
> > redundant entries from your more-than-one-char dicts, so that they
> > contain only the necessary exception cases -- but do try the basic
> > approach first.
> > 
> 
> Isn't big-5 a variable length encoding? I thought that was his
> problem, not translating two or more character words.

big5 is a 1-2 byte encoding. A byte in the range 0-127 is more-or-less
ASCII; a byte in the range 128-255 (or a narrower range, depending on
the big5 variant) is the first byte of a two-byte Chinese character.
So it's variable-length only to that extent.
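
To make that concrete, here is a rough sketch (modern Python 3, not
anything from the OP's code) of walking a big5 byte string and cutting
it into one-byte and two-byte units. The function name split_big5 is
made up for illustration, and the lead-byte test is deliberately the
loose "128 or above" rule described above, not the exact range of any
particular big5 variant. Note that the second byte of a pair can
itself be below 128, which is why you have to scan from the start of
the string instead of testing bytes in isolation.

```python
def split_big5(data):
    """Split a big5 byte string into 1-byte and 2-byte units.

    Assumption: any byte >= 0x80 starts a two-byte character. Real big5
    variants use a narrower lead-byte range; this is only a sketch.
    """
    units = []
    i = 0
    while i < len(data):
        # ASCII-range byte, or a dangling lead byte at the very end:
        # emit it as a single-byte unit.
        if data[i] < 0x80 or i + 1 >= len(data):
            units.append(data[i:i + 1])
            i += 1
        else:
            # Lead byte plus whatever follows it forms one character.
            units.append(data[i:i + 2])
            i += 2
    return units


if __name__ == "__main__":
    # Two ASCII letters interleaved with two two-byte sequences.
    print(split_big5(b"a\xa4\xa4b\xa4\xe5"))
```

Once the string is cut up like this (or simply decoded to Unicode with
the "big5" codec), each unit is one character regardless of how many
bytes it took.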

Contrary to popular mythology, Chinese words can have more than one
syllable, and hence more than one character. As the OP said:

> ["big5" 2, 4, 6 ... byte long strings] there
> with their pronunciations.  If it were just one character [two byte]
> words I would use the "c2t" program.
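
For what it's worth, the multi-character lookup can be sketched as a
greedy longest-match scan. This is only one plausible approach under
some assumptions, not the OP's program and not "c2t": the text has
already been decoded to a Unicode string, the function name
replace_longest is made up, and the tiny table below is purely
illustrative sample data. At each position it tries the longest key
first, so a multi-character entry wins over its single-character
parts; anything not in the table is passed through unchanged.

```python
def replace_longest(text, table):
    """Replace table keys found in text, preferring the longest match.

    'table' maps strings of one or more characters to pronunciations.
    Characters with no entry are copied through unchanged.
    """
    if not table:
        return text
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one char.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No key starts here: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Illustrative sample entries only (hypothetical table).
    sample = {
        "\u9280": "yin2",             # single character
        "\u884c": "xing2",            # single character
        "\u9280\u884c": "yin2hang2",  # two-character word, differs from parts
    }
    print(replace_longest("\u9280\u884c \u884c", sample))
```

The point of trying the longest key first is exactly the case where
pronounce("xy") differs from pronounce("x") + pronounce("y"): the
two-character entry is consulted before the single-character ones.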


