steven.bethard at gmail.com
Mon Aug 27 05:48:30 CEST 2007
Paul McGuire wrote:
> On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw... at ginstrom.com> wrote:
>> The only caveat being that since Chinese and Japanese scripts don't
>> typically delimit "words" with spaces, I think you'd have to pass the text
>> through a tokenizer (like ChaSen for Japanese) before using PyParsing.
> Did you think pyparsing is so mundane as to require spaces between
> tokens? Pyparsing has been doing this type of token-recognition since
> Day 1. Looking for tokens without delimiting spaces was one of the
> first applications for pyparsing. This issue is not unique to Chinese
> or Japanese text. Pyparsing will easily find the tokens in this
The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.
In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.
The closest analog would be to ask pyparsing to find the words in the
Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.
More information about the Python-list