Parser Generator?

Steven Bethard steven.bethard at gmail.com
Mon Aug 27 05:48:30 CEST 2007


Paul McGuire wrote:
> On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw... at ginstrom.com> wrote:
>> The only caveat being that since Chinese and Japanese scripts don't
>> typically delimit "words" with spaces, I think you'd have to pass the text
>> through a tokenizer (like ChaSen for Japanese) before using PyParsing.
> 
> Did you think pyparsing is so mundane as to require spaces between
> tokens?  Pyparsing has been doing this type of token-recognition since
> Day 1.  Looking for tokens without delimiting spaces was one of the
> first applications for pyparsing.  This issue is not unique to Chinese
> or Japanese text.  Pyparsing will easily find the tokens in this
> string:
> 
> y=a*x**2+b*x+c
> 
> as
> 
> ['y','=','a','*','x','**','2','+','b','*','x','+','c']

The difference is that in the expression above (and in many other 
tokenization problems) you can determine "word" boundaries by looking at 
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.
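That kind of character-class tokenization needs no dictionary at all; a plain regular expression over the standard library reproduces Paul's token list (the pattern below is my own sketch, not something from pyparsing):

```python
import re

# Boundaries fall wherever the character class changes:
# letters, digits, or operator punctuation. "**" must be
# tried before "*" so the longer operator wins.
tokens = re.findall(r"[A-Za-z]+|\d+|\*\*|[=+*]", "y=a*x**2+b*x+c")
print(tokens)
# ['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']
```

No knowledge of what "y" or "x" means is required; the character classes alone determine where one token ends and the next begins.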

In Japanese and Chinese tokenization, word boundaries are not marked by 
different classes of characters. They only exist in the mind of a reader 
who knows, given the context, which character sequences could be words 
and which couldn't.

The closest analog would be to ask pyparsing to find the words in the 
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds 
of tasks have relied on statistical machine learning.
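To see why, here is the classic non-statistical baseline, greedy longest-match ("maximum matching") against a dictionary, run on the sentence above. The toy lexicon is hand-picked for this one example; with a realistic lexicon, many positions admit several valid matches, which is exactly the ambiguity statistical models are brought in to resolve:

```python
# Toy lexicon chosen so the example sentence segments cleanly.
LEXICON = {
    "the", "pyparsing", "module", "provides", "a", "library", "of",
    "classes", "that", "client", "code", "uses", "to", "construct",
    "grammar", "directly", "in", "python",
}

def greedy_segment(text, lexicon):
    """Greedy longest-match segmentation; returns None on failure."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one char.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            return None  # no lexicon word starts at position i
    return words

sentence = ("Thepyparsingmoduleprovidesalibraryofclassesthatclientcode"
            "usestoconstructthegrammardirectlyinPythoncode")
print(greedy_segment(sentence.lower(), LEXICON))
```

Greedy matching works here only because the lexicon was rigged; drop "client" from it and the whole segmentation fails, and with a full dictionary the greedy choice is often simply wrong, which is why the successful systems score alternative segmentations statistically instead.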

STeVe
