Problem processing Chinese character with Python

Anthony Liu antonyliu2002 at yahoo.com
Sat Mar 6 05:05:11 EST 2004


Andrew gave me some sample code which lets me read a
text file sentence by sentence.

Suppose I just want to read the part between two full
stops each time.

It works nicely with English text files, where the
full stop is a dot (.).  

But when I tried to read Chinese text files, I found
that it sometimes reads several sentences at once.

I guess the reason is that in Chinese, the full stop
is not a dot (.), but a little circle, as many of you
probably know.  

Indeed, if I replace the Chinese full stop with the
dot, it nicely gets only one sentence each time.

So, how should I fix this problem?  I am really having
a headache processing Chinese characters with Python.

Here is the sample code that Andrew offered:

def bytes(f):
    # To process Chinese, I change f.read(1) below to f.read(2)
    for byte in iter(lambda: f.read(1), ''):
        yield byte

def sentences(iterable):
    sentence = ''
    for char in iterable:
        sentence += char
        # The little circle is the Chinese
        # full stop. Some of you might not be able
        # to view it if you don't have
        # East Asian language support.
        if char in ('。','.'):
            yield sentence.strip()
            sentence = ''
    sentence = sentence.strip()
    if sentence:
        yield sentence
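
A likely cause, offered as a sketch rather than a confirmed
diagnosis: the file is read one byte at a time, while the Chinese
full stop occupies more than one byte in common encodings (two in
GBK, three in UTF-8), so a single byte never equals '。'. Reading
two bytes at a time can drift out of alignment as soon as a
one-byte ASCII character appears. The example below assumes
Python 3 and a GBK-encoded file (neither is stated above); the
names chars and split_sentences and the sample string are
illustrative only. It decodes the bytes into text first, so that
read(1) returns one whole character.

```python
import io

CHINESE_FULL_STOP = '\u3002'  # the little circle: 。

def chars(f):
    """Yield decoded characters one at a time from a text-mode file."""
    return iter(lambda: f.read(1), '')

def split_sentences(iterable):
    """Group characters into sentences ending in '.' or the Chinese full stop."""
    sentence = ''
    for char in iterable:
        sentence += char
        if char in (CHINESE_FULL_STOP, '.'):
            yield sentence.strip()
            sentence = ''
    sentence = sentence.strip()
    if sentence:
        yield sentence

# In GBK the Chinese full stop is two bytes, so a one-byte read never matches it.
full_stop_size = len(CHINESE_FULL_STOP.encode('gbk'))

# Assumed sample data: GBK-encoded bytes standing in for the real file.
raw = '你好。我很好。Fine.'.encode('gbk')

# TextIOWrapper decodes the byte stream; read(1) now yields characters, not bytes.
with io.TextIOWrapper(io.BytesIO(raw), encoding='gbk') as f:
    result = list(split_sentences(chars(f)))

for s in result:
    print(s)
```

For a real file, open(path, encoding='gbk') would take the place
of the in-memory wrapper, with the encoding changed to whatever
the file actually uses.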






More information about the Python-list mailing list