Problem processing Chinese character with Python

Anthony Liu antonyliu2002 at yahoo.com
Sun Mar 7 03:01:36 EST 2004


Hey, I fiddled with the Chinese punctuation, and it
works elegantly now.

Thanks a lot!
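
The post does not show what the change was. As a rough sketch only, assuming the text has already been decoded to characters and that sentences end at the ideographic full stop (U+3002), the fullwidth exclamation mark (U+FF01) or the fullwidth question mark (U+FF1F), the splitting might look like this in present-day Python 3. The names `CHINESE_STOPS` and `split_sentences` are made up for illustration:

```python
# Hypothetical sketch: split decoded text into sentences at Chinese
# sentence-ending punctuation. This is not the code from the thread.
CHINESE_STOPS = "\u3002\uff01\uff1f"  # ideographic full stop, fullwidth ! and ?


def split_sentences(text, stops=CHINESE_STOPS):
    """Yield each sentence, including its closing punctuation mark."""
    current = []
    for ch in text:
        current.append(ch)
        if ch in stops:
            sentence = "".join(current).strip()
            if sentence:
                yield sentence
            current = []
    # Whatever follows the last stop is yielded as a final fragment.
    tail = "".join(current).strip()
    if tail:
        yield tail
```

Because the comparison is done on characters rather than bytes, the two- or three-byte encoding of the full stop no longer matters at this stage.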

--- Andrew Bennetts <andrew-pythonlist at puzzling.org> wrote:
> On Sat, Mar 06, 2004 at 02:05:11AM -0800, Anthony Liu wrote:
> > Andrew gave me some sample code which lets me read a text
> > file sentence by sentence.
> > 
> > Suppose I just want to read the part between two full
> > stops each time.
> > 
> > It works nicely with English text files, where the
> > full stop is a dot (.).
> > 
> > But when I tried to read Chinese text files, I found
> > that it sometimes reads a few sentences at one time.
> 
> Yep -- you'll notice I'm reading bytes, but the sentences generator is
> expecting characters.  That assumption holds for ASCII, but not for many
> other encodings.
> 
> You need some way of reading *characters*, rather than bytes, from the file.
> To do this you need to know the encoding of the file (of course), and then I
> guess you need to try to decode the bytes as you read them in.  I'm just a
> boring mono-lingual English speaker, so I haven't really played with Unicode
> much, but I guess something along these lines would work:
> 
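> def characters(textFile, encoding):
>     # textFile must be opened in binary mode ('rb').
>     pending = b''
>     for byte in iter(lambda: textFile.read(1), b''):
>         pending += byte
>         try:
>             # An incomplete multi-byte sequence fails to decode.
>             char = pending.decode(encoding)
>         except UnicodeDecodeError:
>             continue  # keep accumulating bytes
>         pending = b''
>         yield char
>     if pending:
>         # Leftover bytes: raises UnicodeDecodeError on truncated input.
>         yield pending.decode(encoding)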
>         
> Hopefully someone who knows more about unicode will tell me if I've somehow
> got this completely wrong.
> 
> Again, reading one byte at a time is pretty inefficient.  You can probably
> optimise fairly easily by reading and decoding large chunks.
> 
> -Andrew.
> 
> 
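
Andrew's closing suggestion, reading and decoding large chunks, can be sketched with the standard library's incremental decoders, which keep a partial multi-byte character buffered between chunks. This is an illustration in present-day Python 3, not code from the thread; the function name `characters_chunked` and the 4096-byte default are assumptions, and the file is assumed to be opened in binary mode:

```python
import codecs


def characters_chunked(binary_file, encoding, chunk_size=4096):
    """Yield decoded characters from a file opened in binary mode."""
    # The incremental decoder holds back trailing bytes that do not yet
    # form a complete character, so chunk boundaries are safe.
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in iter(lambda: binary_file.read(chunk_size), b""):
        for ch in decoder.decode(chunk):
            yield ch
    # final=True raises UnicodeDecodeError if the file ends mid-character.
    for ch in decoder.decode(b"", final=True):
        yield ch
```

For example, `characters_chunked(open(path, 'rb'), 'gb2312')` would yield the characters of a GB2312 file, where `path` is a placeholder for the file name.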

