[Tutor] help related to unicode using python

Thu Mar 21 05:08:36 CET 2013

Reply inline.
On 21/03/13 12:18 AM, Steven D'Aprano wrote:
> On 20/03/13 22:38, nishitha reddy wrote:
>> Hi all
>> i'm working with unicode using python
>> i have some txt files in telugu i want to split all the lines of that
>> text files in to words of telugu
>> and i need to classify  all of them using some identifiers.can any one
>> send solution for that
>
>
> Probably not. I would be surprised if anyone here knows what Telugu is,
> or the rules for splitting Telugu text into words. The Natural Language
> Toolkit (NLTK) may be able to handle it.
>
> You could try doing the splitting and classifying yourself. If Telugu
> uses space-delimited words like English, you can do it easily:
>
> data = u"ఏఐఒ ఓఔక ఞతణథ"
> words = data.split()
Unicode characters for Telugu:
http://en.wikipedia.org/wiki/Telugu_alphabet#Unicode

On Python 3.x:

>>> import re
>>> a='ఏఐఒ ఓఔక ఞతణథ'
>>> print(a)
ఏఐఒ ఓఔక ఞతణథ
>>> re.split('[^\u0c01-\u0c7f]', a)
['ఏఐఒ', 'ఓఔక', 'ఞతణథ']
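
If the text contains punctuation or repeated spaces, splitting on a single non-Telugu character leaves empty strings in the result. A minimal sketch of one way around that, assuming it is enough to treat every run of code points from the Telugu block (U+0C00 to U+0C7F) as a word; the name telugu_words is made up for illustration:

```python
import re

# One or more consecutive code points from the Telugu block
# (U+0C00 to U+0C7F). Anything outside the block ends the word.
TELUGU_WORD = re.compile('[\u0c00-\u0c7f]+')

def telugu_words(text):
    """Return the runs of Telugu characters found in text (hypothetical helper)."""
    return TELUGU_WORD.findall(text)

# Comma, double space and full stop are all skipped.
print(telugu_words('ఏఐఒ,  ఓఔక ఞతణథ.'))
```

Because findall collects the matches instead of cutting at separators, leading or trailing punctuation does not produce empty entries.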

Similar logic can be used for any other Indic script by substituting that
script's Unicode block range (for example, U+0900 to U+097F for Devanagari).
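
One way to sketch this for several scripts is to look up each script's Unicode block range and build the pattern from it. The table and the name split_words below are my own illustration, not part of any library, and the approach assumes that everything inside a block counts as part of a word (so, for instance, the Devanagari danda, which sits inside its block, would not be treated as a separator):

```python
import re

# Unicode block ranges (first, last code point) for a few Indic scripts.
SCRIPT_BLOCKS = {
    'devanagari': (0x0900, 0x097F),
    'tamil': (0x0B80, 0x0BFF),
    'telugu': (0x0C00, 0x0C7F),
    'kannada': (0x0C80, 0x0CFF),
}

def split_words(text, script):
    """Split text on runs of characters outside the given script's block."""
    first, last = SCRIPT_BLOCKS[script]
    # Negated character class: one or more characters NOT in the block.
    pattern = '[^{}-{}]+'.format(chr(first), chr(last))
    # Drop the empty strings that appear when text starts or ends
    # with a separator.
    return [word for word in re.split(pattern, text) if word]

print(split_words('ఏఐఒ ఓఔక ఞతణథ', 'telugu'))
print(split_words('शंतनू, नमस्ते', 'devanagari'))
```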

HTH.

-- 
शंतनू