help needed with regex and unicode
Mark Tolonen
mark.e.tolonen at mailinator.com
Tue Mar 4 02:43:32 EST 2008
"Marc 'BlackJack' Rintsch" <bj_666 at gmx.net> wrote in message
news:6349rmF23qmbmU1 at mid.uni-berlin.de...
> On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote:
>
>> I have a file which contains Chinese characters. I just want to find out
>> all the places that these Chinese characters occur.
>>
>> The following script doesn't seem to work :(
>>
>> **********************************************************************
>> class RemCh(object):
>>     def __init__(self, fName):
>>         self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
>>         fp = open(fName, 'r')
>>         content = fp.read()
>>         s = re.search('[\u2F00-\u2fdf]', content, re.U)
>>         if s:
>>             print s.group(0)
>> if __name__ == '__main__':
>>     rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
>> **********************************************************************
>>
>> the php file content is something like the following:
>>
>> **********************************************************************
>> // Check if the folder still has subscribed blogs
>> $subCount = function1($param1, $param2);
>> if ($subCount > 0) {
>> $errors['summary'] = 'æÂï½ æ½å¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
>> $errorMessage = 'æÂï½ æ½å¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
>> }
>
> Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
> decode `content` to unicode before searching for the Chinese characters.
>
I couldn't get your data to decode into anything resembling Chinese, so I
created my own file as an example. When you read an encoded text file, it
comes in as just a bunch of bytes:
>>> print open('chinese.txt','r').read()
我是美国人。 Wǒ shì Měiguórén. I am an American.
That displays as garbage whenever the terminal's encoding doesn't match the
file's, because the bytes are printed without being decoded. Provide the
correct encoding and decode it to Unicode:
>>> print open('chinese.txt','r').read().decode('utf8')
我是美国人。 Wǒ shì Měiguórén. I am an American.
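To see why the wrong encoding produces garbage, here is a minimal sketch that
needs no file. It assumes the garbling comes from UTF-8 bytes being read as
ISO-8859-1, as Marc guessed; the two-character sample and the variable names
are my own, and it should run on Python 2.6+ as well as Python 3:

```python
# Two Chinese characters as a Unicode string (sample data, not from the post).
text = u'\u6211\u662f'

# The bytes as they would be stored in a UTF-8 file: three bytes per character.
raw = text.encode('utf-8')

# Wrong codec: ISO-8859-1 maps every byte to one character, so six bytes
# become six unrelated Latin-1 characters.
garbled = raw.decode('iso-8859-1')

# Right codec: the original two characters come back.
decoded = raw.decode('utf-8')

print(len(raw))         # 6
print(len(garbled))     # 6
print(len(decoded))     # 2
print(decoded == text)  # True
```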
Here's the Unicode string. Note the 'u' before the quotes to indicate
Unicode.
>>> s=open('chinese.txt','r').read().decode('utf8')
>>> s
u'\ufeff\u6211\u662f\u7f8e\u56fd\u4eba\u3002 W\u01d2 sh\xec M\u011bigu\xf3r\xe9n. I am an American.'
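The leading u'\ufeff' in that string is the byte order mark my editor wrote
at the start of the file. If you don't want it in your data, the 'utf-8-sig'
codec drops it while decoding. A small sketch, assuming the file is UTF-8
with or without a BOM; read_text is a hypothetical helper name, not
something from the standard library:

```python
import codecs
import io

def read_text(path, encoding='utf-8-sig'):
    """Return the file's contents as Unicode, dropping a leading BOM if present."""
    with io.open(path, 'r', encoding=encoding) as fp:
        return fp.read()

# The same effect on raw bytes, without touching a file.
raw = codecs.BOM_UTF8 + u'\u6211'.encode('utf-8')
with_bom = raw.decode('utf-8')         # BOM survives as u'\ufeff'
without_bom = raw.decode('utf-8-sig')  # BOM is removed

print(len(with_bom))     # 2
print(len(without_bom))  # 1
```

Calling read_text('chinese.txt') on the file above should give the same
string as before, minus the first character.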
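When searching Unicode strings, give the re module a Unicode pattern as
well. Note also that the range U+2F00-U+2FDF in the original script covers
only the Kangxi radicals; the common ideographs start at U+4E00: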
>>> print re.search(ur'[\u4E00-\u9FA5]',s).group(0)
我
>>> print re.findall(ur'[\u4E00-\u9FA5]',s)
[u'\u6211', u'\u662f', u'\u7f8e', u'\u56fd', u'\u4eba']
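Since the original question asked for all the places the characters occur,
here is a rough sketch that reports positions with re.finditer. It assumes
"Chinese" means the same U+4E00-U+9FA5 range used above, so punctuation such
as U+3002 is not matched; find_chinese and the sample string are
hypothetical names of my own:

```python
import re

# One or more consecutive characters in the range used above
# (an assumption about what counts as a Chinese character).
CJK_RUN = re.compile(u'[\u4e00-\u9fa5]+')

def find_chinese(text):
    """Return (start, end, run) for each run of matching characters in text."""
    return [(m.start(), m.end(), m.group(0)) for m in CJK_RUN.finditer(text)]

# Sample data: five ideographs, an ideographic full stop, then ASCII text.
sample = u'\u6211\u662f\u7f8e\u56fd\u4eba\u3002 I am an American.'

for start, end, run in find_chinese(sample):
    print('%d-%d' % (start, end))  # 0-5
```

The offsets index into the decoded Unicode string, not into the bytes of the
file, so decode first and search afterwards.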
Hope that helps you.
--Mark