help needed with regex and unicode

Tue Mar 4 01:51:35 EST 2008

On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote:

> I have a file which contains chinese characters. I just want to find out
> all the places that these chinese characters occur.
> 
> The following script doesn't seem to work :(
> 
> **********************************************************************
> class RemCh(object):
>     def __init__(self, fName):
>         self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
>         fp = open(fName, 'r')
>         content = fp.read()
>         s = re.search('[\u2F00-\u2fdf]', content, re.U)
>         if s:
>             print s.group(0)
> if __name__ == '__main__':
>     rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
> **********************************************************************
> 
> the php file content is something like the following:
> 
> **********************************************************************
>     // Check if the folder still has subscribed blogs
>     $subCount = function1($param1, $param2);
>     if ($subCount > 0) {
>         $errors['summary'] = 'Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
>         $errorMessage  = 'Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
>     }

Looks like a UTF-8 encoded file viewed as ISO-8859-1.  So you should
decode `content` to unicode before searching for the Chinese characters.
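
A minimal sketch of what I mean, not tested against your actual file.
It assumes the file really is UTF-8, and that the characters you are
after live in the CJK Unified Ideographs block (U+4E00 to U+9FFF); the
U+2F00 to U+2FDF range in your pattern is the Kangxi Radicals block,
which ordinary Chinese text rarely uses.  The names `find_cjk` and
`find_cjk_in_file` are made up for this example.

```python
# -*- coding: utf-8 -*-
import io
import re

# Assumption: "Chinese characters" means the CJK Unified Ideographs block.
# The pattern is a unicode string, so the \u escapes are real code points.
CJK_RUN = re.compile(u'[\u4e00-\u9fff]+')


def find_cjk(text):
    """Return (offset, run) pairs for every run of CJK ideographs.

    `text` must already be a decoded unicode string, not raw bytes.
    """
    return [(m.start(), m.group(0)) for m in CJK_RUN.finditer(text)]


def find_cjk_in_file(path, encoding='utf-8'):
    """Decode the file while reading it, then search the decoded text."""
    # io.open converts the bytes on disk to unicode using `encoding`.
    with io.open(path, 'r', encoding=encoding) as fp:
        return find_cjk(fp.read())
```

Calling `find_cjk_in_file` with the path to your PHP file should then
give you the offsets and the matched runs.  If the same bytes are
decoded as ISO-8859-1 instead, every character comes out below U+0100
and the pattern finds nothing, which would explain why the mangled
text in your mail does not match.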

Ciao,
	Marc 'BlackJack' Rintsch