help needed with regex and unicode
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Tue Mar 4 01:51:35 EST 2008
On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote:
> I have a file which contains chinese characters. I just want to find out
> all the places that these chinese characters occur.
>
> The following script doesn't seem to work :(
>
> **********************************************************************
> class RemCh(object):
>     def __init__(self, fName):
>         self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
>         fp = open(fName, 'r')
>         content = fp.read()
>         s = re.search('[\u2F00-\u2fdf]', content, re.U)
>         if s:
>             print s.group(0)
> if __name__ == '__main__':
>     rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
> **********************************************************************
>
> the php file content is something like the following:
>
> **********************************************************************
> // Check if the folder still has subscribed blogs
> $subCount = function1($param1, $param2);
> if ($subCount > 0) {
> $errors['summary'] = 'æÂï½ æ½å¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
> $errorMessage = 'æÂï½ æ½å¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
> }
Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
decode `content` to unicode before searching for the Chinese characters.
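
A minimal sketch of that idea, assuming the file really is UTF-8. The
helper name `find_chinese` and the sample string are made up for
illustration. The character range is also an assumption: U+2F00-U+2FDF
from the script above is the Kangxi Radicals block, while ordinary
Chinese text mostly falls in CJK Unified Ideographs (U+4E00-U+9FFF),
so the sketch uses the latter.

```python
import re

# Assumption: "chinese characters" means the CJK Unified Ideographs block
# (U+4E00-U+9FFF). The pattern is a unicode string, so the \u escapes are
# resolved by Python itself before the regex is compiled.
CJK_PATTERN = re.compile(u'[\u4e00-\u9fff]+')


def find_chinese(raw_bytes, encoding='utf-8'):
    """Decode raw bytes to unicode, then return (offset, text) per CJK run."""
    text = raw_bytes.decode(encoding)
    return [(m.start(), m.group(0)) for m in CJK_PATTERN.finditer(text)]


# Hypothetical sample standing in for one line of the PHP file.
sample = u"$errors['summary'] = '\u4f60\u597d';".encode('utf-8')
matches = find_chinese(sample)
```

For a real file, read it in binary mode and pass the bytes in, e.g.
`find_chinese(open(fName, 'rb').read())`. The offsets are positions in
the decoded text, not byte offsets in the file.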
Ciao,
Marc 'BlackJack' Rintsch