Need a Regular expression to remove a char for Unicode text

harvey.thomas at informa.com
Fri Oct 13 07:44:26 EDT 2006


శ్రీనివాస wrote:
> Hi friends,
> Can anyone tell me how I can remove a character from a Unicode text?
> కల్‌&హార is a Telugu word in Unicode. Here i want to
> remove '&' but not replace it with a zero-width char. And one more thing:
> if there is any whitespace before and after the '&' char, the text should
> be kept as it is. Please tell me how I can work this out with regular
> expressions.
>
> Thanks and regards
> Srinivasa Raju Datla

Don't know anything about Telugu, but is this the approach you want?

>>> import re
>>> x = u'\xfe\xff & \xfe\xff \xfe\xff&\xfe\xff'
>>> noampre = re.compile(r'(?<!\s)&(?!\s)', re.UNICODE).sub
>>> noampre('', x)
u'\xfe\xff & \xfe\xff \xfe\xff\xfe\xff'

The regular expression uses a negative lookbehind and a negative lookahead
assertion to check that the '&' character has no whitespace immediately
before it and none immediately after it. Each match found is then replaced
with the empty string.
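
For anyone wanting to try the same idea outside the interactive prompt, here
is a minimal sketch as a small function. It assumes Python 3, where str is
already Unicode and \s matches Unicode whitespace without the re.UNICODE
flag. The function name and the sample strings are only illustrative; the
Telugu letters are written as escapes so the code does not depend on the
file's encoding.

```python
import re

# Match '&' only when the character before it and the character after it
# are both non-whitespace (or absent, at the start or end of the string).
AMP_NOT_SPACED = re.compile(r'(?<!\s)&(?!\s)')


def strip_unspaced_ampersands(text):
    """Delete every '&' that has no whitespace on either side."""
    return AMP_NOT_SPACED.sub('', text)


# Illustrative samples: Telugu letters with '&' joined and with '&' spaced.
joined = '\u0c15\u0c32\u0c4d&\u0c39\u0c3e\u0c30'
spaced = '\u0c15\u0c32\u0c4d & \u0c39\u0c3e\u0c30'

# The joined '&' is removed and nothing is put in its place.
print(strip_unspaced_ampersands(joined) == '\u0c15\u0c32\u0c4d\u0c39\u0c3e\u0c30')
# The spaced '&' is left untouched.
print(strip_unspaced_ampersands(spaced) == spaced)
```

Note that an '&' with whitespace on only one side (as in 'a &b') is also left
alone, because both assertions have to hold for a match.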



