How to use Unicode regexes?

Martin von Loewis loewis at informatik.hu-berlin.de
Sat Jul 28 09:44:10 CEST 2001


rhys tucker <rhystucker at rhystucker.fsnet.co.uk> writes:

> Could somebody show me how to do Unicode regexes? I'm trying to
> write a strings-like utility for windows - so I want to match ascii
> and unicode characters in a binary file. Do I need one regex pattern
> since ascii and Unicode are similar for ascii text characters or are
> 2 regex patterns needed since they are different byte sizes?

You can use exactly the same regular expression for both byte and
Unicode strings, but this seems not to be your question.

It is not clear to me what exactly you are trying to achieve.  What do
you mean by "unicode characters in a binary file"? In a binary file,
there are no characters, only bytes. You need to know what encoding
was used for the Unicode strings (UTF-8, UCS-2, ...) before being able
to determine whether a certain Unicode string appears in a certain
file.
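
To illustrate why the encoding matters, here is a minimal sketch. It
assumes Python 3 (which postdates this message) and an arbitrary sample
string. The same text becomes different byte sequences under UTF-8 and
under UTF-16-LE, the little-endian two-byte form that Windows uses for
UCS-2 text, so a byte-level search only succeeds when the right
encoding is assumed.

```python
# Sketch for Python 3; the sample text is arbitrary.
text = "Hallo"

utf8_bytes = text.encode("utf-8")       # one byte per ASCII character
utf16_bytes = text.encode("utf-16-le")  # two bytes per character, low byte first

print(utf8_bytes)   # b'Hallo'
print(utf16_bytes)  # b'H\x00a\x00l\x00l\x00o\x00'

# Hypothetical file contents: the string stored as UTF-16-LE between other bytes.
data = b"\x01\x02" + utf16_bytes + b"\xff"
print(utf8_bytes in data)   # False
print(utf16_bytes in data)  # True
```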

> The documentation suggest that I need to use \w pattern to match
> Unicode and set UNICODE. I'm not sure what and how to set Unicode.

Where does it say that? \w is about "alphanumeric characters": the
documentation says that \w matches all characters that are marked as
alphanumeric in the Unicode character database if the UNICODE flag is
set. To match Unicode strings, you don't need \w at all:

>>> import re
>>> re.search(u"al", u"Hallo")
<SRE_Match object at 0x81db868>

This finds one Unicode string in another; no need for \w or the UNICODE
flag.

To specify the UNICODE flag, either pass re.UNICODE as the second argument
to re.compile, or put (?u) at the start of your expression.
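
As a rough sketch of both ways of setting the flag, assuming Python 3:
there, str patterns are already Unicode-aware, so re.UNICODE is accepted
but redundant, and re.ASCII is used here only to show the contrast. The
sample word is an arbitrary choice.

```python
import re

# Sketch for Python 3. The sample word is "Haeuser" spelled with a-umlaut.
word = "H\u00e4user"

by_flag = re.compile(r"\w+", re.UNICODE)   # flag passed as second argument
by_inline = re.compile(r"(?u)\w+")         # flag set inside the pattern
ascii_only = re.compile(r"\w+", re.ASCII)  # \w limited to [a-zA-Z0-9_]

full_by_flag = by_flag.match(word).group()
full_by_inline = by_inline.match(word).group()
ascii_prefix = ascii_only.match(word).group()

print(full_by_flag == word)    # True: the umlaut counts as alphanumeric
print(full_by_inline == word)  # True
print(ascii_prefix)            # H
```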

> This is what I've done so far - it matches (some ?) ascii characters
> but misses those unicode strings.

It seems that you really are looking for UCS-2 strings in the
file. The Unicode facilities in Python are then of no use for you: You
need to understand how the encoding works, and formulate a pattern
based on that.
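
One possible shape for such a pattern, as a sketch rather than a
finished tool: it assumes Python 3, little-endian UCS-2 (UTF-16-LE) as
found on Windows, and only characters in the printable ASCII range,
each of which is then stored as the ASCII byte followed by a zero byte.
The minimum run length of four and the names ASCII_RUN, UCS2_RUN and
find_strings are illustrative choices.

```python
import re

# Runs of at least four printable ASCII bytes (the threshold is an assumption).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# The same characters in UTF-16-LE/UCS-2: each one followed by a zero byte.
UCS2_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def find_strings(data):
    """Return sorted (offset, text) pairs for ASCII and UCS-2 runs in data."""
    found = []
    for match in ASCII_RUN.finditer(data):
        found.append((match.start(), match.group().decode("ascii")))
    for match in UCS2_RUN.finditer(data):
        found.append((match.start(), match.group().decode("utf-16-le")))
    return sorted(found)


# Hypothetical sample standing in for the contents of a binary file.
sample = (b"\x00\x01plain text\xff\xfe"
          + "wide text".encode("utf-16-le")
          + b"\x00\x02")
for offset, text in find_strings(sample):
    print(offset, text)
```

For a real file, the bytes would come from open(path, "rb").read().
UCS-2 characters outside the ASCII range have a non-zero high byte and
would need a wider byte pattern than the one sketched here.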

Regards,
Martin



