No explanation for weird behavior in re module!

Mon Feb 11 05:09:16 EST 2002

"synthespian" wrote:

> As I understood it from the other posts, the "\w+" on
> the regex will depend on my locales.

not necessarily.  I suggest reading the RE documentation
again; look for the (?L) and (?u) flags (re.LOCALE and re.UNICODE).
if neither flag is present, the engine assumes plain ASCII.
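
a minimal sketch of that default, assuming a current Python 3
interpreter (where an encoded string is a bytes object and the
pattern has to be bytes as well).  the sample text is made up:

```python
import re

# assumption: Python 3, where an encoded string is a bytes object
encoded = "caf\xe9 au lait".encode("iso-8859-1")

# no (?L) or (?u) flag: \w only knows the ASCII letters, digits
# and "_", so the first match stops at the e-acute byte (0xE9)
ascii_words = re.findall(rb"\w+", encoded)

print(ascii_words)
```

the first word comes back without its last letter, because the
byte 0xE9 is not an ASCII word character.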

> Seems like non-ASCII is a real bother in Python...

it isn't.  you just have to know how things work.

1) text files contain *encoded* data.  each character in the
text is encoded as one or more bytes.

2) if you read a line of text from a file, you get an encoded
string.

3) to decode an encoded string into a string of well-defined
characters, you have to know what encoding it uses.

4) to decode a string, use the decode method on the input
string, and pass it the name of the encoding:

    fileencoding = "iso-8859-1"

    infile = open("input.txt", "rb")  # hypothetical file name
    raw = infile.readline()
    str = raw.decode(fileencoding)

the result is a unicode string.
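
steps 1 to 4 can be tried end to end with a throwaway file.  this
is only a sketch, assuming Python 3 (bytes for the encoded string,
str for the unicode string); the file name and contents are
invented for the example:

```python
import os
import tempfile

fileencoding = "iso-8859-1"

# build a small sample file so the sketch is self-contained
# (file name and contents are made up for this example)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"caf\xe9 cr\xe8me\n")

# binary mode returns the bytes exactly as they are stored on disk
with open(path, "rb") as f:
    raw = f.readline()

# encoded bytes -> unicode string
text = raw.decode(fileencoding)

print(type(raw).__name__, type(text).__name__)
```

decoding with the wrong encoding name would either raise an error
or silently give you the wrong characters, which is why step 3
matters.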

5) to create a regular expression pattern that uses Unicode
character classes for \w, use the "(?u)" prefix, or the re.UNICODE
flag:

    import re

    pattern = re.compile("(?u)pattern")
    pattern = re.compile("pattern", re.UNICODE)

that's it.
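
a short sketch showing that the two spellings behave the same on
a decoded string.  it assumes Python 3, where str patterns already
use Unicode character classes, so the flag is harmless there; the
sample text is made up:

```python
import re

text = "caf\xe9 cr\xe8me"  # an already decoded unicode string

# the same flag, written inline and as a compile() argument
inline = re.compile(r"(?u)\w+")
flagged = re.compile(r"\w+", re.UNICODE)

inline_words = inline.findall(text)
flagged_words = flagged.findall(text)

print(inline_words == flagged_words)
```

both patterns return the two complete words, accented letters
included.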

6) to print a unicode string to your output device, you have to
convert it to the encoding used by your terminal:

    import locale
    language, output_encoding = locale.getdefaultlocale()

    print str.encode(output_encoding)
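
one thing to watch for: the locale call can report no encoding at
all (None), and the terminal encoding may not cover every
character.  a defensive sketch, assuming Python 3;
encode_for_output is a hypothetical helper, and it swaps in
locale.getpreferredencoding for the call used above:

```python
import locale


def encode_for_output(text, encoding=None):
    """Encode text for the output device (hypothetical helper)."""
    if encoding is None:
        # assumption: use the locale's preferred encoding, and
        # fall back to ASCII if the locale does not report one
        encoding = locale.getpreferredencoding(False) or "ascii"
    # "replace" turns unencodable characters into "?" instead of
    # raising UnicodeEncodeError
    return text.encode(encoding, "replace")


latin1_bytes = encode_for_output("caf\xe9", "iso-8859-1")
ascii_bytes = encode_for_output("caf\xe9", "ascii")

print(latin1_bytes, ascii_bytes)
```

whether you want "replace" or a hard error depends on the
application; the point is only that the encode step can fail.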

(there are lots of shortcuts, including coded streams, using default
locales for pattern matching, iso-8859-1 as a subset of unicode, etc.,
but that's outside the scope of this post).

</F>