No explanation for weird behavior in re module!
Fredrik Lundh
fredrik at pythonware.com
Mon Feb 11 05:09:16 EST 2002
"synthespian" wrote:
> As I understood it from the other posts, the "\w+" on
> the regex will depend on my locales.
not necessarily. I suggest reading the RE documentation
again; look for the (?L) and (?u) flags (re.LOCALE and re.UNICODE).
if neither flag is present, the engine assumes plain ASCII.
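a small sketch of the difference, under a few assumptions: the sample text is made up, and the b"" and u"" literal prefixes are only there so the snippet also runs unchanged on later Python versions. it compares \w with no flag on an encoded byte string against \w with (?u) on the decoded text (the (?L) case depends on your locale settings and is not shown):

```python
import re

# hypothetical sample: "café au lait" encoded as iso-8859-1
# (0xE9 is the single byte for e-acute in that encoding)
raw = b"caf\xe9 au lait"

# no flag: \w only matches ASCII letters, digits and underscore,
# so the first match stops in front of the 0xE9 byte
ascii_words = re.findall(b"\\w+", raw)

# (?u) on the decoded string: \w follows the Unicode character database
text = raw.decode("iso-8859-1")
unicode_words = re.findall(u"(?u)\\w+", text)
```

on this input the first word in ascii_words is cut off at "caf", while unicode_words keeps the accented word whole.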
> Seems like non-ASCII is a real bother in Python...
it isn't. you just have to know how things work.
1) text files contain *encoded* data. each character in the
text is encoded as one or more bytes.
2) if you read a line of text from a file, you get an encoded
string.
3) to decode an encoded string into a string of well-defined
characters, you have to know what encoding it uses.
4) to decode a string, use the decode method on the input
string, and pass it the name of the encoding:
    fileencoding = "iso-8859-1"
    raw = file.readline()
    str = raw.decode(fileencoding)
the result is a unicode string.
5) to create a regular expression pattern that uses Unicode
character classes for \w, use the "(?u)" prefix, or the re.UNICODE
flag:
    import re
    pattern = re.compile("(?u)pattern")
    pattern = re.compile("pattern", re.UNICODE)
that's it.
6) to print a unicode string to your output device, you have to
convert it to the encoding used by your terminal:
    import locale
    language, output_encoding = locale.getdefaultlocale()
    print str.encode(output_encoding)
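steps 2 through 6 put together, as a minimal sketch rather than a recipe. the assumptions: an in-memory io.BytesIO object stands in for a real file opened in binary mode, the sample line is made up, and the output encoding is hard-coded to utf-8 instead of being taken from the locale. the b"" and u"" prefixes are only there so it also runs on later Python versions:

```python
import io
import re

# stand-in for a real file opened in binary mode: one line of
# iso-8859-1 encoded text (hypothetical sample data)
infile = io.BytesIO(u"na\xefve caf\xe9\n".encode("iso-8859-1"))

fileencoding = "iso-8859-1"   # step 3: you have to know this
output_encoding = "utf-8"     # assumed here; real code would ask the locale

raw = infile.readline()            # step 2: an encoded string
text = raw.decode(fileencoding)    # step 4: now a unicode string

pattern = re.compile(u"(?u)\\w+")  # step 5: Unicode-aware \w
words = pattern.findall(text)

# step 6: encode again before sending anything to the output device
encoded_words = [word.encode(output_encoding) for word in words]
```

the point of the design is that everything between the decode and the encode only ever sees unicode strings; the encodings matter at the edges and nowhere else.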
(there are lots of shortcuts, including coded streams, using default
locales for pattern matching, iso-8859-1 as a subset of unicode, etc,
but that's outside the scope of this post).
</F>