[Tutor] German Umlaut

Nicole Seitz nicole.seitz@urz.uni-hd.de
Mon, 1 Apr 2002 20:05:58 +0100


On Tuesday, 26 March 2002 17:06 you wrote:

> It will match extended characters (eg. French accents, German Umlaut
> letters, etc.). Here is a longer explanation:
>
> \w
> When the LOCALE and UNICODE flags are not specified, matches any
> alphanumeric character; this is equivalent to the set [a-zA-Z0-9_]. With
> LOCALE, it will match the set [0-9_] plus whatever characters are defined
> as letters for the current locale. If UNICODE is set, this will match the
> characters [0-9_] plus whatever is classified as alphanumeric in the
> Unicode character properties database.
> Source: http://www.python.org/doc/current/lib/re-syntax.html

This reminds me of another problem I'm trying to deal with:

I'd like to extract from a text the nouns that are coordinated with the word
"und". I wrote a little regex that seemed to work well at first sight. But
then I realized that it didn't match nouns with umlauts.
This was my first try:

reg = re.compile(r"\b[A-Z][a-z]+-? und [A-Z][a-z]+\b")

Output:

Arm und Reich
Sinn und Unsinn
Schule und Universit
Rock- und Popgr

The last two matches should be:

Schule und Universität  (a-umlaut)
Rock- und Popgrößen (o-umlaut, ß)
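
A minimal sketch of why the matches get cut off, assuming Python 3 (where str patterns are Unicode-aware by default). The re.ASCII flag is used here only to approximate a pattern compiled without LOCALE or UNICODE; the sample text is made up for illustration. Since [a-z] covers ASCII letters only and \b then treats an umlaut as a non-word character, the match ends just before the first umlaut:

```python
import re

# re.ASCII limits \w and \b to ASCII characters; this is an assumption made
# to mimic a pattern compiled without the LOCALE or UNICODE flags.
reg = re.compile(r"\b[A-Z][a-z]+-? und [A-Z][a-z]+\b", re.ASCII)

# Illustrative sample text (not from the original corpus).
text = "Schule und Universität, Rock- und Popgrößen"

# Each match stops right before the first non-ASCII letter.
truncated = reg.findall(text)
print(truncated)
```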

Reading your email I changed my regex to:

reg = re.compile(r"\b[A-Z]\w+-? und [A-Z]\w+\b", re.UNICODE)

But this still doesn't match nouns like "Übung", i.e. nouns where the capital
letter is an umlaut. How can I deal with that?
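
One possible direction, as a sketch only and not a confirmed solution: [A-Z] is an ASCII range, so it can never match "Ü" regardless of flags. Assuming Python 3 and purely German text, the umlauts and ß could be listed explicitly in the character classes. The names UPPER and LOWER and the sample text are hypothetical:

```python
import re

# Hypothetical helper strings: ASCII letters plus the German extras.
# Assumption: only German text is processed, so these sets are sufficient.
UPPER = "A-ZÄÖÜ"
LOWER = "a-zäöüß"

# Same shape as the first regex, with the extended character classes.
reg = re.compile(rf"\b[{UPPER}][{LOWER}]+-? und [{UPPER}][{LOWER}]+\b")

# Illustrative sample text (not from the original corpus).
text = ("Schule und Universität, Rock- und Popgrößen, "
        "Übung und Spiel, Arm und Reich")

matches = reg.findall(text)
print(matches)
```

Listing the letters by hand keeps the "capital first, lowercase after" constraint of the first regex, which \w alone would loosen; for other languages the sets would need to be extended.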

Thanx and Happy Easter! (Wrote this email 2 days ago. Guess I'm now a bit
late.)

Nicole