[Tutor] A regular expression problem

Sun Nov 28 18:14:01 CET 2010

Josep M. Fontana wrote:
> I'm trying to use regular expressions to extract strings that match
> certain patterns in a collection of texts. Basically these texts are
> edited versions of medieval manuscripts that use certain symbols to
> mark information that is useful for philologists.
> 
> I'm interested in isolating words that have some non alpha-numeric
> symbol attached to the beginning or the end of the word or inserted in
> them. Here are some examples:
> 
> '¿de' ,'«orden', '§Don', '·II·', 'que·l', 'Rey»'

Have you considered just using the isalnum() method?

 >>> '¿de'.isalnum()
False

You will have to split your source text into individual words, then 
isolate those where word.isalnum() returns False.
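
A minimal sketch of that approach, assuming Python 3 (where strings are 
Unicode) and a made-up sample line built from the words in the question. 
The names `text`, `words` and `flagged` are only illustrative, and the 
NFC normalisation step is an extra precaution so that accented letters 
are stored as single characters.

```python
import unicodedata

# Hypothetical sample text, assembled from the example words above.
text = "¿de «orden §Don ·II· que·l Rey» y el rey también"

# Compose accented letters into single code points so isalnum() sees
# "é" as one letter rather than "e" plus a combining accent.
text = unicodedata.normalize("NFC", text)

# Split on whitespace, then keep the words that contain at least one
# character that is neither a letter nor a digit.
words = text.split()
flagged = [w for w in words if not w.isalnum()]

print(flagged)
# ['¿de', '«orden', '§Don', '·II·', 'que·l', 'Rey»']
```

Splitting on whitespace leaves ordinary punctuation such as a trailing 
comma attached to the word, so real text would probably need that 
stripped first.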

> I'm using some modules from a package called NLTK but I think my
> problem is related to some misunderstanding of how regular expressions
> work.

The first thing to do is to isolate the cause of the problem. In your 
code below, you do four different things. In no particular order:

1 open and read an input file;
2 open and write an output file;
3 create a mysterious "RegexpTokenizer" object, whatever that is;
4 tokenize the input.

We can't run your code because:

1 we don't have access to your input file;
2 most of us don't have the NLTK package;
3 we don't know what RegexpTokenizer does;
4 we don't know what tokenizing does.

That makes it hard to solve the problem for you, although I'm willing to 
make a few wild guesses (see below).

The most important debugging skill you can learn is to narrow the 
problem down to the smallest possible piece of code that gives you the 
wrong answer. This will help you solve the problem yourself, and it will 
also help others help you. Can you demonstrate the problem in a couple 
of lines of code that don't rely on external files, packages, or other 
code we don't have?
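
For instance, a self-contained test case might look like the sketch 
below. It assumes Python 3 and uses re.findall from the standard library 
in place of the tokenizer; whether the tokenizer behaves the same way is 
an assumption, and the sample string is made up from the words in the 
question.

```python
import re

# Inline sample instead of an external file.
sample = "¿de «orden §Don ·II· que·l Rey»"

# The regex from the original code, written as a raw string.
pattern = re.compile(r"[^a-zA-Z\s0-9]+\w+\S")

matches = pattern.findall(sample)
print(matches)
# ['¿de', '«orden', '§Don', '·II·']
```

Anyone can paste this into an interpreter and compare what comes out 
with what was expected.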

> Here's what I do. This was just a first attempt to get strings
> starting with a non alpha-numeric symbol. If this had worked, I would
> have continued to build the regular expression to get words with non
> alpha-numeric symbols in the middle and in the end. Alas, even this
> first attempt didn't work.
> 
> ---------
> with open('output_tokens.txt', 'a') as out_tokens:
>     with open('text.txt', 'r') as in_tokens:
>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>         output = t.tokenize(in_tokens.read())
>         for item in output:
>             out_tokens.write(" %s" % (item))

Firstly, it's best practice to write regexes as "raw strings" by 
preceding them with an r. Instead of

'[^a-zA-Z\s0-9]+\w+\S'

you should write:

r'[^a-zA-Z\s0-9]+\w+\S'

Notice that the r is part of the delimiter (r' and ') and not the 
contents. This instructs Python to ignore the special meaning of 
backslashes. In this specific case, it won't make any difference, but it 
will make a big difference in other regexes.
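
A short illustration of a case where it does matter, assuming Python 3. 
In an ordinary string, '\b' is the single backspace character, while in 
a raw string it stays as a backslash followed by b, which the regex 
engine reads as a word boundary. The sample text is made up.

```python
import re

# One character (backspace) versus two characters (backslash, b).
print(len("\b"), len(r"\b"))
# 1 2

sample = "orden de Rey"

# Without the r prefix the pattern looks for literal backspaces.
plain = re.findall("\bde\b", sample)

# With the r prefix the regex engine sees the word-boundary escape.
raw = re.findall(r"\bde\b", sample)

print(plain, raw)
# [] ['de']
```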

Your regex says to match:

- one or more characters that aren't letters a...z (in either
   case), whitespace or any digit (note that this is *not* the same as
   characters that aren't alphanum);

- followed by one or more alphanum characters (or underscores);

- followed by exactly one character that is not whitespace.

I'm guessing the "not whitespace" is troublesome -- it will match 
characters like ¿ because it isn't whitespace.

I'd try this:

# untested
\b.*?\W.*?\b

which should match any word with a non-alphanumeric character in it:

- \b ... \b matches the start and end of the word;

- .*? matches zero or more characters (as few as possible);

- \W matches a single non-alphanumeric character.

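So putting it all together, that should match a word with at least one 
non-alphanumeric character in it. One caveat: both . and \W also match 
whitespace, so on running text the match can spill over from one word 
into the next. It is safer to split the text into words first and test 
each word separately.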

(Caution: if you try this, you *must* use a raw string. In an ordinary 
string, '\b' means the backspace character rather than a word boundary, 
so you will get completely wrong results.)
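
Here is a quick check of that pattern under Python 3, together with one 
possible alternative. The alternative, \S*[^\w\s]\S*, is not from the 
original code: it is only a sketch that treats a "word" as a run of 
non-whitespace characters and requires at least one character in it 
that is neither a word character nor whitespace. The sample strings are 
made up.

```python
import re

# Because . and \W can both match a space, the pattern above can run
# from one word into the next instead of staying inside a single word.
spill = re.findall(r"\b.*?\W.*?\b", "Don ·II·")
print(spill)
# ['Don ·']

# Alternative sketch: a whitespace-delimited token that contains at
# least one character which is neither a word character nor whitespace.
token_with_symbol = re.compile(r"\S*[^\w\s]\S*")

sample = "¿de «orden §Don ·II· que·l Rey» y el rey"
hits = token_with_symbol.findall(sample)
print(hits)
# ['¿de', '«orden', '§Don', '·II·', 'que·l', 'Rey»']
```

Note that \w counts the underscore as a word character, and that 
ordinary punctuation attached to a word (a trailing comma, say) will 
also be picked up by the alternative.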

> What puzzles me is that I get some results that don't make much sense
> given the regular expression.

Well, I don't know how RegexpTokenizer is supposed to work, so anything I 
say will be guesswork :)

[...]
> If you notice, there are some words that have an accented character
> that get treated in a strange way: all the characters that don't have
> a tilde get deleted and the accented character behaves as if it were a
> non alpha-numeric symbol.

Your regex matches if the first character isn't whitespace, a digit, or 
one of the letters a-z or A-Z. Accented characters aren't in a-z or A-Z, 
and therefore will match.
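
This can be reproduced without any external files. The sketch below 
assumes Python 3, where \w matches accented letters in Unicode strings. 
The sample text is made up, and the non-ASCII characters are written as 
escapes so the example does not depend on the file's encoding. The 
class [^\w\s] is one possible replacement, not something taken from the 
original code.

```python
import re

# "los días pasados", with the accented i written as an escape.
text = "los d\u00edas pasados"

# The explicit a-zA-Z range does not cover the accented letter, so the
# negated class treats it like a symbol and the match starts there.
original = re.findall(r"[^a-zA-Z\s0-9]+\w+\S", text)
print(original)
# ['ías']

# [^\w\s] excludes every letter that \w knows about, accented or not,
# so only the guillemets are reported as symbols here.
quoted = "los d\u00edas \u00abpasados\u00bb"
symbols = re.findall(r"[^\w\s]", quoted)
print(symbols)
# ['«', '»']
```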

-- 
Steven