[Tutor] A regular expression problem

Sun Nov 28 17:27:45 CET 2010

I'm trying to use regular expressions to extract strings that match
certain patterns in a collection of texts. Basically these texts are
edited versions of medieval manuscripts that use certain symbols to
mark information that is useful for philologists.

I'm interested in isolating words that have some non-alphanumeric
symbol attached to the beginning or the end of the word, or inserted
inside it. Here are some examples:

'¿de', '«orden', '§Don', '·II·', 'que·l', 'Rey»'
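
To make the target concrete, here is a rough sketch of the selection
rule written without any regular expression. It assumes Python 3
strings and whitespace-separated tokens; `has_symbol` and the sample
sentence are made up for this illustration and are not part of NLTK.

```python
def has_symbol(token):
    """Return True if the token mixes letters/digits with some other symbol."""
    has_alnum = any(ch.isalnum() for ch in token)
    has_other = any(not ch.isalnum() for ch in token)
    return has_alnum and has_other

# Hypothetical sample built from the example words above plus two plain words.
sample = "¿de la «orden del §Don ·II· que·l Rey»"
marked = [tok for tok in sample.split() if has_symbol(tok)]
print(marked)
```

Plain words such as 'la' and 'del' are left out, while an accented
word such as 'Félix' counts as plain because `str.isalnum()` accepts
accented letters.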

I'm using some modules from a package called NLTK but I think my
problem is related to some misunderstanding of how regular expressions
work.

Here's what I do. This was just a first attempt to get strings
starting with a non-alphanumeric symbol. If this had worked, I would
have continued to build the regular expression to get words with
non-alphanumeric symbols in the middle and at the end. Alas, even this
first attempt didn't work.

---------
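from nltk.tokenize import RegexpTokenizer

with open('output_tokens.txt', 'a') as out_tokens:
    with open('text.txt', 'r') as in_tokens:
        # one or more symbols, then word characters, then one non-space
        t = RegexpTokenizer(r'[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        # append every match to the output file, separated by spaces
        for item in output:
            out_tokens.write(" %s" % item)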

--------
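
For what it's worth, this is how I read the pattern piece by piece,
tried with the plain `re` module on a few of the example words. This
is only a sketch: it assumes Python 3 strings, and it assumes that
RegexpTokenizer with its default settings returns the pattern's
matches much like `re.findall` does.

```python
import re

pattern = re.compile(r'[^a-zA-Z\s0-9]+\w+\S')
# [^a-zA-Z\s0-9]+  one or more characters that are not ASCII letters,
#                  digits or whitespace
# \w+              one or more word characters
# \S               exactly one more non-whitespace character

matches = pattern.findall('§Don ·II· «orden')
print(matches)

# The three pieces need at least three characters between them,
# so a two-character string such as '¿a' cannot match.
short = pattern.findall('¿a')
print(short)
```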

What puzzles me is that I get some results that don't make much sense
given the regular expression. Here's an excerpt from the text I'm
processing:

---------------
"<filename=B-05-Libro_Oersino__14-214-2.txt>

%Pág. 87
&L-[LIBRO VII. DE OÉRSINO]&L+ &//
§Comeza el ·VII· libro, que es de Oérsino las bístias. &//
 §Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"
----------------

Here's the relevant part of the output file ('output_tokens.txt'):

----------
 " <filename= -05- _Oersino__14- -2. %Pág. &L- [LLIBRO ÉRSINO] &L+
§Comenza ·VII· ístias. §Canto élix ·II· ómnes"
-----------

Notice that words containing an accented character are treated
strangely: the characters before the accented one are dropped, and the
accented character behaves as if it were a non-alphanumeric symbol.
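
The same behaviour shows up in a small sketch with the plain `re`
module, assuming Python 3 and an already decoded string. The sample
line is shortened from the excerpt above.

```python
import re

pattern = re.compile(r'[^a-zA-Z\s0-9]+\w+\S')

# Shortened sample taken from the excerpt above.
line = 'Canto Félix ·II· hómnes'
tokens = pattern.findall(line)
print(tokens)
```

Here 'Félix' comes back as 'élix' and 'hómnes' as 'ómnes', while
'·II·' is found as expected. It may be relevant that 'é' and 'ó' lie
outside the ranges a-z, A-Z and 0-9 that the first character class
excludes, so a match seems to be able to start in the middle of a
word.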

What is going on? What am I doing wrong?

Josep M.