[Tutor] A regular expression problem
Evert Rol
evert.rol at gmail.com
Sun Nov 28 18:03:07 CET 2010
<snip intro>
> Here's what I do. This was just a first attempt to get strings
> starting with a non-alphanumeric symbol. If this had worked, I would
> have continued to build the regular expression to get words with
> non-alphanumeric symbols in the middle and at the end. Alas, even this
> first attempt didn't work.
>
> ---------
> with open('output_tokens.txt', 'a') as out_tokens:
>     with open('text.txt', 'r') as in_tokens:
>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>         output = t.tokenize(in_tokens.read())
>         for item in output:
>             out_tokens.write(" %s" % (item))
>
> --------
>
> What puzzles me is that I get some results that don't make much sense
> given the regular expression. Here's some excerpt from the text I'm
> processing:
>
> ---------------
> "<filename=B-05-Libro_Oersino__14-214-2.txt>
>
> %Pág. 87
> &L-[LIBRO VII. DE OÉRSINO]&L+ &//
> §Comeza el ·VII· libro, que es de Oérsino las bístias. &//
> §Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"
> ----------------
>
>
> Here's the relevant part of the output file ('output_tokens.txt'):
>
> ----------
> " <filename= -05- _Oersino__14- -2. %Pág. &L- [LLIBRO ÉRSINO] &L+
> §Comenza ·VII· ístias. §Canto élix ·II· ómnes"
> -----------
>
> If you notice, there are some words with an accented character that
> get treated in a strange way: all the characters without an accent
> get deleted, and the accented character behaves as if it were a
> non-alphanumeric symbol.
>
> What is going on? What am I doing wrong?
I don't know for sure, but I would hazard a guess that you didn't specify Unicode for the regular expression: character classes like \w and \s depend on your LOCALE settings.
A flag like re.UNICODE could help, but I don't know if RegexpTokenizer accepts that.
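If it does take a flags argument (I haven't checked; I'm assuming this is NLTK's RegexpTokenizer), something along these lines might do it (untested):

import re
from nltk.tokenize import RegexpTokenizer

# re.UNICODE makes \w and \s match accented letters as well; whether
# RegexpTokenizer actually accepts a flags argument is an assumption here
t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S', flags=re.UNICODE)

with open('text.txt', 'r') as in_tokens:
    # for the flag to matter, the text itself probably also has to be
    # unicode, so decode it first (assuming the file is UTF-8 encoded)
    output = t.tokenize(in_tokens.read().decode('utf-8'))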
It would also appear that you could get a long way with the builtin re.split function, and supply the flag inside that function; no need then for RegexpTokenizer. Your tokenizer just appears to split on the tokens you specify.
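Roughly like this, for example (an untested sketch; I've used re.findall rather than re.split, since the tokenizer seems to return the matching substrings rather than the text between them, but either way the flag goes straight into the call, and I'm assuming the file is UTF-8 encoded):

import re

with open('text.txt', 'r') as in_tokens:
    text = in_tokens.read().decode('utf-8')   # assumed UTF-8 input

# the flag is supplied directly in the call; \w now includes accented letters
tokens = re.findall('[^a-zA-Z\s0-9]+\w+\S', text, re.UNICODE)

with open('output_tokens.txt', 'a') as out_tokens:
    out_tokens.write(' '.join(tokens).encode('utf-8'))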
Lastly, an output convenience:
out_tokens.write(' '.join(list(output)))
instead of the for-loop.
(I'm casting output to a list here, since I don't know whether output is a list or an iterator.)
Let us know whether UNICODE (or other LOCALE settings) solves your problem.
Cheers,
Evert