[Tutor] A regular expression problem
Evert Rol
evert.rol at gmail.com
Sun Nov 28 18:03:07 CET 2010
<snip intro>
> Here's what I do. This was just a first attempt to get strings
> starting with a non-alphanumeric symbol. If this had worked, I would
> have continued to build the regular expression to get words with
> non-alphanumeric symbols in the middle and at the end. Alas, even this
> first attempt didn't work.
>
> ---------
> with open('output_tokens.txt', 'a') as out_tokens:
>     with open('text.txt', 'r') as in_tokens:
>         t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>         output = t.tokenize(in_tokens.read())
>         for item in output:
>             out_tokens.write(" %s" % (item))
>
> --------
>
> What puzzles me is that I get some results that don't make much sense
> given the regular expression. Here's some excerpt from the text I'm
> processing:
>
> ---------------
> "<filename=B-05-Libro_Oersino__14-214-2.txt>
>
> %Pág. 87
> &L-[LIBRO VII. DE OÉRSINO]&L+ &//
> §Comeza el ·VII· libro, que es de Oérsino las bístias. &//
> §Canto Félix ha tomado prenda del phisoloffo, el […] ·II· hómnes, e ellos"
> ----------------
>
>
> Here's the relevant part of the output file ('output_tokens.txt'):
>
> ----------
> " <filename= -05- _Oersino__14- -2. %Pág. &L- [LLIBRO ÉRSINO] &L+
> §Comenza ·VII· ístias. §Canto élix ·II· ómnes"
> -----------
>
> If you notice, there are some words with an accented character that
> get treated in a strange way: all the characters without an accent
> get deleted, and the accented character behaves as if it were a
> non-alphanumeric symbol.
>
> What is going on? What am I doing wrong?
I don't know for sure, but I would hazard a guess that you didn't specify Unicode for the regular expression: character classes like \w and \s depend on your LOCALE settings.
A flag like re.UNICODE could help, but I don't know if RegexpTokenizer accepts that.
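If it does take a flags argument (I haven't checked; I'm assuming this is NLTK's RegexpTokenizer), something along these lines might do it (untested):

import re
from nltk.tokenize import RegexpTokenizer

# re.UNICODE makes \w and \s match accented letters as well; whether
# RegexpTokenizer actually accepts a flags argument is an assumption here
t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S', flags=re.UNICODE)

with open('text.txt', 'r') as in_tokens:
    # for the flag to matter, the text itself probably also has to be
    # unicode, so decode it first (assuming the file is UTF-8 encoded)
    output = t.tokenize(in_tokens.read().decode('utf-8'))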
It would also appear that you could get a long way with the builtin re.split function, and supply the flag inside that function; no need then for RegexpTokenizer. Your tokenizer just appears to split on the tokens you specify.
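Roughly like this, for example (an untested sketch; I've used re.findall rather than re.split, since the tokenizer seems to return the matching substrings rather than the text between them, but either way the flag goes straight into the call, and I'm assuming the file is UTF-8 encoded):

import re

with open('text.txt', 'r') as in_tokens:
    text = in_tokens.read().decode('utf-8')   # assumed UTF-8 input

# the flag is supplied directly in the call; \w now includes accented letters
tokens = re.findall('[^a-zA-Z\s0-9]+\w+\S', text, re.UNICODE)

with open('output_tokens.txt', 'a') as out_tokens:
    out_tokens.write(' '.join(tokens).encode('utf-8'))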
Lastly, an output convenience:
out_tokens.write(' '.join(list(output)))
instead of the for-loop.
(I'm casting output to a list here, since I don't know whether output is a list or an iterator.)
Let us know whether UNICODE (or other LOCALE settings) solves your problem.
Cheers,
Evert