[Tutor] A regular expression problem

Josep M. Fontana josep.m.fontana at gmail.com
Tue Nov 30 14:47:20 CET 2010


Sorry, something went wrong and my message got sent before I could
finish it. I'll try again.

On Tue, Nov 30, 2010 at 2:19 PM, Josep M. Fontana
<josep.m.fontana at gmail.com> wrote:
> On Sun, Nov 28, 2010 at 6:03 PM, Evert Rol <evert.rol at gmail.com> wrote:
> <snip intro>
 <snip>
>> ---------
>> with open('output_tokens.txt', 'a') as out_tokens:
>>    with open('text.txt', 'r') as in_tokens:
>>        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
>>        output = t.tokenize(in_tokens.read())
>>        for item in output:
>>            out_tokens.write(" %s" % (item))
>
> I don't know for sure, but I would hazard a guess that you didn't specify unicode for the regular expression: character classes like \w and \s are dependent on your LOCALE settings.
> A flag like re.UNICODE could help, but I don't know if Regexptokenizer accepts that.

OK, this must be the problem. The text is in ISO-8859-1, not in
Unicode. I tried to fix the problem by doing the following:

-------------
import codecs
[...]
with codecs.open('output_tokens.txt', 'a', encoding='iso-8859-1') as out_tokens:
    with codecs.open('text.txt', 'r', encoding='iso-8859-1') as in_tokens:
        t = RegexpTokenizer('[^a-zA-Z\s0-9]+\w+\S')
        output = t.tokenize(in_tokens.read())
        for item in output:
            out_tokens.write(" %s" % (item))

-------------------

Specifying that the encoding is 'iso-8859-1' didn't do anything,
though. The output I get is still the same.
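
Just to make the suggestion about re.UNICODE concrete, this is roughly
how I understand it, using the plain re module instead of
RegexpTokenizer (I don't know whether RegexpTokenizer accepts a flags
argument, so take this as an untested sketch; the file names are the
ones from my script above):

-------------
import codecs
import re

with codecs.open('text.txt', 'r', encoding='iso-8859-1') as in_tokens:
    text = in_tokens.read()   # text is now a unicode object

# With re.UNICODE, \w and \s also match accented letters, so a word
# like 'años' is no longer broken apart at the first non-ASCII letter.
pattern = re.compile(r'[^a-zA-Z\s0-9]+\w+\S', re.UNICODE)

with codecs.open('output_tokens.txt', 'a', encoding='iso-8859-1') as out_tokens:
    for item in pattern.findall(text):
        out_tokens.write(u" %s" % item)
-------------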

>> It would also appear that you could get a long way with the builtin re.split function, and supply the flag inside that function; no need then for RegexpTokenizer. Your tokenizer just appears to split on the tokens you specify.

Yes, this is in fact what RegexpTokenizer seems to do. Here's what the
docstring of the class says:

"""
    A tokenizer that splits a string into substrings using a regular
    expression.  The regular expression can be specified to match
    either tokens or separators between tokens.

    Unlike C{re.findall()} and C{re.split()}, C{RegexpTokenizer} does
    not treat regular expressions that contain grouping parenthases
    specially.
    """

source:
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/tokenize/regexp.py?r=8539
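
For what it's worth, here is a small illustration of the two modes the
docstring describes, based on my reading of the source linked above
(the 'gaps' argument appears to be the switch between them; I haven't
tested this against the exact version I have installed):

-------------
from nltk.tokenize import RegexpTokenizer

text = "some words, separated by spaces"

# Pattern describes the tokens themselves: runs of word characters.
t1 = RegexpTokenizer(r'\w+')
print t1.tokenize(text)    # ['some', 'words', 'separated', 'by', 'spaces']

# Pattern describes the separators; the material in between is returned.
t2 = RegexpTokenizer(r'\s+', gaps=True)
print t2.tokenize(text)    # ['some', 'words,', 'separated', 'by', 'spaces']
-------------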

Since I'm using the NLTK package and this module seemed to do what I
needed, I thought I might as well use it. I thought (and I still do)
that the problem I was having didn't have to do with the correct use
of this module but with the way I constructed the regular expression.
I wouldn't have asked the question here if I thought the problem had
to do with this module.

If I understand correctly how re.split works, though, I don't think
I would obtain the results I want.

re.split would give me a list of the strings that occur around the
pattern I specify as its first argument, right? But what I want is to
match all the words that contain some non-alphanumeric character in
them and exclude the rest of the words. Since these words are
surrounded by spaces or line returns or a combination thereof, just
like the other "normal" words, I can't think of any pattern I could
use in re.split() that would discriminate between the two types of
strings. So I don't know how I would do what I want with re.split.
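
Something along the following lines (with re.findall rather than
re.split) is closer to what I have in mind, if I understand the two
functions correctly: match whole whitespace-delimited chunks that
contain at least one character which is neither a word character nor
whitespace (again just a sketch, with made-up sample text):

-------------
import re

text = u"normal words and <odd> ones like can't or pre-1800"
# \S* on both sides grabs the rest of the chunk around the offending
# character; [^\w\s] is the character that is neither \w nor \s.
pattern = re.compile(r'\S*[^\w\s]\S*', re.UNICODE)
print pattern.findall(text)
# [u'<odd>', u"can't", u'pre-1800']
-------------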

Josep M.

