[Tutor] best way to tokenize [was script too slow]

Paul Tremblay phthenry@earthlink.net
Mon Feb 24 19:05:16 2003


Jeff Shannon wrote:


>I'm also not entirely convinced that regular expressions are the best 
>choice for lexing (breaking the file up into tokens).  They're almost a 
>default solution in Perl, but that doesn't mean they're the most 
>efficient method in Python.  I'm not completely certain what the 
>requirements for that process are, though, so I can't speak any more 
>specifically other than to suggest considering other options.

You actually guessed my next question. I avoided regular expressions in
the original Perl script in all but this one place, precisely because I
knew I might convert the script to Python, and I knew that regular
expressions could be inefficient.

However, I don't know if there is a better way to split a line of RTF.

Here is a line of RTF that exhibits each of the main types of tokens:

text \par \\ \{ \} {}

Broken into tokens:

['text', '\par', '\\', '\{', '\}', '{',   '}']

There are 7 types of tokens:

1. text

2. control word: a backslash followed by one or more characters. A
space, backslash, or open or closed bracket ends the control word.

3. escaped backslash

4. escaped open bracket

5. escaped closed bracket

6. open bracket

7. closed bracket

Here is the code I use to tokenize a line:

self.splitexp = re.compile(r"(\\[\\{}]|{|}|\\[^\s\\{}&]+(?:\s)?)")
tokens = re.split(self.splitexp, line)
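
For reference, a self-contained sketch of what that split returns on
the sample line. The names SPLIT_EXP and tokenize_with_regex are
hypothetical, chosen only for illustration; the pattern itself is the
one above. Because the whole pattern sits inside a capturing group,
re.split keeps the delimiters in the result. It also returns empty
strings between adjacent delimiters, which this sketch drops.

```python
import re

# Same pattern as above. The outer capturing group makes re.split
# return the matched delimiters along with the text between them.
SPLIT_EXP = re.compile(r"(\\[\\{}]|{|}|\\[^\s\\{}&]+(?:\s)?)")


def tokenize_with_regex(line):
    """Split one line of RTF, dropping the empty strings that
    re.split produces between adjacent delimiters."""
    return [tok for tok in SPLIT_EXP.split(line) if tok]


sample = r"text \par \\ \{ \} {}"
print(tokenize_with_regex(sample))
```

Note that the text pieces keep their whitespace ('text ' rather than
'text'), the spaces between escapes come back as separate pieces, and
a control word keeps one trailing space when it has one.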

Is there any way to split this line *without* using regular expressions?

I know how to use string.split("exp"), but I don't know how to preserve
the "exp" as a token. Once I know how to split and save the tokens, I
imagine I can split the line into lists, then split the lists into
lists, and so on--even though I'm vague on how to do this.
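
One possible way to keep the separator, sketched under some
assumptions: str.partition (available from Python 2.5 on) returns the
text before the separator, the separator itself, and the rest, so a
loop over it can save every piece. The helpers split_keep and
split_keep_many are hypothetical names, not library functions. The
second helper does the "split the lists into lists" step by applying
one separator after another. It only covers the fixed tokens (escapes
and brackets), not control words, and the escaped forms have to come
before the bare brackets in the separator list.

```python
def split_keep(text, sep):
    """Split text on sep, keeping each occurrence of sep as an item."""
    pieces = []
    while True:
        head, found, text = text.partition(sep)
        if head:
            pieces.append(head)
        if not found:          # no more separators in the remainder
            break
        pieces.append(found)
    return pieces


def split_keep_many(text, seps):
    """Apply split_keep once per separator, in the order given.

    Pieces that are already separator tokens are passed through
    untouched, so '\\{' is not split again on '{'.
    """
    pieces = [text]
    done = set()
    for sep in seps:
        next_pieces = []
        for piece in pieces:
            if piece in done:
                next_pieces.append(piece)
            else:
                next_pieces.extend(split_keep(piece, sep))
        done.add(sep)
        pieces = next_pieces
    return pieces


# Escaped forms first, then the bare brackets.
FIXED_TOKENS = ["\\\\", "\\{", "\\}", "{", "}"]
print(split_keep_many(r"a \{ b {c}", FIXED_TOKENS))
```

Whether this is faster than a single compiled regular expression is
not something the sketch settles; it makes one pass over the pieces
per separator, so it would need to be timed.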

But I'm not sure if this would be faster. Also, I don't know how to get
around using a regular expression for the control words. A control word
can be any length, and can take multiple forms:

'\pard ' => '\pard '
'\par\pard' => '\par', '\pard'
'\par\pard ' => '\par', '\pard '
'\par}' => '\par', '}'
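
A control word can also be picked out without a regular expression by
scanning the line one character at a time. The following is a minimal
sketch, not a tested RTF lexer: the function names are hypothetical,
and the rules are an assumption based on the pattern and the examples
in this message (a control word ends at whitespace, a backslash, a
bracket, or '&', and keeps one trailing whitespace character if there
is one).

```python
def _ends_control_word(ch):
    # Assumption: mirrors the character class [^\s\\{}&] in the pattern.
    return ch.isspace() or ch in "\\{}&"


def tokenize(line):
    """Split one line of RTF into tokens without using the re module."""
    tokens = []
    text_start = 0          # start of the pending run of plain text
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        end = i             # end == i means "no token starts here"
        if ch in "{}":
            end = i + 1     # open or closed bracket
        elif ch == "\\" and i + 1 < n:
            if line[i + 1] in "\\{}":
                end = i + 2  # escaped backslash or bracket
            else:
                j = i + 1
                while j < n and not _ends_control_word(line[j]):
                    j += 1
                if j > i + 1:  # control word has at least one character
                    if j < n and line[j].isspace():
                        j += 1  # keep one trailing whitespace character
                    end = j
        if end == i:
            i += 1          # ordinary text character
            continue
        if i > text_start:
            tokens.append(line[text_start:i])  # flush pending text
        tokens.append(line[i:end])
        text_start = i = end
    if n > text_start:
        tokens.append(line[text_start:])
    return tokens


for example in ("\\pard ", "\\par\\pard", "\\par\\pard ", "\\par}"):
    print(repr(example), "=>", tokenize(example))
```

A loop like this runs in interpreted Python for every character, so it
is not a given that it beats the compiled regular expression; timing
both on a real file would be the way to find out.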

Thanks for your help on using dictionaries. I believe your method may
save time, but I am waiting to learn how to use the profile module
before I can determine what is what regarding the problem I am having
with dictionaries.

Paul


-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************