[Tutor] best way to tokenize [was script too slow]
Paul Tremblay
phthenry@earthlink.net
Mon Feb 24 19:05:16 2003
Jeff Shannon wrote:
>I'm also not entirely convinced that regular expressions are the best
>choice for lexing (breaking the file up into tokens). They're almost a
>default solution in Perl, but that doesn't mean they're the most
>efficient method in Python. I'm not completely certain what the
>requirements for that process are, though, so I can't speak any more
>specifically other than to suggest considering other options.
You actually guessed my next question. I avoided regular expressions in
the original Perl script in all but this one place precisely because I
knew I might convert the script to Python, and I knew that regular
expressions could be inefficient.
However, I don't know if there is a better way to split a line of RTF.
Here is a line of RTF that exhibits each of the main types of token:
text \par \\ \{ \} {}
Broken into tokens:
['text', '\par', '\\', '\{', '\}', '{', '}']
There are 7 types of tokens:
1. text
2. control word: a backslash followed by one or more characters. A
space, backslash, or open or closed bracket ends the word.
3. escaped backslash
4. escaped open bracket
5. escaped closed bracket
6. open bracket
7. closed bracket
Here is my line to tokenize:
self.splitexp = re.compile(r"(\\[\\{}]|{|}|\\[^\s\\{}&]+(?:\s)?)")
tokens = re.split(self.splitexp, line)
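
For reference, here is a minimal, self-contained sketch of the two lines
above, assuming Python 3. The module-level name SPLIT_EXP (standing in
for self.splitexp), the function name, and the filtering of empty
strings are assumptions added for illustration. Because the whole
pattern sits inside a capturing group, re.split returns the matched
delimiters along with the text between them:

```python
import re

# Same pattern as above; SPLIT_EXP is a hypothetical stand-in for self.splitexp.
SPLIT_EXP = re.compile(r"(\\[\\{}]|{|}|\\[^\s\\{}&]+(?:\s)?)")

def tokenize(line):
    # The capturing group makes split() keep the matched delimiters.
    # Two delimiters in a row produce an empty string between them,
    # so empty strings are dropped here (an assumption, not in the original).
    return [tok for tok in SPLIT_EXP.split(line) if tok]

tokens = tokenize(r"text \par \\ \{ \} {}")
print(tokens)
```

Whitespace is preserved in the result: the text comes back as 'text '
and the control word as '\par ', since the pattern absorbs one trailing
whitespace character after a control word.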
Is there any way to split this line *without* using regular expressions?
I know how to use string.split("exp"), but I don't know how to preserve
the "exp" as a token. Once I know how to split and save the tokens, I
imagine I can split the line into lists, then split the lists into
lists, and so on--even though I'm vague on how to do this.
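
One way to split without regular expressions and still keep the
delimiters is to walk the line one character at a time. The following is
only a sketch, assuming Python 3; the function name tokenize_scan is
hypothetical, and it does not reproduce every detail of the pattern
above (for example, it ignores the special treatment of "&"). Whether it
is faster than the regular expression would have to be measured:

```python
def tokenize_scan(line):
    """Split one line of RTF into tokens without using the re module."""
    tokens = []
    i = 0
    n = len(line)
    while i < n:
        ch = line[i]
        if ch == "\\":
            if i + 1 < n and line[i + 1] in "\\{}":
                # Escaped backslash or bracket: always two characters.
                tokens.append(line[i:i + 2])
                i += 2
            else:
                # Control word: runs until whitespace, backslash, or bracket.
                j = i + 1
                while j < n and line[j] not in " \t\r\n\\{}":
                    j += 1
                # Keep one trailing whitespace character with the word.
                if j < n and line[j] in " \t\r\n":
                    j += 1
                tokens.append(line[i:j])
                i = j
        elif ch in "{}":
            # Unescaped open or closed bracket.
            tokens.append(ch)
            i += 1
        else:
            # Plain text: everything up to the next backslash or bracket.
            j = i
            while j < n and line[j] not in "\\{}":
                j += 1
            tokens.append(line[i:j])
            i = j
    return tokens

print(tokenize_scan(r"text \par \\ \{ \} {}"))
```

Every token is a slice of the original line, so joining the tokens back
together gives the line unchanged.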
But I'm not sure if this would be faster. Also, I don't know how to get
around using a regular expression for the control words. A control word
can be any length, and can take multiple forms:
'\pard ' => '\pard '
'\par\pard' => '\par', '\pard'
'\par\pard ' => '\par', '\pard '
'\par}' => '\par', '}'
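
The four cases above can be checked against the pattern with a small
script. This is a sketch assuming Python 3; the names SPLIT_EXP,
split_tokens, and CASES are hypothetical, and empty strings produced by
re.split are filtered out:

```python
import re

# Same pattern as in the tokenizing line; the name is a stand-in.
SPLIT_EXP = re.compile(r"(\\[\\{}]|{|}|\\[^\s\\{}&]+(?:\s)?)")

def split_tokens(line):
    # Drop the empty strings re.split yields around adjacent delimiters.
    return [tok for tok in SPLIT_EXP.split(line) if tok]

# Input line -> expected tokens, taken from the list of forms above.
CASES = {
    r"\pard ": ["\\pard "],
    r"\par\pard": ["\\par", "\\pard"],
    r"\par\pard ": ["\\par", "\\pard "],
    r"\par}": ["\\par", "}"],
}

for line, expected in CASES.items():
    result = split_tokens(line)
    status = "ok" if result == expected else "MISMATCH"
    print(status, repr(line), "=>", result)
```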
Thanks for your help on using dictionaries. I believe your method may
save time, but I am waiting to learn how to use the profile module
before I can pin down the problem I am having with dictionaries.
Paul
--
************************
*Paul Tremblay *
*phthenry@earthlink.net*
************************