[Tutor] best way to tokenize [was script too slow]
Paul Tremblay
phthenry@earthlink.net
Tue Feb 25 21:50:03 2003
Answering my own email. I'm still not totally sure of the line in
question, but I do realize that the '.join' is really: create the
string ' ' (a single space), and then call the method '.join' on that
string.
Okay, now I do see the whole thing. The list to join is built by first
splitting the token on "\\", which gets rid of the "\\", and then
adding the "\\" back to the front of each item.
That's kind of a nice one-liner to make tokens.
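A quick sanity check of that one-liner on a single sample token (the
word '\par\pard' here is just an illustration, not taken from the
original script):

```python
# Split the word on backslashes (this drops the backslashes and leaves
# an empty leading string), filter out the empty string, and put a
# backslash back on the front of each remaining piece before joining.
word = '\\par\\pard'
expandedWord = ' '.join(['\\' + item for item in word.split('\\') if item])
print(expandedWord)  # -> \par \pard
```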
Thanks
Paul
On Tue, Feb 25, 2003 at 09:31:24PM -0500, Paul Tremblay wrote:
>
> Thanks. Your method is very instructive on how to use recursion. It is
> not quite perfect, since a line of tokens can look like:
>
> \par}}{\par{\ect => '\par}}' (should be '\par', '}', '}')
>
> However, it ends up that your method takes just a bit longer than using
> regular expressions, so there is probably no use in trying to perfect
> it. I did have one question about this line:
>
>
> > expandedWord = ' '.join(['\\'+item for item in
> > word.split('\\') if item])
>
> I get this much from it:
>
> 1. first python splits the word by the "\\".
>
> 2. Then ??? It joins them somehow. I'm not sure what the .join is.
>
> Thanks
>
> Paul
>
>
>
> On Tue, Feb 25, 2003 at 03:43:25PM +1000, Alfred Milgrom wrote:
> >
> > At 07:04 PM 24/02/03 -0500, Paul Tremblay wrote:
> > >However, I don't know if there is a better way to split a line of RTF.
> > >Here is a line of RTF that exhibits each of the main type of tokens:
>
> [snip]
>
> > Hi Paul:
> >
> > I can't say whether regular expressions are the best way to tokenise your
> > RTF input, but here is an alternative recursive approach.
> >
> > Each line is split into words (using spaces as the separator), and then
> > recursively split into sub-tokens if appropriate.
> >
> > def splitWords(inputline):
> >     outputList = []
> >     for word in inputline.split(' '):
> >         if word.startswith('{') and word != '{':
> >             expandedWord = '{' + ' ' + word[1:]
> >         elif word.endswith('}') and word != '}' and word != '\\}':
> >             expandedWord = word[:-1] + ' ' + '}'
> >         elif '\\' in word and word != '\\':
> >             expandedWord = ' '.join(['\\' + item for item in
> >                 word.split('\\') if item])
> >         else:
> >             expandedWord = word
> >         if expandedWord != word:
> >             expandedWord = splitWords(expandedWord)
> >         outputList.append(expandedWord)
> >     return ' '.join(outputList)
> >
> > example1 = 'text \par \\ \{ \} {}'
> >
> > print splitWords(example1)
> > >>> text \par \ \{ \} { }
> > print splitWords(example1).split(' ')
> > >>> ['text', '\\par', '\\', '\\{', '\\}', '{', '}']
> >
> > Seven different tokens seem to be identified correctly.
> >
> > example2 = 'text \par\pard \par} \\ \{ \} {differenttext}'
> > print splitWords(example2)
> > >>> text \par \pard \par } \ \{ \} { differenttext }
> > print splitWords(example2).split(' ')
> > >>> ['text', '\\par', '\\pard', '\\par', '}', '\\', '\\{', '\\}', '{',
> > 'differenttext', '}']
> >
> > Haven't tested exhaustively, but this seems to do what you wanted it to do.
> > As I said, I don't know if this will end up being better than using re or
> > not, but it is an alternative approach.
> >
> > Best regards,
> > Fred
> >
> >
> > _______________________________________________
> > Tutor maillist - Tutor@python.org
> > http://mail.python.org/mailman/listinfo/tutor
>
--
************************
*Paul Tremblay *
*phthenry@earthlink.net*
************************