[Tutor] best way to tokenize [was script too slow]
Paul Tremblay
phthenry@earthlink.net
Tue Feb 25 21:50:03 2003
Answering my own email. I'm still not totally sure of the line in
question, but I do realize that the '.join' is really: create the
string ' ' (a single space), and then call the method '.join' on that
string.
Okay, now I do see the whole thing. The list to join is built by first
splitting the token on "\\", which gets rid of the "\\", and then
adding the "\\" back to the front of each item.
That's kind of a nice one-liner to make tokens.
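A quick sanity check of that one-liner on a single sample token (the
word '\par\pard' here is just an illustration, not taken from the
original script):

```python
# Split the word on backslashes (this drops the backslashes and leaves
# an empty leading string), filter out the empty string, and put a
# backslash back on the front of each remaining piece before joining.
word = '\\par\\pard'
expandedWord = ' '.join(['\\' + item for item in word.split('\\') if item])
print(expandedWord)  # -> \par \pard
```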
Thanks
Paul
On Tue, Feb 25, 2003 at 09:31:24PM -0500, Paul Tremblay wrote:
>
> Thanks. Your method is very instructive on how to use recursion. It is
> not quite perfect, since a line of tokens can look like:
>
> \par}}{\par{\ect => '\par}}' (should be '\par', '}', '}')
>
> However, it ends up that your method takes just a bit longer than using
> regular expressions, so there is probably no use in trying to perfect
> it. I did have one question about this line:
>
>
> > expandedWord = ' '.join(['\\'+item for item in
> > word.split('\\') if item])
>
> I get this much from it:
>
> 1. first python splits the word by the "\\".
>
> 2. Then ??? It joins them somehow. I'm not sure what the .join is.
>
> Thanks
>
> Paul
>
>
>
> On Tue, Feb 25, 2003 at 03:43:25PM +1000, Alfred Milgrom wrote:
> >
> > At 07:04 PM 24/02/03 -0500, Paul Tremblay wrote:
> > >However, I don't know if there is a better way to split a line of RTF.
> > >Here is a line of RTF that exhibits each of the main type of tokens:
>
> [snip]
>
> > Hi Paul:
> >
> > I can't say whether regular expressions are the best way to tokenise your
> > RTF input, but here is an alternative recursive approach.
> >
> > Each line is split into words (using spaces as the separator), and then
> > recursively split into sub-tokens if appropriate.
> >
> > def splitWords(inputline):
> >     outputList = []
> >     for word in inputline.split(' '):
> >         if word.startswith('{') and word != '{':
> >             expandedWord = '{' + ' ' + word[1:]
> >         elif word.endswith('}') and word != '}' and word != '\\}':
> >             expandedWord = word[:-1] + ' ' + '}'
> >         elif '\\' in word and word != '\\':
> >             expandedWord = ' '.join(['\\' + item for item in
> >                 word.split('\\') if item])
> >         else:
> >             expandedWord = word
> >         if expandedWord != word:
> >             expandedWord = splitWords(expandedWord)
> >         outputList.append(expandedWord)
> >     return ' '.join(outputList)
> >
> > example1 = 'text \par \\ \{ \} {}'
> >
> > print splitWords(example1)
> > >>> text \par \ \{ \} { }
> > print splitWords(example1).split(' ')
> > >>> ['text', '\\par', '\\', '\\{', '\\}', '{', '}']
> >
> > Seven different tokens seem to be identified correctly.
> >
> > example2 = 'text \par\pard \par} \\ \{ \} {differenttext}'
> > print splitWords(example2)
> > >>> text \par \pard \par } \ \{ \} { differenttext }
> > print splitWords(example2).split(' ')
> > >>> ['text', '\\par', '\\pard', '\\par', '}', '\\', '\\{', '\\}', '{',
> > 'differenttext', '}']
> >
> > Haven't tested exhaustively, but this seems to do what you wanted it to do.
> > As I said, I don't know if this will end up being better than using re or
> > not, but it is an alternative approach.
> >
> > Best regards,
> > Fred
> >
> >
> > _______________________________________________
> > Tutor maillist - Tutor@python.org
> > http://mail.python.org/mailman/listinfo/tutor
>
--
************************
*Paul Tremblay *
*phthenry@earthlink.net*
************************