[Tutor] join - was best way to tokenize [was script too slow]

Michael Janssen Janssen@rz.uni-frankfurt.de
Wed Feb 26 11:08:02 2003


On Tue, 25 Feb 2003, Paul Tremblay wrote:

> Answering my own email. I'm still not totally sure of the line in
> question, but I do realize that the '.join' is really: create a string
> called "' '", and then use the method '.join' on that string.

"join" here is the string method "join". It is equivalent to the string
module's function string.join(): since a recent version of Python (2.0?),
every (?) function from the string module has also been made available as a
string method.

But it would have been better if join had skipped this evolution :-( Because:

>>> string.join('qwert', ' ')
'q w e r t'

is (might be ;-) what you expected: 'qwert' is the string to work on and
' ' is the string used to manipulate 'qwert'.

The string-method toggles this:

>>> 'qwert'.join(' ')  # wrong in most cases
' '
>>> ' '.join('qwert')
'q w e r t'

---> instead of calling a method on the string you are working on, you
create an "auxiliary" string and pass your "working string" as a
parameter. There may be good reasons for this, but it is not Python-like
syntax that reads as naturally as plain language.

Once you know that ' '.join() behaves a little differently, it's really fun
to use it ;-)
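To make the method form concrete, here is a small sketch (Python 3 syntax, so only the string method is shown; the string-module form is Python 2 only). The separator usually joins a list of strings; joining a bare string iterates over its characters, which is the ' '.join('qwert') case above:

```python
# The separator goes on the left; the sequence to join is the argument.
words = ['spam', 'eggs', 'ham']
print(', '.join(words))      # -> spam, eggs, ham

# A string is itself a sequence (of characters), so joining one
# puts the separator between every letter:
print(' '.join('qwert'))     # -> q w e r t
```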

Michael

>
> Okay, now I do see the whole thing. The list to join is built by first
> splitting the token on "\\" (which removes the "\\") and then adding the
> "\\" back to each item.
>
> That's kind of a nice one liner to make tokens.
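The one-liner Paul describes can be sketched on its own (Python 3 syntax; the sample word here is made up for illustration): split on the backslash, drop the empty leading piece, and put the backslash back on each token:

```python
# A made-up run of RTF control words with no spaces between them:
word = r'\par\pard\plain'

# split('\\') removes the backslashes (and yields a leading empty
# string); the 'if item' filter drops that empty piece, and '\\'+item
# restores the backslash on each token.
tokens = ['\\' + item for item in word.split('\\') if item]
print(tokens)            # -> ['\\par', '\\pard', '\\plain']
print(' '.join(tokens))  # -> \par \pard \plain
```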
>
> Thanks
>
> Paul
>
> On Tue, Feb 25, 2003 at 09:31:24PM -0500, Paul Tremblay wrote:
> >
> > Thanks. Your method is very instructive on how to use recursion. It is
> > not quite perfect, since a line of tokens can look like:
> >
> > \par}}{\par{\ect => '\par}}' (should be '\par', '}', '}')
> >
> > However, it ends up that your method takes just a bit longer than using
> > regular expressions, so there is probably no use in trying to perfect
> > it. I did have one question about this line:
> >
> >
> > >             expandedWord = ' '.join(['\\'+item for item in word.split('\\') if item])
> >
> > I get this much from it:
> >
> > 1. first python splits the word by the "\\".
> >
> > 2. Then ??? It joins them somehow. I'm not sure what the .join is.
> >
> > Thanks
> >
> > Paul
> >
> >
> >
> > On Tue, Feb 25, 2003 at 03:43:25PM +1000, Alfred Milgrom wrote:
> > >
> > > At 07:04 PM 24/02/03 -0500, Paul Tremblay wrote:
> > > >However, I don't know if there is a better way to split a line of RTF.
> > > >Here is a line of RTF that exhibits each of the main type of tokens:
> >
> > [snip]
> >
> > > Hi Paul:
> > >
> > > I can't say whether regular expressions are the best way to tokenise your
> > > RTF input, but here is an alternative recursive approach.
> > >
> > > Each line is split into words (using spaces as the separator), and then
> > > recursively split into sub-tokens if appropriate.
> > >
> > > def splitWords(inputline):
> > >     outputList = []
> > >     for word in inputline.split(' '):
> > >         if word.startswith('{') and word != '{':
> > >             expandedWord = '{' + ' ' + word[1:]
> > >         elif word.endswith('}') and word != '}' and word != '\\}':
> > >             expandedWord = word[:-1] + ' ' + '}'
> > >         elif '\\' in word and word != '\\':
> > >             expandedWord = ' '.join(['\\'+item for item in word.split('\\') if item])
> > >         else:
> > >             expandedWord = word
> > >         if expandedWord != word:
> > >             expandedWord = splitWords(expandedWord)
> > >         outputList.append(expandedWord)
> > >     return ' '.join(outputList)
> > >
> > > example1 = 'text \par \\ \{ \} {}'
> > >
> > > print splitWords(example1)
> > > >>> text \par \ \{ \} { }
> > > print splitWords(example1).split(' ')
> > > >>> ['text', '\\par', '\\', '\\{', '\\}', '{', '}']
> > >
> > > Seven different tokens seem to be identified correctly.
> > >
> > > example2 = 'text \par\pard \par} \\ \{ \} {differenttext}'
> > > print splitWords(example2)
> > > >>> text \par \pard \par } \ \{ \} { differenttext }
> > > print splitWords(example2).split(' ')
> > > >>> ['text', '\\par', '\\pard', '\\par', '}', '\\', '\\{', '\\}', '{',
> > > 'differenttext', '}']
> > >
> > > Haven't tested exhaustively, but this seems to do what you wanted it to do.
> > > As I said, I don't know if this will end up being better than using re or
> > > not, but it is an alternative approach.
> > >
> > > Best regards,
> > > Fred
> > >
> > >
> > > _______________________________________________
> > > Tutor maillist  -  Tutor@python.org
> > > http://mail.python.org/mailman/listinfo/tutor
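For anyone trying Fred's function today, here is a Python 3 rendering (assumed equivalent: same logic, but print() is a function and the backslashes in the test string are escaped explicitly rather than relying on lenient Python 2 string literals):

```python
def splitWords(inputline):
    """Recursively split a line of RTF-like text into tokens."""
    outputList = []
    for word in inputline.split(' '):
        if word.startswith('{') and word != '{':
            # Detach a leading brace from the rest of the word.
            expandedWord = '{' + ' ' + word[1:]
        elif word.endswith('}') and word != '}' and word != '\\}':
            # Detach a trailing brace.
            expandedWord = word[:-1] + ' ' + '}'
        elif '\\' in word and word != '\\':
            # Split runs of control words and restore the backslashes.
            expandedWord = ' '.join(
                '\\' + item for item in word.split('\\') if item)
        else:
            expandedWord = word
        if expandedWord != word:
            # Something was split off, so recurse on the expanded form.
            expandedWord = splitWords(expandedWord)
        outputList.append(expandedWord)
    return ' '.join(outputList)

example1 = 'text \\par \\ \\{ \\} {}'
print(splitWords(example1))
# -> text \par \ \{ \} { }
```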
> >
> > --
> >
> > ************************
> > *Paul Tremblay         *
> > *phthenry@earthlink.net*
> > ************************
> >