best split tokens?

Nick Vatamaniuc vatamane at gmail.com
Sat Sep 9 02:16:57 EDT 2006


As was suggested, it depends on the language, and it also depends on
how a token is defined. Can it contain dashes, underscores, digits and
so on? That choice also determines what counts as whitespace. The
two main ways of doing the splitting are then either to cut on
whitespace (specify the whitespace explicitly) or to pick out runs of
valid token symbols uninterrupted by any whitespace (specify the valid
symbols explicitly).
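
To make the two methods concrete, here is a rough sketch of each. The
function names and the choice of "letters plus apostrophe" as the valid
token symbols are only assumptions for illustration; a real definition
depends on the language and on what you decide a token is.

```python
import re

# Assumption for this sketch: a token is a run of ASCII letters and
# apostrophes; everything else is treated as "whitespace".
SEPARATORS = re.compile(r"[^a-zA-Z']+")
TOKEN = re.compile(r"[a-zA-Z']+")


def split_on_separators(text):
    """Method 1: specify the separators explicitly and cut on them."""
    # split() yields empty strings when the text starts or ends with a
    # separator, so those are filtered out.
    return [tok for tok in SEPARATORS.split(text) if tok]


def pick_tokens(text):
    """Method 2: specify the valid symbols explicitly and pick out runs."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    s = "He was wont to be alarmed/amused by answers that won't work"
    print(split_on_separators(s))
    print(pick_tokens(s))
```

With complementary character classes like these, both functions give the
same list; they start to differ once the token definition is more than a
single character class.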

Nick V.


Tim Chase wrote:
> >> 	rgx = re.compile(r'\W+')
> >>
> >> if you don't mind numbers included in your text (in the event you
> >> have things like "fatal1ty", "thing2", or "pdf2txt"), which is
> >> often the case...they should be considered part of the word.
> >>
> >> If that's a problem, you should be able to use
> >>
> >> 	rgx = re.compile('[^a-zA-Z]+')
> >>
> >> This is a bit Euro-centric...
> >
> > I'd call it half-asscii :-)
>
> groan... :)
>
> Given the link you provided, I correct my statement to
> "Anglo-centric", as there are clearly oddball cases in languages
> such as French.
>
> > textbox = "He was wont to be alarmed/amused by answers that won't work"
>
> Well, one could do something like
>
>  >>> s
> "He was wont to be alarmed/amused by answers that won't work"
>  >>> s2
> "The two-faced liar--a real joker--can't tell the truth"
>  >>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
>  >>> r.findall(s), r.findall(s2)
> (['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
> 'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
> 'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])
>
>
> which parses your example the way I would want it to be parsed,
> and handles the strange string I came up with to try similar
> examples the way I would expect that it would be broken down by
> "words"...
>
> I had a hard time comin' up with any words I'd want to call
> "words" where the additional non-word glyph (apostrophe, dash,
> etc) wasn't 'round the middle of the word. :)
> 
> Any more crazy examples? :)
> 
> -tkc
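
For words where the apostrophe is not in the middle, the quoted pattern
can be tried directly. This is only a quick check, with sample strings
and a helper name made up for the purpose: because the pattern requires
a letter on both sides of the dash or apostrophe, a leading or trailing
apostrophe is simply left out of the token.

```python
import re

# Pattern copied from the quoted message: letters, with a dash or an
# apostrophe allowed only between two letters.
WORD = re.compile(r"(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")


def find_words(text):
    """Return the tokens the quoted pattern picks out of text."""
    return WORD.findall(text)


if __name__ == "__main__":
    # Hypothetical test strings, not from the thread.
    for sample in ("'Twas a rock 'n' roll night", "I was comin' home"):
        print(find_words(sample))
```

Whether "comin" or "comin'" is the right token is again a matter of how
a token is defined.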



