best split tokens?
Tim Chase
python.list at tim.thechases.com
Fri Sep 8 22:02:54 EDT 2006
>> rgx = re.compile(r'\W+')
>>
>> if you don't mind numbers included in your text (in the event you
>> have things like "fatal1ty", "thing2", or "pdf2txt") which is
>> often the case...they should be considered part of the word.
>>
>> If that's a problem, you should be able to use
>>
>> rgx = re.compile('[^a-zA-Z]+')
>>
>> This is a bit Euro-centric...
>
> I'd call it half-asscii :-)
groan... :)
Given the link you provided, I correct my statement to
"Anglo-centric", as there are clearly oddball cases in languages
such as French.
> textbox = "He was wont to be alarmed/amused by answers that won't work"
Well, one could do something like
>>> s
"He was wont to be alarmed/amused by answers that won't work"
>>> s2
"The two-faced liar--a real joker--can't tell the truth"
>>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])
which parses your example the way I would want it to be parsed,
and breaks the strange string I came up with (to try similar
cases) into "words" the way I would expect...
I had a hard time comin' up with any words I'd want to call
"words" where the additional non-word glyph (apostrophe, dash,
etc) wasn't 'round the middle of the word. :)
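For anyone who wants to poke at this, here is a minimal sketch
comparing the split-based approach with the findall pattern above,
plus a few edge cases. POSTED is the pattern exactly as given;
JOINED is a hypothetical alternative (an assumption, not from the
thread, and only checked against these strings) that also allows
single-letter pieces between hyphens.

```python
import re

# Pattern as posted above: letters, with a hyphen or apostrophe
# allowed only when it sits between two letters.
POSTED = re.compile(r"(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")

# Hypothetical alternative (not from the thread): runs of letters
# joined by single hyphens or apostrophes.
JOINED = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)*")

s = "He was wont to be alarmed/amused by answers that won't work"

# Splitting on non-letters breaks the contraction apart.
split_tokens = re.split(r"[^a-zA-Z]+", s)
print(split_tokens[-3:])                  # ['won', 't', 'work']

# findall with the posted pattern keeps it whole.
print(POSTED.findall(s)[-2:])             # ["won't", 'work']

# Leading and trailing apostrophes are dropped.
print(POSTED.findall("comin' 'round"))    # ['comin', 'round']

# A single letter between hyphens trips up the posted pattern.
print(POSTED.findall("rock-n-roll"))      # ['rock-n', 'roll']
print(JOINED.findall("rock-n-roll"))      # ['rock-n-roll']
```

Both patterns are still ASCII-only, so the Anglo-centric caveat
applies to either one.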
Any more crazy examples? :)
-tkc