Extracting "true" words
nagle at animats.com
Sat Apr 2 06:04:55 CEST 2011
On 4/1/2011 4:10 PM, Chris Rebert wrote:
> On Fri, Apr 1, 2011 at 1:55 PM, candide<candide at free.invalid> wrote:
>> Back again with my study of regular expressions ;) There exists a special
>> character allowing alphanumeric extraction, the special character \w (BTW,
>> what the letter 'w' refers to?).
> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.
>> But this feature doesn't permit to extract
>> true words; by "true" I mean word composed only of _alphabetic_ letters (not
>> digit nor underscore).
> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
> And what of hyphenated terms (e.g. "re-lock")?
It's an interesting parsing problem to find word breaks in mixed
language text. It's quite common to find English and Japanese text
mixed. (See "http://www.dokidoki6.com/00_index1.html". Caution,
excessively cute.) Each ideograph is a "word", of course.
Parse this into words:
6%DOKIDOKI VISUAL FILE vol.4を公開しました。
More information about the Python-list