NLTK
mausg at mail.com
Wed Aug 8 17:03:58 EDT 2018
On 2018-08-07, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
>>In natural language, words are more complicated than just space-separated
>>units. Some languages don't use spaces as a word delimiter.
>
> Even above, the word »units« is neither directly preceded
> nor directly followed by a space.
>
> In the end, one can make an arbitrary choice about where one
> wants to place the border between syntax and morphology.
>
> For the case of English, I can define a word to be a
> sequence of letters (including the apostrophe), that is
> surrounded by non-letter characters.
>
>>Recognising open compound words is difficult. "Real estate" is an open
>>compound word, but "real cheese" and "my estate" are both two words.
>
> This is just a part of the more general problem to parse and
> interpret a sentence. It is not more difficult than the
> interpretation of other pairs of words in a sentence.
>
>>Another problem for English speakers is deciding whether to treat
>>contractions as a single word, or split them?
>>"don't" --> "do" "n't"
>>"they'll" --> "they" "'ll"
>
> They are a single word by my definition. But this is just
> the surface of the input. The input could be translated into
> a "deep-structure" intermediate language that then splits
> some source words into several units or joins some source
> words into a single unit.
>
>>Punctuation marks should either be stripped out of sentences before
>>splitting into words, or treated as distinct tokens. We don't want
>>"tokens" and "tokens." to be treated as distinct words, just because one
>>happened to fall at the end of a sentence and one didn't.
>
> Yes, but this is quite trivial compared to the problem
> of parsing and interpreting a natural-language sentence.
>
Thanks, all, for the replies. It seems that I do not really need NLTK;
split() will do me. Thanks again.
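
For reference, a minimal sketch of how plain split() compares with the
"letters including the apostrophe" definition quoted above. The sample
sentence, the ASCII-only pattern, and the names WORD_RE and words() are
assumptions made for illustration; none of this is NLTK code.

```python
import re

# Hypothetical sample sentence, chosen to show the trailing-punctuation issue.
text = "We don't want tokens and tokens. to differ."

# Plain whitespace splitting: punctuation stays attached to the word,
# so "tokens" and "tokens." come out as different strings.
print(text.split())

# A word as a run of letters, apostrophe included, per the quoted definition.
# Assumption: ASCII letters only, which is too narrow for many languages.
WORD_RE = re.compile(r"[A-Za-z']+")

def words(s):
    """Return the letter-and-apostrophe runs found in s."""
    return WORD_RE.findall(s)

print(words(text))
```

With split(), the sixth item keeps its full stop; with the pattern, both
occurrences come out as "tokens" and "don't" stays one word. If trailing
punctuation does not matter for the task, split() alone may well be enough.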
--
Maus at ireland.com
Will Rant For Food