NLTK
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Aug 6 20:41:21 EDT 2018
On Fri, 03 Aug 2018 07:49:40 +0000, mausg wrote:
> I like to analyse text. My method consisted of something like
> words=text.split(), which would split the text into space-separated
> units.
In natural language, words are more complicated than just space-separated
units. Some languages don't use spaces as a word delimiter. Some don't
use word delimiters at all. Even in English, we have *compound words*
which exist in three forms:
- open: "ice cream"
- closed: "notebook"
- hyphenated: "long-term"
Recognising open compound words is difficult. "Real estate" is an open
compound word, but "real cheese" and "my estate" are both two words.
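If you do have a list of such compounds, NLTK's MWETokenizer can merge
them back together after an initial split. A minimal sketch; the compound
list here is only illustrative, you would have to supply your own:

    from nltk.tokenize import MWETokenizer

    # Merge known open compounds after a naive split. The pairs below
    # are just examples; a real application needs its own list of
    # multi-word expressions.
    mwe = MWETokenizer([('ice', 'cream'), ('real', 'estate')], separator=' ')

    print(mwe.tokenize("I sold my real estate and bought ice cream".split()))
    # ['I', 'sold', 'my', 'real estate', 'and', 'bought', 'ice cream']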
Another problem for English speakers is deciding whether to treat
contractions as a single word or to split them:
"don't" --> "do" "n't"
"they'll" --> "they" "'ll"
Punctuation marks should either be stripped out of sentences before
splitting into words, or treated as distinct tokens. We don't want
"tokens" and "tokens." to be treated as distinct words, just because one
happened to fall at the end of a sentence and one didn't.
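Comparing the two approaches on a short example (same assumptions as
above) makes the difference obvious:

    from nltk.tokenize import word_tokenize

    sentence = "I like tokens. Tokens are useful."

    # str.split() leaves the full stop attached to the last word of
    # each sentence, so "tokens." looks like a new word.
    print(sentence.split())
    # ['I', 'like', 'tokens.', 'Tokens', 'are', 'useful.']

    # word_tokenize emits punctuation as separate tokens.
    print(word_tokenize(sentence))
    # ['I', 'like', 'tokens', '.', 'Tokens', 'are', 'useful', '.']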
> then I tried to use the Python NLTK library, which had a lot of
> features I wanted, but using `word-tokenize' gives a different
> answer.
>
> What gives?
I'm pretty sure the function isn't called "word-tokenize". That would
mean "word subtract tokenize" in Python code. Do you mean word_tokenize?
Have you compared the output of the two and looked at how they differ? If
there is too much output to compare by eye, you could convert to sets and
check the set difference.
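Something along these lines would do it; a rough sketch, with
"sample.txt" standing in for whatever text you are analysing:

    from nltk.tokenize import word_tokenize

    text = open("sample.txt").read()   # hypothetical input file

    split_words = set(text.split())
    nltk_words = set(word_tokenize(text))

    # Tokens produced by one method but not the other.
    print("only in str.split():    ", sorted(split_words - nltk_words))
    print("only in word_tokenize():", sorted(nltk_words - split_words))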
Or try reading the documentation for word_tokenize:
http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.treebank.TreebankWordTokenizer
--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson