[Tutor] NLTK

Kent Johnson kent37 at tds.net
Sat Aug 29 12:34:09 CEST 2009


On Fri, Aug 28, 2009 at 10:16 PM, Ishan Puri<ballerz4ishi at sbcglobal.net> wrote:

> >>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
> >>> len(emma)
> 192427
>
> So this is the number of words in a particular 'austen-emma.txt'. How would
> I do this with my IM50re.txt? It seems the code "nltk.corpus.gutenberg.words"
> is specific to some Gutenberg corpus installed with NLTK. Many examples like
> this are given for different analyses that can be done with NLTK. However,
> they all seem to be specific to one of the texts above or another one already
> installed with NLTK. I am not sure how to apply these examples to my own
> corpus.

This is pretty much the next line in the "Loading your own Corpus"
example. After
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r'C:\Users\Ishan\Documents'  # raw string: backslashes stay literal
>>> wordlists = PlaintextCorpusReader(corpus_root, 'IM50re.txt')
>>> wordlists.fileids()
['IM50re.txt']

you should be able to do
>>> my_words = wordlists.words('IM50re.txt')
>>> len(my_words)
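
Put together as a script, the same steps might look like the sketch below.
It is only an illustration: count_words is a hypothetical helper name, and
demo.txt is a throwaway file written to a temporary directory so the code
runs without IM50re.txt. It relies on the reader's default word tokenizer,
which needs no extra NLTK data downloads.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader


def count_words(corpus_root, filename):
    """Return (total tokens, distinct lowercased tokens) for one file.

    count_words is a hypothetical helper, not part of NLTK.
    """
    reader = PlaintextCorpusReader(corpus_root, [filename])
    words = reader.words(filename)
    return len(words), len(set(w.lower() for w in words))


# Stand-in for your own text file (assumption: any UTF-8 plain-text file).
demo_root = tempfile.mkdtemp()
with open(os.path.join(demo_root, "demo.txt"), "w", encoding="utf-8") as f:
    f.write("Emma Woodhouse, handsome, clever, and rich.")

total, distinct = count_words(demo_root, "demo.txt")
print(total, distinct)
```

For your own file, call count_words(r'C:\Users\Ishan\Documents', 'IM50re.txt')
instead. Note that words() counts punctuation marks as tokens too, so the
number will be somewhat higher than a word processor's word count.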

Kent

