[Edu-sig] Python for Natural Language Processing

Steven Bird sb at csse.unimelb.edu.au
Wed Oct 17 02:46:24 CEST 2007

NLTK-Lite version 0.9 has been released -- http://nltk.org/index.php

NLTK -- the Natural Language Toolkit -- is a suite of open source
Python modules, data and documentation for research and development in
natural language processing. NLTK contains code supporting dozens of
NLP tasks, along with 30 popular corpora and extensive documentation
including a 360-page online book.  Distributions for Windows, Mac OS X
and Linux are available.  The toolkit has been used in 50+ university courses
in over 15 countries, and is in the top 0.1% of SourceForge projects
(32,000 downloads in the past 12 months).

Contents: NLTK consists of over 50k lines of Python code and 480 MB of data:

Corpora: Treebanks (English, Chinese, Dutch, Catalan, Spanish, Portuguese);
    POS-tagged corpora including the Brown Corpus; text corpora;
    PP attachment, named entity, WSD, TIMIT sample,
    Chat-80 database, WordNet, CMU Pronunciation Dictionary.
Tokenizers: whitespace, newline, blankline, word, wordpunct,
    treebank, regexp, Punkt sentence segmenter
Stemmers: Porter, Lancaster, regexp
Taggers: regexp, n-gram, backoff, Brill, HMM
Parsers: recursive descent, shift-reduce, chunk, chart,
    feature-based, probabilistic, ...
Semantic interpretation: untyped lambda calculus,
    first-order models, parser interface
WordNet: WordNet interface, lexical relations, similarity
Classifiers: decision tree, maximum entropy, naive Bayes, Weka interface
Clusterers: expectation maximization, agglomerative, k-means
Evaluation: accuracy, precision, recall, F-measure, windowdiff
Estimation: uniform, maximum likelihood, Lidstone, Laplace,
    expected likelihood, heldout, cross-validation, Good-Turing, Witten-Bell
Miscellaneous: feature detection, unification, chatbots, many utilities
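
To give a feel for what the estimation entries above refer to, here is a
minimal sketch in plain Python. It is not NLTK's own code and does not use
the NLTK API; the function name lidstone_estimate, the toy sentence, and
the bin count of 10 are assumptions made up for illustration. It relies
only on the standard definition of Lidstone smoothing, in which a
pseudo-count gamma is added to every bin: gamma = 0 gives the maximum
likelihood estimate, gamma = 0.5 the expected likelihood estimate, and
gamma = 1 the Laplace estimate.

```python
from collections import Counter


def lidstone_estimate(counts, gamma, bins):
    """Return a function mapping a sample to its Lidstone-smoothed probability.

    counts: Counter of observed samples
    gamma:  pseudo-count added to every bin
            (0 = maximum likelihood, 0.5 = expected likelihood, 1 = Laplace)
    bins:   number of possible sample values, supplied by the caller
    """
    total = sum(counts.values())
    denominator = total + gamma * bins
    if denominator == 0:
        raise ValueError("no observations and gamma is zero")

    def prob(sample):
        # A Counter returns 0 for samples that were never observed.
        return (counts[sample] + gamma) / denominator

    return prob


# Toy data: six tokens, five distinct word types.
words = "the cat sat on the mat".split()
counts = Counter(words)
bins = 10  # hypothetical vocabulary size, larger than the observed types

mle = lidstone_estimate(counts, 0, bins)
ele = lidstone_estimate(counts, 0.5, bins)
laplace = lidstone_estimate(counts, 1, bins)

for name, estimate in [("MLE", mle), ("ELE", ele), ("Laplace", laplace)]:
    # "the" was seen twice; "dog" was never seen.
    print(name, estimate("the"), estimate("dog"))
```

The unsmoothed estimate assigns zero probability to the unseen word, while
the two smoothed estimates reserve some probability mass for it. The
heldout, cross-validation, Good-Turing and Witten-Bell estimators work
differently and are not covered by this sketch.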

Changes: Version 0.9 is substantially revised and expanded from version 0.8.
The entire toolkit can be accessed via a single import statement
"import nltk", and there is a more convenient naming scheme. Calling
deprecated functions generates messages that help programmers update
their code. The corpus, tagger, and classifier modules have been
redesigned. All functionality of the old NLTK 1.4.3 is now covered by
NLTK-Lite 0.9. The book has been revised and expanded. A new data
package incorporates the existing corpus collection and contains new
sections for pre-specified grammars and pre-computed models. Several
new corpora have been added, including treebanks for Portuguese,
Spanish, Catalan and Dutch. A Macintosh distribution is provided.  For
full details of the changes, please see:
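
As an aside on the deprecation messages mentioned above: the general
technique can be sketched with Python's standard warnings module. This is
an illustration only, not a description of how NLTK implements it. The
decorator name deprecated and the functions old_tokenize and
split_on_whitespace are hypothetical and do not correspond to NLTK names.

```python
import functools
import warnings


def deprecated(replacement):
    """Hypothetical decorator: warn callers and point them at a replacement."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__}() is deprecated; use {replacement} instead.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def split_on_whitespace(text):
    """Hypothetical new-style function."""
    return text.split()


@deprecated("split_on_whitespace()")
def old_tokenize(text):
    """Hypothetical old-style function, kept so existing code still runs."""
    return split_on_whitespace(text)


# Capture the warning so that it can be inspected and printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tokens = old_tokenize("NLTK-Lite 0.9 has been released")

print(tokens)
print(caught[0].message)
```

The old function keeps working, and the message names the replacement, so
code can be migrated one call at a time.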
