ANN: WordSegment 0.5.2 Released

Grant Jenks grant.jenks at
Thu Sep 3 00:27:46 CEST 2015

Announcing the Release of WordSegment Version 0.5.2

What is WordSegment?

WordSegment is an Apache2 licensed module for English word segmentation,
written in pure-Python and based on a trillion-word corpus. It is based on code
from the chapter “Natural Language Corpus Data” by Peter Norvig in the book
“Beautiful Data” (Segaran and Hammerbacher, 2009). Data files are derived from
the Google Web Trillion Word Corpus. The module has 100% code coverage and
complete documentation.
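
To illustrate the idea behind corpus-based segmentation, here is a minimal
sketch of unigram scoring in the spirit of Norvig's chapter. It is not the
module's actual code: the counts in COUNTS are made-up toy values, and the
names word_score and segmentation_score are hypothetical. The real module
scores words against counts derived from the corpus data files.

```python
import math

# Hypothetical toy counts for illustration only; the real data files are
# derived from the Google Web Trillion Word Corpus.
COUNTS = {'this': 500, 'is': 800, 'a': 1000, 'test': 200}
TOTAL = sum(COUNTS.values())


def word_score(word):
    """Return the log10 probability of a word under a unigram model."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / TOTAL)
    # Unknown words get a penalty that grows with their length, so long
    # unsplit runs of letters lose to splits made of known words.
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


def segmentation_score(words):
    """Score a candidate segmentation as the sum of its word scores."""
    return sum(word_score(word) for word in words)


# Compare a few candidate segmentations of "thisisatest".
candidates = [
    ['this', 'is', 'a', 'test'],
    ['thisis', 'atest'],
    ['thisisatest'],
]
best = max(candidates, key=segmentation_score)
print(best)
```

With these toy counts, the split into four known words scores highest because
each unknown chunk pays the length-based penalty.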

What's new in 0.5.2?

- Updated documentation with tutorial and API reference.
- Removed unigrams longer than 24 characters.
- Bug Fix: Converted all bigrams to lowercase.
- Bug Fix: Sanitized input text by lowercasing and removing non-alphanumerics.
- Replaced "memoize" decorator with local memo-dict for server use.
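
The input sanitizing and memo-dict changes above can be sketched roughly as
follows. This is an assumption-laden illustration, not the module's source:
the names clean, segment, search, and toy_score are hypothetical, the scorer
is passed in as a plain callable, and the 24-character cap on candidate words
is borrowed from the unigram limit in the list above. The point of the local
dict is that cached results are discarded when each call returns, which
presumably matters in a long-running server process where a module-level
cache would keep growing.

```python
import re


def clean(text):
    """Lowercase text and drop every character other than a-z and 0-9."""
    return re.sub(r'[^a-z0-9]', '', text.lower())


def segment(text, score, max_word_length=24):
    """Return the highest-scoring split of text into words.

    score is a hypothetical callable mapping a word to a log-probability.
    """
    text = clean(text)
    memo = {}  # local to this call, so it is discarded on return

    def search(start):
        # Return (total score, words) for the best split of text[start:].
        if start == len(text):
            return 0.0, []
        if start in memo:
            return memo[start]
        best = None
        last = min(len(text), start + max_word_length)
        for end in range(start + 1, last + 1):
            word = text[start:end]
            rest_score, rest_words = search(end)
            candidate = (score(word) + rest_score, [word] + rest_words)
            if best is None or candidate[0] > best[0]:
                best = candidate
        memo[start] = best
        return best

    return search(0)[1]


# Hypothetical toy log-probabilities, for illustration only.
WORDS = {'this': -0.7, 'is': -0.5, 'a': -0.4, 'test': -1.1}


def toy_score(word):
    # Unknown words are penalized in proportion to their length.
    return WORDS.get(word, -5.0 * len(word))


print(segment('This is a test!', toy_score))
```

The sketch recurses once per character position, so it suits short inputs;
very long texts would need an iterative formulation.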


- Documentation:
- Download:
- Source:
- Issues:

This release is backwards-compatible. Please upgrade.
