Announcing the Release of WordSegment Version 0.5.2

What is WordSegment?
-------------------------

WordSegment is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.

It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig in the book “Beautiful Data” (Segaran and Hammerbacher, 2009). The data files are derived from the Google Web Trillion Word Corpus. The module has 100% code coverage and complete documentation.

What's new in 0.5.2?
--------------------

- Updated documentation with tutorial and API reference.
- Removed unigrams longer than 24 characters.
- Bug Fix: Converted all bigrams to lowercase.
- Bug Fix: Sanitize input text by lowercasing and removing non-alphanumerics.
- Replaced "memoize" decorator with local memo-dict for server use.

Links
-----

- Documentation: http://www.grantjenks.com/docs/wordsegment/
- Download: https://pypi.python.org/pypi/wordsegment
- Source: https://github.com/grantjenks/wordsegment
- Issues: https://github.com/grantjenks/wordsegment/issues

This release is backwards-compatible. Please upgrade.
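
Two of the changes listed above, input sanitization and the switch from a "memoize" decorator to a local memo-dict, can be illustrated with a minimal sketch. This is not the WordSegment source: the names `TOY_COUNTS`, `clean`, `score`, and `segment` are hypothetical, the counts are invented toy values, the unknown-word penalty is an assumption made for this example, and only unigrams are scored (the real module also uses bigrams).

```python
import math
import string

# Hypothetical toy counts for illustration only; the real module loads
# counts derived from the Google Web Trillion Word Corpus.
TOY_COUNTS = {"this": 50, "is": 80, "a": 100, "test": 20, "his": 10, "at": 30}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 24  # mirrors the 24-character unigram limit in the notes

ALPHANUMERIC = set(string.ascii_lowercase + string.digits)


def clean(text):
    """Lowercase text and drop every character that is not a-z or 0-9."""
    return "".join(ch for ch in text.lower() if ch in ALPHANUMERIC)


def score(word):
    """Log10 probability of a word; unknown words are penalized by length.

    The penalty formula is an assumption of this sketch.
    """
    if word in TOY_COUNTS:
        return math.log10(TOY_COUNTS[word] / TOTAL)
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Split text into the highest-scoring sequence of words."""
    memo = {}  # local to this call, so nothing accumulates across requests

    def search(rest):
        if not rest:
            return 0.0, []
        if rest in memo:
            return memo[rest]
        best = None
        # Try every prefix up to the length limit as the first word.
        for i in range(1, min(len(rest), MAX_WORD_LEN) + 1):
            head, tail = rest[:i], rest[i:]
            tail_score, tail_words = search(tail)
            candidate = (score(head) + tail_score, [head] + tail_words)
            if best is None or candidate > best:
                best = candidate
        memo[rest] = best
        return best

    return search(clean(text))[1]
```

With these toy counts, `segment("This is a TEST!")` returns `['this', 'is', 'a', 'test']`. Because the memo dictionary is created inside each call, it is discarded when the call returns, which is presumably why a local memo-dict suits long-running server processes better than a module-level cache that grows without bound.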
Grant Jenks