N-grams
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Nov 10 03:01:15 EST 2016
On Thursday 10 November 2016 17:53, Wolfram Hinderer wrote:
[...]
> 1. The startup looks slightly ugly to me.
> 2. If n is large, tee has to maintain a lot of unnecessary state.
But n should never be large.
In practice, n-grams are rarely larger than n=3. Occasionally you might use n=4
or even n=5, but I can't imagine using n=20 in practice, let alone the n=500 of
your example.
See, for example:
http://stackoverflow.com/a/10382221
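For concreteness, here is the sort of tee-based recipe I assume is under
discussion (a sketch of the usual itertools pattern, not necessarily
Wolfram's actual code):

from itertools import tee

def ngrams(iterable, n):
    # Make n copies of the iterator, advance the i-th copy by i
    # steps, then zip them together. tee() has to buffer up to n
    # items behind the scenes, which is the state objection above.
    its = tee(iterable, n)
    for i, it in enumerate(its):
        for _ in range(i):
            next(it, None)
    return zip(*its)

list(ngrams('abcde', 3))
# -> [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]

For n=3 that buffered state is trivial; it only hurts for the large n
you shouldn't need anyway.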
In practice, n-grams with large n run into three problems:
- for word-based n-grams, n=3 is about the maximum needed;
- for other applications, n can be moderately large, but n-grams are a
kind of auto-correlation function, and few data sets are auto-correlated
*that* deeply, so you still rarely need large values of n;
- there is the problem of sparse data and generating a good training corpus
(there's a quick sketch after this list). For n=10, and just using ASCII
letters (lowercase only), there are 26**10 = 141167095653376 possible
10-grams. Where are you going to find a text that includes more than a
tiny fraction of those?
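A quick back-of-the-envelope check (my own sketch, nothing from the
thread):

def distinct_ngrams(text, n):
    # A text of length L contains at most L - n + 1 n-grams, so even
    # an enormous corpus covers a vanishing fraction of 26**10.
    return {text[i:i+n] for i in range(len(text) - n + 1)}

text = 'the quick brown fox jumps over the lazy dog ' * 10000
print(len(distinct_ngrams(text, 10)))  # a few dozen distinct 10-grams
print(26**10)                          # 141167095653376 possibilities

Even a 1 GB corpus gives you at most about 10**9 ten-grams, well under a
thousandth of a percent of the possibilities.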
--
Steven
299792.458 km/s — not just a good idea, it’s the law!