N-grams
Steve D'Aprano
steve+python at pearwood.info
Wed Nov 9 08:38:05 EST 2016
The documentation for the itertools module has this nice implementation
of a fast bigram function:
from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
https://docs.python.org/3/library/itertools.html#itertools-recipes
Which gives us an obvious trigram and 4-gram implementation:
def trigram(iterable):
    a, b, c = tee(iterable, 3)
    next(b, None)
    next(c, None); next(c, None)
    return zip(a, b, c)

def four_gram(iterable):
    a, b, c, d = tee(iterable, 4)
    next(b, None)
    next(c, None); next(c, None)
    next(d, None); next(d, None); next(d, None)
    return zip(a, b, c, d)
And here's an implementation for arbitrary n-grams:
def ngrams(iterable, n=2):
    if n < 1:
        raise ValueError
    t = tee(iterable, n)
    for i, x in enumerate(t):
        for j in range(i):
            next(x, None)
    return zip(*t)
Can we do better, or is that optimal (for any definition of optimal that you
like)?
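
For comparison, here is one variation to measure against, offered only as a sketch: it swaps the nested advance-by-hand loops for itertools.islice, which skips the first i items of the i-th copy lazily. The name ngrams_islice is made up for illustration, and I make no claim that it is faster; it just moves the skipping into islice.

```python
from itertools import islice, tee

def ngrams_islice(iterable, n=2):
    """Hypothetical variant of ngrams(): yield successive n-tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    copies = tee(iterable, n)
    # The i-th copy starts i items in, so zip lines them up as
    # (s0, s1, ..., s[n-1]), (s1, s2, ..., s[n]), ...
    return zip(*(islice(it, i, None) for i, it in enumerate(copies)))

print(list(ngrams_islice("abcde", 3)))
# [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]
```

As with the tee-based versions above, an input shorter than n simply produces no tuples, since zip stops at the shortest iterator.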
--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.