Byte Offsets of Tokens, Ngrams and Sentences?
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Fri Aug 6 05:49:33 EDT 2010
On Fri, 06 Aug 2010 06:07:32 -0300, Muhammad Adeel <nawabadeel at gmail.com>
wrote:
> Does any one know how to tokenize a string in python that returns the
> byte offsets and tokens? Moreover, the sentence splitter that returns
> the sentences and byte offsets? Finally n-grams returned with byte
> offsets.
>
> Input:
> This is a string.
>
> Output:
> This 0
> is 5
> a 8
> string. 10
Like this?
py> import re
py> s = "This is a string."
py> for g in re.finditer(r"\S+", s):
...     print(g.group(), g.start())
...
This 0
is 5
a 8
string. 10
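
One caveat: if s is a unicode string (every str in Python 3), g.start() is a
character index, not a byte offset. The two agree for pure ASCII input such as
the example, but not once the text contains multi-byte characters. Below is a
rough sketch of how byte offsets, n-grams and sentences might be handled. The
function names, the UTF-8 default and the punctuation-based sentence rule are
all assumptions made for illustration, not something from the original
question, and the sentence splitter is deliberately naive.

```python
import re


def tokens_with_offsets(text, encoding="utf-8"):
    """Yield (token, byte_offset) for each whitespace-delimited token.

    Hypothetical helper: the encoding is an assumption, pick the one
    your data is actually stored in.
    """
    for m in re.finditer(r"\S+", text):
        # m.start() counts characters; encode the prefix to count bytes.
        yield m.group(), len(text[:m.start()].encode(encoding))


def ngrams_with_offsets(text, n, encoding="utf-8"):
    """Yield (ngram, byte_offset_of_first_token) for token n-grams."""
    toks = list(tokens_with_offsets(text, encoding))
    for i in range(len(toks) - n + 1):
        window = toks[i:i + n]
        yield " ".join(tok for tok, _ in window), window[0][1]


def sentences_with_offsets(text, encoding="utf-8"):
    """Yield (sentence, byte_offset) using a naive punctuation rule.

    A sentence is assumed to end at '.', '!' or '?'. Abbreviations
    such as "Dr." will be split incorrectly.
    """
    for m in re.finditer(r"[^\s.!?][^.!?]*[.!?]*", text):
        yield m.group().rstrip(), len(text[:m.start()].encode(encoding))


if __name__ == "__main__":
    s = "This is a string."
    for tok, off in tokens_with_offsets(s):
        print(tok, off)
    for gram, off in ngrams_with_offsets(s, 2):
        print(gram, off)
    for sent, off in sentences_with_offsets("This is a string. Is it? Yes."):
        print(sent, off)
```

Encoding the prefix for every match is quadratic in the length of the text,
which is fine for short strings. For large inputs it would probably be simpler
to encode the text once and run the regular expression over the bytes object,
so that start() is already a byte offset.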
--
Gabriel Genellina