[Tutor] how to convert between type string and token
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Mon Nov 14 19:20:13 EST 2005
On Mon, 14 Nov 2005, enas khalil wrote:
> hello all
[program cut]
Hi Enas,
You may want to try talking with the NLTK folks about this, as what you're
dealing with is a specialized subject. Also, have you gone through the
tokenization tutorial in:
http://nltk.sourceforge.net/tutorial/tokenization/nochunks.html#AEN276
and have you tried to compare your program to the ones in the tutorial's
examples?
Let's look at the error message.
> File "F:\MSC first Chapters\unigramgtag1.py", line 14, in -toplevel-
> for tok in train_tokens: mytagger.train(tok)
> File "C:\Python24\Lib\site-packages\nltk\tagger\__init__.py", line 324, in train
> assert chktype(1, tagged_token, Token)
> File "C:\Python24\Lib\site-packages\nltk\chktype.py", line 316, in chktype
> raise TypeError(errstr)
> TypeError:
> Argument 1 to train() must have type: Token
> (got a str)
This error message implies that each element in your train_tokens list is
a string and not a token.
The 'train_tokens' variable gets its values in the block of code:
###########################################
train_tokens = []
xx = Token(TEXT=open('fataha2.txt').read())
WhitespaceTokenizer().tokenize(xx)
for l in xx:
    train_tokens.append(l)
###########################################
Ok. I see something suspicious here. The for loop:
######
for l in xx:
    train_tokens.append(l)
######
assumes that we get tokens from the 'xx' token. Is this true? Are you
sure you don't have to specifically say:
######
for l in xx['SUBTOKENS']:
    ...
######
The example in the tutorial explicitly does something like this to iterate
across the subtokens of a token. What your loop does instead is iterate
across the property names of the token, which is almost certainly not what
you want.
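This is the same pitfall as iterating over a plain Python dictionary: a
dict-like object yields its keys, not its values. Here is a minimal sketch
of the distinction, using an ordinary dict to stand in for an NLTK Token
(the Token class in the traceback is from the old NLTK 1.x API, and the
property values shown are made up for illustration):

```python
# An old-style NLTK Token behaves like a mapping from property names
# (e.g. 'TEXT', 'SUBTOKENS') to values.  A plain dict stands in here.
xx = {
    'TEXT': 'one two three',
    'SUBTOKENS': ['one', 'two', 'three'],
}

# Iterating over the mapping itself yields the *property names* ...
props = [p for p in xx]
print(props)      # ['TEXT', 'SUBTOKENS'] -- just strings

# ... whereas indexing the 'SUBTOKENS' property yields the subtokens,
# which is what a training loop would actually want to append.
subtoks = [t for t in xx['SUBTOKENS']]
print(subtoks)    # ['one', 'two', 'three']
```

So appending the results of `for l in xx` hands train() a sequence of
strings, which is exactly what the TypeError complains about.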