[Tutor] i want to build my own arabic training corpus data and use the NLTK to deal with

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Wed Aug 3 19:36:08 CEST 2005

On Wed, 3 Aug 2005, enas khalil wrote:

> i want to build my own arabic training corpus data and use the NLTK to
> parse and make test for unknown data

Hi Enas,

By NLTK, I'll assume that you mean the Natural Language Toolkit at:


Have you gone through the introduction and tutorials on the NLTK web
site?


> how can i build this file and make it available to treat with it using
> different NLTK classes

Your question is a bit specialized, so we may not be the best people to
ask about this.

The part that you may want to think about is how to break a corpus into a
sequence of tokens, since tokens are primarily what the NLTK classes work
with.
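For intuition, here is what such a token sequence looks like in plain
Python, using nothing but str.split -- a sketch independent of NLTK's own
classes, which wrap the same kind of sequence in Token objects:

```python
# The simplest possible whitespace tokenization, using only plain Python's
# str.split -- no NLTK required.
text = "hello world this is a test"
tokens = text.split()
print(tokens)   # a list of six word strings
```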

This may or may not be immediately easy, depending on how much you can
take advantage of existing NLTK classes.  As the documentation in NLTK
notes:

"""If we turn to languages other than English, segmenting words can be
even more of a challenge. For example, in Chinese orthography, characters
correspond to monosyllabic morphemes. Many morphemes are words in their
own right, but many words contain more than one morpheme; most of them
consist of two morphemes. However, there is no visual representation of
word boundaries in Chinese text."""

I don't know how Arabic works, so I'm not sure if the caveat above is
something that we need to worry about.

There are a few built-in NLTK tokenizers that break a corpus into tokens,
including a WhitespaceTokenizer and a RegexpTokenizer class, both
introduced here:


For example:

>>> import nltk.token
>>> mytext = nltk.token.Token(TEXT="hello world this is a test")
>>> mytext
<hello world this is a test>

At the moment, this is a single token.  We can use a naive approach in
breaking this into words by using whitespace as our delimiter:

>>> import nltk.tokenizer
>>> nltk.tokenizer.WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(mytext)
>>> mytext
<[<hello>, <world>, <this>, <is>, <a>, <test>]>

And now our text is broken into a sequence of discrete tokens, and we can
play with the 'subtokens' of our text:

>>> mytext['WORDS']
[<hello>, <world>, <this>, <is>, <a>, <test>]
>>> len(mytext['WORDS'])
6

If Arabic follows conventions that fit closely with the assumptions of
those tokenizers, you should be in good shape.  Otherwise, you'll probably
have to do some work to build your own customized tokenizers.
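As one possible starting point, here is a sketch of a customized
tokenizer written in plain Python with the standard library's re module.
The Arabic Unicode block used in the pattern (U+0600 through U+06FF) is
an assumption about the character range of your text, and the pattern
itself is only illustrative -- real Arabic text may need extra handling
for diacritics, punctuation, and attached clitics:

```python
import re

# A minimal custom tokenizer sketch, independent of NLTK.  The pattern
# treats a run of Arabic letters (U+0600-U+06FF, an assumed range) or a
# run of Latin letters/digits as one token; everything else separates
# tokens.
TOKEN_PATTERN = re.compile(u'[\u0600-\u06FF]+|[A-Za-z0-9]+')

def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_PATTERN.findall(text)

# Mixed Latin/Arabic input: yields three tokens, one per script run.
print(tokenize(u'hello \u0645\u0631\u062d\u0628\u0627 world'))
```

A tokenizer like this can feed its output into whatever processing you do
next; the important part is deciding what counts as a token boundary for
your corpus before committing to one pattern.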
