[soc2008-general] SoC project

Paulo Malvar paulomal at gmail.com
Fri Mar 28 01:54:18 CET 2008


Hi everybody,

I'm sending my project proposal to see if anyone has comments or
suggestions.


i) What project:

This project aims to develop a language identifier, implemented in Python,
that classifies texts among several closely related languages, for inclusion
in the NLTK package.
    Unlike the text classifiers already included in NLTK, which classify
texts according to topic, a language identifier models a different type of
object: languages, or rather texts as instances of particular languages.
Because the features inherent to these objects are different, language
identifiers clearly differ from topic classifiers in the feature selection
and feature engineering strategies they must implement.

ii) Why this project:

Language identification is one of the most important topics in Natural
Language Processing (NLP). Any NLP application intended to process
multilingual resources depends on a module that first determines what
language a particular resource is written in.
    While language identification for highly differentiated languages can
be a rather simple task, language identification for closely related
languages is a challenge that requires a substantial feature engineering
effort to be accomplished reliably and accurately. Linguistic closeness
means that, in contrast to highly differentiated languages, closely related
languages share a large number of overlapping features that must be
identified and, therefore, avoided for the purposes of distinguishing those
languages.
    Identifying what types of features are most suitable for distinguishing
between closely related languages requires a deep understanding of the
linguistic structure of those languages: their orthography, morphology and
syntax. Only with this type of understanding can highly distinctive features
be brought out and, therefore, used to train accurate classification models.
    On the one hand, since NLTK lacks any kind of language identification
module, this project is of particular interest from the point of view of the
completeness of this popular Natural Language Toolkit.
    On the other hand, since NLTK is one of the most widely used NLP
toolkits in the research and educational community, it is worth pointing out
that the Python community will also benefit from the completeness of such a
popular toolkit. The development of a high-quality language identifier for
closely related languages for NLTK will help make this toolkit a standard
for NLP and Information Retrieval courses, spreading, as a direct side
effect, the use of Python as a programming language for NLP projects.

iii) How to carry out this project:

This project will be carried out in five distinct stages:

1- Compilation of a language-specific corpus:

Although this project aims to develop a general, that is,
non-language-specific, language identification module, at least for purposes
of testing and evaluation it will focus on collecting training data for
three particular closely related languages: Galician, Portuguese and Spanish
(all of them spoken in the Iberian Peninsula and widely represented around
the world).
    To this end, the data will be collected by extracting news articles
from four different on-line newspapers:

- www.vieiros.com (for Galician)
- www.elpais.com (for Spanish)
- www.publico.clix.pt and http://dn.sapo.pt (for Portuguese)

This stage will be one of the most time-consuming tasks of this project. I
estimate that it will be accomplished in around two or three weeks.

2- Preprocessing of the data:

This stage consists of normalizing the collected data in order to reduce the
amount of data necessary to carry out this project. In particular, two
preprocessing tasks will be carried out: lowercasing and tokenization.
Prototypes of localization algorithms have already been developed for
several of the previous projects I have been involved in. Therefore, this
task is estimated to take no longer than a week.
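
A minimal sketch of the two preprocessing steps named above, lowercasing and
tokenization. The function name and the token pattern (runs of Unicode
letters) are assumptions made for illustration only; the actual tokenizer
may treat digits, hyphens or apostrophes differently.

```python
import re

# Assumed token definition: one or more Unicode letters.
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
TOKEN_RE = re.compile(r"[^\W\d_]+")


def preprocess(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


tokens = preprocess("A Coruña é unha cidade.")
```

Accented characters such as "ñ" and "é" are kept, since they are likely to
be useful for telling Galician, Portuguese and Spanish apart.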

3- Feature selection:

As pointed out above, feature selection is one of the most important tasks
in the design and development of language identifiers.
    Since this project involves implementing several decision algorithms,
each with particular strengths and drawbacks, it is open to the selection of
any type of feature, such as pure or hybrid word-based and/or
character-based n-grams.
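
As an illustration of the feature types mentioned above, the following
sketch extracts character n-grams and word n-grams as frequency counts. The
function names, the default n=3 and the padding with a single space at each
end are assumptions, not decisions already made for this project.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams; the text is padded with one space on each
    side so that word-initial and word-final sequences are captured."""
    padded = " " + text + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def word_ngrams(tokens, n=2):
    """Count word n-grams over an already tokenized text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


char_features = char_ngrams("não")
word_features = word_ngrams(["o", "meu", "can"])
```

A hybrid feature set could simply merge both counters, for example by
prefixing each key with its type before merging.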

4- Machine Learning Algorithms Implementation:

In combination with my own implementations of Machine Learning (ML)
algorithms, this project aims to take advantage of already existing Python
implementations of particular ML algorithms, so that the NLTK project can be
expanded and benefit from both new and existing Python implementations of
those algorithms.
    In particular, this project intends to include the following ML
algorithms: Event-based Naive Bayes, Maximum Entropy and Support Vector
Machines.
    Since stages 3 and 4 will be carried out together in a dynamic process
of implementation, these two stages are intended to represent the core of
this project and to take around five or six weeks.
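
To make the first of these algorithms concrete, here is a toy sketch of a
Naive Bayes language identifier over character trigrams. It assumes that
"event-based" refers to the multinomial event model with add-one smoothing;
the class name, method names and the three training sentences are invented
for illustration and are not part of NLTK or of the planned corpus.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Return the character trigrams of the lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NaiveBayesLangId:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.gram_counts = defaultdict(Counter)  # language -> trigram counts
        self.doc_counts = Counter()              # language -> number of texts
        self.vocab = set()                       # all trigrams seen in training

    def train(self, labeled_texts):
        for text, lang in labeled_texts:
            grams = trigrams(text)
            self.gram_counts[lang].update(grams)
            self.doc_counts[lang] += 1
            self.vocab.update(grams)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = trigrams(text)
        best_lang, best_score = None, float("-inf")
        for lang, n_docs in self.doc_counts.items():
            total = sum(self.gram_counts[lang].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for gram in grams:
                count = self.gram_counts[lang][gram]
                score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


model = NaiveBayesLangId()
model.train([
    ("o goberno aprobou unha nova lei", "gl"),
    ("o governo aprovou uma nova lei", "pt"),
    ("el gobierno aprobó una nueva ley", "es"),
])
guess = model.classify("uma lei")
```

With so little training data the model only separates inputs that contain
trigrams seen in exactly one language; a realistic model would be trained on
the corpus collected in stage 1.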

5- Evaluation:

The performance of the different language classification algorithms will be
evaluated using a collection of unseen language-specific texts, which will
also be collected during stage 1, described above.
    To find the benefits of each algorithm, test results will be collected
by training different models on training data of different sizes. A final
study of the statistical significance of the differences among those models
will be performed.
    This final stage will be performed during the last two weeks of the
coding period for the Google Summer of Code '08.
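
The evaluation over training sets of different sizes could be organized
roughly as follows. This is only a sketch: the helper names are
hypothetical, accuracy is assumed as the metric, and a trivial
majority-class baseline stands in for the real classifiers. The statistical
significance test is not shown, since it has not been chosen yet.

```python
from collections import Counter


def accuracy(classify, test_data):
    """Fraction of (text, language) pairs that classify() labels correctly."""
    correct = sum(1 for text, lang in test_data if classify(text) == lang)
    return correct / len(test_data)


def learning_curve(train, train_data, test_data, sizes):
    """Train on the first `size` examples for each size and return a
    dictionary mapping each size to the accuracy on the unseen test data."""
    results = {}
    for size in sizes:
        classify = train(train_data[:size])
        results[size] = accuracy(classify, test_data)
    return results


def train_majority(labeled_texts):
    """Stand-in trainer: always predicts the most frequent training label."""
    majority = Counter(lang for _, lang in labeled_texts).most_common(1)[0][0]
    return lambda text: majority


train_data = [("a", "gl"), ("b", "pt"), ("c", "pt")]
test_data = [("x", "pt"), ("y", "pt"), ("z", "gl"), ("w", "pt")]
curve = learning_curve(train_majority, train_data, test_data, [1, 3])
```

Any real classifier can be plugged in by passing a `train` function that
takes labeled texts and returns a `classify` function.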

iv) Why Paulo Malvar Fernández as the developer of this project:

Given the type of specialized knowledge necessary to accomplish this
project, I believe I am the appropriate person to carry it out for three
main reasons.
    First, regarding the compilation of the data necessary to train ML
models, I have experience in compiling corpora:

- two literary parallel corpora, one of which was compiled during my
internship at Imaxin Software and the other as part of my ongoing Ph.D.
research at the University of Santiago de Compostela.
- one software localization parallel corpus, which was compiled for my
Master's thesis research at San Diego State University (SDSU).

Second, regarding the selection of features to train the ML models, besides
being a proficient speaker of Galician, Portuguese and Spanish, my
undergraduate degree, Portuguese Philology, provided me with a deep
theoretical knowledge of those languages. Therefore, I can take advantage of
this practical and theoretical knowledge to select appropriate features that
will help the different ML algorithms to learn efficient language
identification models.
    Finally, given the training in Computational Linguistics I acquired
during my Master's in Computational Linguistics at SDSU, I have the
necessary knowledge to formalize the linguistic data into suitable data
structures for the different ML algorithms I intend to implement.
    Furthermore, Python was the programming language on which we focused in
this Master's program. Therefore, I have the experience working with Python
necessary to carry out this project.


Take care everybody!!

-- 
Paulo Malvar Fernández

Research Assistant of the SDSU Computational Linguistics Lab

http://paulomalvar.homeunix.org