Language detection with Python

Jeremiah Dodds jeremiah.dodds at gmail.com
Fri Apr 17 11:14:11 EDT 2009


On Fri, Apr 17, 2009 at 3:19 PM, S.Selvam <s.selvamsiva at gmail.com> wrote:

> Hi all,
>
> I am trying to do language detection in Python. I just need to check whether
> the input text is English or not.
>
> 1) I tried nltk's stopwords and compared them with the input text, but with
> only little success.
>
> 2) I used oice.langdet for language detection, which uses a bi-gram
> approach. It is also inefficient.
>
> I need the best way to detect English text.
>
> I welcome your suggestions ...
> --
> Yours,
> S.Selvam

I don't know anything about language detection, but my first attempt would
be something like:

1) Grab the first N whitespace-separated words from whatever file you're
trying to check.
2) Find out what percentage of them, if any, are in some dictionary file,
say /usr/share/dict/american-english on Ubuntu Linux.

If a high percentage of the words are found, the text is more than likely
English.
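
A rough sketch of that dictionary check might look like the following. The
function names, the default N of 100, and the 0.6 threshold are made up for
illustration and would need tuning; the word-list path is the Ubuntu one
mentioned above and may not exist on other systems.

```python
import string


def load_words(path="/usr/share/dict/american-english"):
    """Read a one-word-per-line dictionary file into a lowercase set.

    The default path is an assumption; pass your own if it differs.
    """
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def dictionary_ratio(text, words, n=100):
    """Return the fraction of the first n tokens that appear in `words`."""
    # Split on whitespace, then drop surrounding punctuation and case.
    tokens = [t.strip(string.punctuation).lower() for t in text.split()[:n]]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)


def looks_english(text, words, n=100, threshold=0.6):
    """Guess that text is English if enough tokens are dictionary words.

    The threshold is an arbitrary starting point, not a tested value.
    """
    return dictionary_ratio(text, words, n) >= threshold
```

With a real word list you would call it as
looks_english(open(some_file).read(), load_words()). Keeping the word set as
a parameter means it is loaded once and can be swapped for another list.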

Or, perhaps, check for some commonly used English words that appear only in
English. I'm not aware of any examples off the top of my head, as I only
know one language, but I'm sure there are some common English words that
are mostly unique to the language.
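
As a sketch of the common-words idea, one could count how often a few very
frequent English function words turn up. The ENGLISH_MARKERS list below is
hand-picked for illustration only; nobody has checked that these words are
unique to English, and the function name is made up as well.

```python
import re

# Illustrative, hand-picked list; NOT verified to be unique to English.
ENGLISH_MARKERS = frozenset({
    "the", "and", "of", "with", "that", "this",
    "is", "are", "was", "which", "have", "from",
})


def marker_ratio(text, markers=ENGLISH_MARKERS):
    """Return the fraction of tokens in `text` that are marker words."""
    # Lowercase first, then pull out runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in markers for t in tokens) / len(tokens)
```

A ratio well above zero would hint at English, but where to draw the line is
something you would have to work out against real samples.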