Language detection module?

Alexander Williams thantos at chancel.org
Thu Oct 21 18:18:48 EDT 1999


On Thu, 21 Oct 1999 15:53:43 +0200, Dinu C. Gherman
<gherman at darwin.in-berlin.de> wrote:

>is there anything already like a function that I can pass an
>arbitrary string and it will tell me whether it is written in
>English, French, German, etc.? 

The method you use below is part of what I call the 'fast, dumb and
happy' method of algorithms.  :)  (No shame there, I use it all the
time.)  It takes careful crafting, but it's algorithmically simple.  If
you want something a bit more robust ...

Slide a window of 2 to 6 characters across the file and snip it into
bits.  Don't worry about capitalization or punctuation, take the
document raw.  Extract these Ngrams and build a vector of them, each
Ngram valued by its number of occurrences.  Repeat for a large corpus
of documents in different languages.  Now, begin clustering the
documents by their nearness to one another in Ngrammatic space.  You'll
find all the documents of a given language tend to hang together (not
the least reason for which is that they tend to use the same phrases
other languages don't).  As a side effect, you'll likely cluster
documents about similar subjects together, but don't mind that right
now.  :)
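
A rough sketch of the idea in Python, in case it helps.  The function
names (ngram_profile, cosine_similarity, nearest_document) are made up
for illustration, and cosine similarity is only one assumed way to
measure "nearness."  To keep it short, this looks up the single closest
known document rather than doing real clustering over a big corpus.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n_min=2, n_max=6):
    """Count every character Ngram of length n_min..n_max in the raw text.

    No lowercasing and no punctuation stripping: the text is taken raw.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse Ngram count vectors.

    Assumed nearness measure; any vector distance could be swapped in.
    """
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_document(sample, corpus):
    """Return the label of the corpus document nearest to the sample.

    corpus maps a label to its text.  This is a nearest-neighbour
    stand-in for clustering, not a clustering algorithm.
    """
    profile = ngram_profile(sample)

    def score(label):
        return cosine_similarity(profile, ngram_profile(corpus[label]))

    return max(corpus, key=score)
```

Feed nearest_document() a string plus a dict mapping language names to
sample texts and it hands back the name of the closest one.  With only
a sentence or two per language the result is shaky; the method wants a
large corpus to work from.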

-- 
Alexander Williams (thantos at gw.total-web.net)
"In the end ... Oblivion Always Wins."



