Language detection module?
Alexander Williams
thantos at chancel.org
Thu Oct 21 18:18:48 EDT 1999
On Thu, 21 Oct 1999 15:53:43 +0200, Dinu C. Gherman
<gherman at darwin.in-berlin.de> wrote:
>is there anything already like a function that I can pass an
>arbitrary string and it will tell me whether it is written in
>English, French, German, etc.?
The method you use below is part of what I call the 'fast, dumb and
happy' method of algorithms. :) (No shame there, I use it all the
time.) It takes careful crafting, but it's algorithmically simple. If
you want something a bit more robust ...
Take a sliding window of 2 to 6 characters and snip the file into bits.
Don't worry about capitalization or punctuation, take the document
raw. Extract these Ngrams and create a sort of vector of them, each
Ngram valued with its number of occurrences. Repeat for a large corpus of
documents in different languages. Now, begin clustering the documents
based on the nearness of other documents in Ngrammatic space. You'll
find all the documents of a given language tend to hang together (not
least because they share phrases that other languages don't use). As a
side effect, you'll likely cluster
documents about similar subjects together, but don't mind that right
now. :)
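
Here is a rough sketch of the idea in Python, nothing definitive. The
function names (ngram_vector, cosine_similarity, nearest_language) are
made up for illustration, and it cheats a little: instead of real
clustering it just picks the labelled sample nearest to the input in
Ngram space, using cosine similarity as the measure of "nearness".

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n_min=2, n_max=6):
    """Count every character Ngram of length n_min..n_max in the raw text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine between two sparse count vectors; higher means nearer."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is near nothing
    return dot / (norm_a * norm_b)


def nearest_language(text, corpus):
    """Return the label of the sample in corpus nearest to text.

    corpus is assumed to map a language label to one sample string;
    this stands in for the clustering step, it is not clustering itself.
    """
    vector = ngram_vector(text)
    return max(
        corpus,
        key=lambda label: cosine_similarity(vector, ngram_vector(corpus[label])),
    )
```

Feed nearest_language() a string and a dict of {language: sample text}
and it hands back the closest label. With tiny samples it will be
flaky; the bigger the corpus per language, the better it hangs together.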
--
Alexander Williams (thantos at gw.total-web.net)
"In the end ... Oblivion Always Wins."