Language detection module?

Fri Oct 22 11:21:30 EDT 1999

Fernando Pereira wrote:
> 
> In article <slrn80v4cb.49r.thantos at chancel.org>, Alexander Williams
> <thantos at chancel.org> wrote:
> 
> > On Thu, 21 Oct 1999 15:53:43 +0200, Dinu C. Gherman
> > <gherman at darwin.in-berlin.de> wrote:
> >
> > >is there anything already like a function that I can pass an
> > >arbitrary string and it will tell me whether it is written in
> > >English, French, German, etc.?
> >
> > The method you use below is part of what I call the 'fast, dumb and
> > happy' method of algorithms.  :)  (No shame there, I use it all the
> > time.)  It takes careful crafting, but it's algorithmically simple.  If
> > you want something a bit more robust ...

True, just 10 minutes' work, so don't expect more... ;-)

> The standard way of doing this doesn't involve clustering, which is
> hard, provided that you have training samples for each target language.
> From the training sample for language L, one builds a "language model"
> M[L] that estimates the probability M[L](S) of any string S according
> to the language. Given a test string T, guess its language to be the L
> such that M[L](T) is highest. One can build language models in many
> ways, but most of the simple ones involve n-gram statistics. There are
> certain subtleties on how to deal with test strings containing n-grams
> not found in the training sample for some language. For details, consult

Yep, but clustering needs a lot of sample data, and even then you
will probably have to decide on the cluster shape and the
corresponding best algorithm... which, from the little I recall,
is pretty complex.
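
For reference, the language-model approach quoted above can be sketched
roughly as follows. This is only a minimal illustration under stated
assumptions: the tiny training samples, the function names (ngrams,
train, log_prob, guess_language) and the vocab_size smoothing constant
are all made up for the example, and the score is a simplified product
of add-one-smoothed character trigram frequencies rather than a proper
conditional n-gram model.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Return the character n-grams of text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(sample, n=3):
    """Build a model M[L]: n-gram counts plus the total number of n-grams."""
    counts = Counter(ngrams(sample, n))
    return counts, sum(counts.values())

def log_prob(model, text, n=3, vocab_size=10000):
    """Approximate log M[L](text).

    Add-one smoothing keeps n-grams that never occurred in the training
    sample from driving the probability to zero. vocab_size is an
    assumed constant, not an estimate from data.
    """
    counts, total = model
    return sum(math.log((counts[g] + 1) / (total + vocab_size))
               for g in ngrams(text, n))

def guess_language(models, text, n=3):
    """Pick the language L whose model gives the test string the highest score."""
    return max(models, key=lambda lang: log_prob(models[lang], text, n))

# Toy training samples (hypothetical; real use needs far more text).
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the house",
    "German": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft der hund im haus",
    "French": "le renard brun rapide saute par dessus le chien paresseux "
              "et puis le chien dort dans la maison",
}
MODELS = {lang: train(text) for lang, text in SAMPLES.items()}

print(guess_language(MODELS, "the dog and the fox"))
```

With samples this small the guesses are only reliable for strings that
resemble the training text; the point is the shape of the method, not
its accuracy.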

I think in my case a "human pre-clustering", i.e. a method roughly
following my previous example (Quick and Dirty programming, if
you want to call it that), will do the job -- without any kind
of sample data, except for some built-in "rules".
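
To give an idea of what such built-in "rules" might look like, here is
a rough sketch that counts hits against a few hand-picked common words
per language. It is not the earlier example from this thread; the word
lists, the RULES table and the detect_language name are assumptions
chosen purely for illustration.

```python
import re

# Hand-written "rules": a few very common words per language.
# These lists are illustrative assumptions, not a vetted resource.
RULES = {
    "English": {"the", "and", "of", "is", "to", "in", "that", "it"},
    "French": {"le", "la", "les", "et", "est", "de", "un", "une", "que"},
    "German": {"der", "die", "das", "und", "ist", "ein", "nicht", "zu"},
}

def detect_language(text):
    """Return the language whose word list matches most often, or None."""
    # Split into lowercase runs of letters (digits and punctuation dropped).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(1 for w in words if w in hints)
              for lang, hints in RULES.items()}
    best = max(scores, key=scores.get)
    # No hit at all means the rules have nothing to say about this string.
    return best if scores[best] > 0 else None

print(detect_language("Das ist nicht der Hund"))
```

No training data is needed, but short strings or strings without any
of the listed words will come back as None or as a weak guess.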

Coming-back-with-some-examples-one-day-maybe'ly,

Dinu

-- 
Dinu C. Gherman