[I18n-sig] Given a string, how do you know the encoding/charset ?

Tom Emerson tree@basistech.com
Wed, 22 Nov 2000 09:14:16 -0500


chas writes:
> Please forgive my ignorance but I haven't been able to find
> an answer to what I thought was going to be a basic question:
> Given a string of unknown origin, how does one calculate
> or find its character-set ?  Is there any such module or
> is there any way to use the existing codecs for this ? 

The general problem of encoding (and language) detection is an open
area of research. Basis has a commercial product which does it with
great accuracy. You can find various packages on the web to do similar
things; probably the most comprehensive is the GPL'd TextCat,
available (in Perl) from

http://odur.let.rug.nl/~vannoord/TextCat/Demo/textcat.html

TextCat claims to support 77 languages and encodings. It uses a
statistical method to determine which encoding and language are
represented by a given stream of bytes, but the data that it builds
its language models from is quite small: a few hundred characters for
each, which is a *very* small sample.
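
To make the statistical idea concrete, here is a rough Python sketch of
rank-ordered byte n-gram profiles, the general family of technique that
tools of this kind rely on. It is not TextCat's code: the names
ngram_profile, out_of_place, guess and PENALTY, and the toy training
strings, are all invented for illustration.

```python
from collections import Counter

PENALTY = 300  # cost for an n-gram missing from a model; also the profile size cap


def ngram_profile(data: bytes, max_n: int = 3, top: int = PENALTY) -> list:
    """Return the most frequent byte n-grams (n = 1..max_n), most common first."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Sort by descending count; break ties on the bytes so results are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(sample: list, model: list) -> int:
    """Sum of rank differences between a sample profile and a model profile."""
    rank = {gram: i for i, gram in enumerate(model)}
    return sum(abs(i - rank[g]) if g in rank else PENALTY
               for i, g in enumerate(sample))


def guess(data: bytes, models: dict) -> str:
    """Name of the model whose profile is closest to the data's profile."""
    sample = ngram_profile(data)
    return min(models, key=lambda name: out_of_place(sample, models[name]))


# Toy training data: far too small for real use, for the reason given above.
russian = "привет мир это простой пример текста"
models = {
    "english/ascii": ngram_profile(b"hello world this is a simple sample of text"),
    "russian/koi8-r": ngram_profile(russian.encode("koi8_r")),
    "russian/utf-8": ngram_profile(russian.encode("utf_8")),
}

print(guess("привет мир".encode("koi8_r"), models))
```

Because the profiles are built from raw bytes, one model per
(language, encoding) pair answers both questions at once. The quality of
the guess depends entirely on how much training text sits behind each
model.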

> Is the only way to recreate something like
> http://www.mandarintools.com/codeguess.html ?  And is there really
> so much need for 'guesswork' ?

Yes, there is. How else would you do it? If you know the language (or
encoding) a priori then finding the other is a bit easier. But the
general problem is difficult.
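
As an illustration of the "known language" case, suppose you already
know the text is Russian and only need the encoding. The Python sketch
below decodes the bytes under a few candidate encodings and keeps the one
whose result looks most like ordinary lowercase Russian. The scoring rule
is a crude heuristic of my own choosing, and the names russian_score and
guess_russian_encoding and the candidate list are assumptions, not part
of any library.

```python
# Lowercase Russian letters plus the space character.
RUSSIAN_LOWER = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя ")


def russian_score(text: str) -> float:
    """Fraction of characters that are lowercase Russian letters or spaces."""
    if not text:
        return 0.0
    return sum(ch in RUSSIAN_LOWER for ch in text) / len(text)


def guess_russian_encoding(data: bytes,
                           candidates=("koi8_r", "cp1251", "iso8859_5", "utf_8")):
    """Return the candidate encoding whose decoding scores highest, or None."""
    best, best_score = None, -1.0
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not even valid in this encoding
        score = russian_score(text)
        if score > best_score:  # strict, so earlier candidates win ties
            best, best_score = enc, score
    return best


print(guess_russian_encoding("привет мир".encode("cp1251")))
```

It works on this input only because running Russian text is mostly
lowercase, while KOI8-R bytes read as Windows-1251 (and vice versa) come
out mostly uppercase. Short strings, all-caps text, or mixed-language
input can easily fool it.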

M.-A. Lemburg writes:
[...]
> There might still be some catches in this though, e.g. if the
> string source appears to use ASCII, but instead returns some
> superset of ASCII.

Indeed: how do you programmatically differentiate between ISO 8859-n,
1 <= n <= 9?
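
A small Python sketch of why this is hard: the same bytes decode without
error under every one of those nine parts, so the codecs alone cannot
tell you which one was meant. The helper name plausible_iso8859_parts is
made up for this example; the codec names are the standard Python ones.

```python
def plausible_iso8859_parts(data: bytes) -> dict:
    """Map codec name to decoded text for each ISO 8859-n (1 <= n <= 9) that accepts data."""
    results = {}
    for n in range(1, 10):
        codec = f"iso8859_{n}"
        try:
            results[codec] = data.decode(codec)
        except UnicodeDecodeError:
            pass  # this part leaves one of the bytes undefined
    return results


candidates = plausible_iso8859_parts(b"caf\xe9")
for codec, text in candidates.items():
    # ascii() keeps the output readable on any terminal
    print(codec, ascii(text))
```

Every part succeeds here, and several of them (for example parts 1 and 2)
even produce the identical string, while part 5 yields a Cyrillic letter
in the last position. Choosing between them takes knowledge of the
language or statistics on the text, not just a decode attempt.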

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Zenkaku Language Hacker                            http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"