[I18n-sig] Given a string, how do you know the encoding/charset ?

M.-A. Lemburg mal@lemburg.com
Wed, 22 Nov 2000 10:28:17 +0100

chas wrote:
> Hello to all,
> Please forgive my ignorance but I haven't been able to find
> an answer to what I thought was going to be a basic question:
> Given a string of unknown origin, how does one calculate
> or find its character-set ?  Is there any such module or
> is there any way to use the existing codecs for this ?
> Is the only way to recreate something like
> http://www.mandarintools.com/codeguess.html ?
> And is there really so much need for 'guesswork' ?

Encodings don't have a signature the way e.g. many file types do.

The only way to try to figure out the encoding is to pass the
string through a set of codecs and then proof-read the output
of every codec that did not raise an exception.
There are still some catches in this, though: e.g. the string
may look like plain ASCII while its source actually uses some
superset of ASCII.
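
A minimal sketch of that trial-and-error approach, assuming a
current Python where the undecoded data is a bytes object. The
function name candidate_decodings and the list of candidate
encodings are hypothetical, chosen only for illustration; they
are not part of any standard module.

```python
def candidate_decodings(data, encodings=("ascii", "utf-8", "big5",
                                         "shift_jis", "latin-1")):
    """Return (encoding, text) pairs for each codec that accepts data.

    The candidate list is an assumption; pick the encodings that are
    plausible for your own data source.
    """
    results = []
    for name in encodings:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the byte sequence, so rule it out.
            continue
        # No exception only means the bytes are *valid* in this
        # encoding, not that the result is the intended text.
        results.append((name, text))
    return results


if __name__ == "__main__":
    sample = "Grüße".encode("utf-8")
    for name, text in candidate_decodings(sample):
        # A human still has to proof-read each candidate.
        print(name, repr(text))
```

Note that latin-1 accepts any byte sequence, so it always shows
up as a candidate, and for pure ASCII input every codec listed
above succeeds with identical output. The codecs can rule
encodings out, but they cannot pick the right one for you.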

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/