Detect character encoding
Martin P. Hellwig
mhellwig at xs4all.nl
Sun Dec 4 23:12:44 CET 2005
Mike Meyer wrote:
> "Diez B. Roggisch" <deets at nospam.web.de> writes:
>> Michal wrote:
>>> is there any way to detect string encoding in Python?
>>> I need to process several files. Each of them could be encoded in
>>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>> and encode it to utf-8 (with string function encode).
>> But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
>> file is "legal" in all encodings.
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.
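For what it's worth, the elimination idea can be sketched in a few lines. This is only a rough illustration, not a real detector: the function name plausible_encodings and the default candidate list are my own assumptions, and it uses nothing beyond the standard bytes.decode call.

```python
def plausible_encodings(data, candidates=("utf-8", "cp1250", "iso-8859-2")):
    """Return the candidate encodings that can decode `data` without error.

    `data` must be a bytes object. The candidate list is just an example.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # A byte (or byte sequence) is not valid in this encoding,
            # so the encoding can be crossed off the list.
            continue
        survivors.append(name)
    return survivors
```

Note that Python's iso-8859-2 codec accepts every byte value, so it is never eliminated this way, which fits the remark that elimination doesn't help much by itself.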
I read or heard (can't remember the source) that MS IE has a quite good
implementation of guessing the language and character encoding of web
pages when they're missing or wrongly specified.
From what I can remember, they used an algorithm to build some
statistics for the specific page, compared those with statistics for
all kinds of languages and encodings, and just picked the most likely match.
Please be aware that I don't know if the above has even the slightest
amount of truth in it, however it didn't prevent me from posting anyway ;-)
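To make the statistical idea a bit more concrete, here is a toy sketch of how such a guess could work. It is emphatically not how IE (or any real detector) does it: the names SAMPLE, build_profile and guess_encoding are hypothetical, the reference text is a single Czech sentence rather than a proper corpus, and the scoring (average reference frequency of the decoded characters) is just the simplest thing I could think of.

```python
from collections import Counter

# Hypothetical reference text; a usable profile would need a much larger
# corpus for each language you expect to see.
SAMPLE = "příliš žluťoučký kůň úpěl ďábelské ódy"


def build_profile(text):
    """Map each character to its relative frequency in the reference text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def guess_encoding(data, candidates, profile):
    """Return the candidate whose decoding looks most like the profile.

    Returns None if no candidate can decode `data` (or `data` is empty).
    """
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding is ruled out entirely
        if not text:
            continue
        # Average reference frequency of the decoded characters;
        # characters never seen in the reference text contribute 0.
        score = sum(profile.get(ch, 0.0) for ch in text) / len(text)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

You would call it with something like guess_encoding(raw_bytes, ["utf-8", "cp1250", "iso-8859-2"], build_profile(SAMPLE)). A wrong decoding tends to produce control characters or letters that never occur in the reference text, which drags its score down; on short inputs or similar encodings the guess can easily be wrong.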