Detect character encoding
Nemesis
nemesis at nowhere.invalid
Sun Dec 4 14:45:56 EST 2005
While I was trying to think of a nice intro, "Michal" wrote:
> Hello,
> is there any way to detect string encoding in Python?
> I need to process several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with string function encode).
> Thank you for any answer
Hi,
As you have already heard, you can't be sure, but you can guess.
I use a method like this:
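def guess_encoding(text):
    # Try each candidate in order; keep the first that decodes cleanly.
    for best_enc in guess_list:
        try:
            unicode(text, best_enc, "strict")
        except (UnicodeError, LookupError):
            # Wrong charset, or a codec name Python doesn't know.
            pass
        else:
            break
    # Falls through to the last entry if nothing matched.
    return best_enc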
'guess_list' is an ordered charset name list like this:
['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]
Of course, you can remove charsets you are sure you'll never encounter.
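
For anyone on Python 3, where unicode() no longer exists, here is a rough sketch of the same trial-decoding idea working on bytes. The candidate list, the None return for "nothing matched" and the to_utf8 helper are my own assumptions for illustration, not part of the method above. Keep in mind that order matters: a single-byte charset such as iso-8859-1 or iso-8859-2 accepts any byte sequence in Python's codecs, so whatever comes after it in the list is never reached.

```python
# Hypothetical candidate list, strictest codecs first. "utf-8" is an
# addition for illustration; iso-8859-2 goes last because it decodes
# any byte sequence and therefore acts as a catch-all.
GUESS_LIST = ["us-ascii", "utf-8", "windows-1250", "iso-8859-2"]


def guess_encoding(data, candidates=GUESS_LIST):
    """Return the first candidate that decodes `data` strictly, else None."""
    for enc in candidates:
        try:
            data.decode(enc, "strict")
        except (UnicodeDecodeError, LookupError):
            # Wrong charset, or a codec name Python doesn't know.
            continue
        return enc
    return None


def to_utf8(data, candidates=GUESS_LIST):
    """Decode `data` with the guessed charset and re-encode it as UTF-8."""
    enc = guess_encoding(data, candidates)
    if enc is None:
        raise ValueError("no candidate encoding matched")
    return data.decode(enc).encode("utf-8")
```

Read each file in binary mode (open(path, "rb")) and pass the bytes to to_utf8. A successful decode only means the bytes are legal in that charset, not that the guess is right, so it is still worth eyeballing the result.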
--
This could really be the spark that makes the drop
overflow.
|\ | |HomePage : http://nem01.altervista.org
| \|emesis |XPN (my nr): http://xpn.altervista.org