recycling internationalized garbage
garabik-news-2005-05 at kassiopeia.juls.savba.sk
Wed Mar 8 10:04:59 EST 2006
Fredrik Lundh <fredrik at pythonware.com> wrote:
> "aaronwmail-usenet at yahoo.com" wrote:
>
>> Question: what is a good strategy for taking an 8bit
>> string of unknown encoding and recovering the largest
>> amount of reasonable information from it (translated to
>> utf8 if needed)? The string might be in any of the
>> myriad encodings that predate unicode. Has anyone
>> done this in Python already? The output must be clean
>> utf8 suitable for arbitrary xml parsers.
>
> some alternatives:
>
> braindead bruteforce:
>
> try to do strict decoding as utf-8. if you succeed, you have an utf-8
> string. if not, assume iso-8859-1.
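
That was a mistake I made once.

Do not use iso8859-1 as the Python codec: it maps all 256 byte values, so
decoding never fails and nothing listed after it is ever tried. Instead,
create your own codec that leaves the control characters undefined, called
e.g. iso8859-1-ncc, like this (just a sketch):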
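
    import codecs
    # printable bytes only: control ranges 0x00-0x1F and 0x80-0x9F stay undefined
    decoding_map = codecs.make_identity_dict(
        list(range(32, 128)) + list(range(128+32, 256)))
    decoding_map.update({})  # placeholder for extra mappings (e.g. tab, newline)
    encoding_map = codecs.make_encoding_map(decoding_map)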
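
To spell out how such a pair of maps could become a codec that can be
looked up by name, here is a minimal Python 3 sketch. The names NAME,
ncc_decode, ncc_encode and search are made up for illustration, and
allowing tab, LF and CR is an assumption of this sketch, not part of the
snippet above. The search function folds hyphens to underscores because
newer Python versions normalise codec names that way before calling it.

```python
import codecs

NAME = "iso8859-1-ncc"   # hypothetical codec name, as suggested above

# Identity mapping for printable bytes only: 0x20-0x7F and 0xA0-0xFF.
decoding_map = codecs.make_identity_dict(
    list(range(32, 128)) + list(range(128 + 32, 256)))
# Assumption: tab, LF and CR are acceptable in ordinary text.
decoding_map.update({9: 9, 10: 10, 13: 13})
encoding_map = codecs.make_encoding_map(decoding_map)


def ncc_decode(data, errors="strict"):
    # Bytes missing from decoding_map raise UnicodeDecodeError in strict mode.
    return codecs.charmap_decode(data, errors, decoding_map)


def ncc_encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, encoding_map)


def search(name):
    # Depending on the Python version the name arrives with hyphens or
    # with underscores, so compare both in one normalised form.
    if name.replace("-", "_") == NAME.replace("-", "_"):
        return codecs.CodecInfo(ncc_encode, ncc_decode, name=NAME)
    return None


codecs.register(search)
```

With this registered, b"caf\xe9".decode("iso8859-1-ncc") succeeds, while a
string containing cp1252 curly quotes (bytes 0x93 and 0x94) is rejected
instead of being silently accepted as Latin-1.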

and then use:

    def try_encoding(s, encodings):
        "try to guess the encoding of string s, testing encodings given in second parameter"
        for enc in encodings:
            try:
                s.decode(enc)  # strict decoding; raises on any undefined byte
                return enc
            except UnicodeDecodeError:
                pass
        return None

    guessed_encoding = try_encoding(text, ['utf-8', 'iso8859-1-ncc', 'cp1252', 'macroman'])

It seems to work surprisingly well if you know approximately which
language(s) the text is expected to be in (e.g. for Central European
languages, replace cp1252 with cp1250 and iso8859-1-ncc with iso8859-2-ncc).
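
As a rough illustration of choosing the candidate list by expected
language and ending up with UTF-8, here is a small Python 3 sketch.
CANDIDATES and guess_and_decode are hypothetical names. To keep the block
self-contained it uses only stock codecs in place of the -ncc variants;
since the stock iso8859-N codecs accept every byte, they can only go last,
as a catch-all.

```python
# Hypothetical candidate lists, most restrictive codec first.
CANDIDATES = {
    "western": ["utf-8", "cp1252", "iso8859-1"],
    "central_european": ["utf-8", "cp1250", "iso8859-2"],
}


def guess_and_decode(data, language_group):
    """Return (encoding, text) for the first candidate that decodes data strictly."""
    for enc in CANDIDATES[language_group]:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None


raw = "Žltý kôň".encode("cp1250")        # stand-in for bytes of unknown origin
enc, text = guess_and_decode(raw, "central_european")
clean_utf8 = text.encode("utf-8")        # re-encode for an XML parser
print(enc, clean_utf8)
```

Running the same bytes through the "western" list also succeeds, but yields
the wrong letters, which is why a rough idea of the language matters.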
--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!