[Tutor] encode

Python python at venix.com
Wed Apr 19 05:41:55 CEST 2006


On Wed, 2006-04-19 at 10:10 +0700, kakada wrote:
> Hi again folks,
> 
> I wonder if we can check the encoding of text in one text file.
> user is free to encode the file whether Latin1, utf-8, ANSI...

> Any ideas?

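# Codec names passed to bytes.decode(); the order matters (see docstring).
ASCII, UTF8, LATIN = 'ascii', 'utf-8', 'latin-1'

def decode_file(filepath):
    '''Return (text, charset) for the first codec that decodes the file.

    Order of codecs is important.
    ASCII is most restrictive to decode - no byte values > 127.
    UTF8 is next most restrictive.  There are illegal byte values and illegal sequences.
    LATIN will accept anything since all 256 byte values are OK.
    The final decision still depends on human inspection.
    '''
    with open(filepath, 'rb') as f:
        buff = f.read()
    for charset in (ASCII, UTF8, LATIN):
        try:
            unistr = buff.decode(charset, 'strict')
        except UnicodeDecodeError:
            pass    # not valid in this codec; try the next one
        else:
            break
    else:
        # Reached only if every codec fails, which cannot happen
        # while LATIN is in the list.
        unistr, charset = u'', None
    return unistr, charset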
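
To see why the order matters, here is a minimal sketch of the same
try-in-order idea applied to in-memory bytes rather than a file.  The
names guess_charset, CODECS and samples are hypothetical, made up for
this illustration.

```python
# Hypothetical helper: same codec order as above, but on in-memory bytes.
CODECS = ('ascii', 'utf-8', 'latin-1')

def guess_charset(data):
    """Return the first codec in CODECS that decodes data without error."""
    for charset in CODECS:
        try:
            data.decode(charset, 'strict')
        except UnicodeDecodeError:
            continue
        return charset
    return None  # not reachable while 'latin-1' is in CODECS

samples = {
    'plain': b'hello',                        # pure ASCII
    'utf8': u'caf\xe9'.encode('utf-8'),       # b'caf\xc3\xa9'
    'latin1': u'caf\xe9'.encode('latin-1'),   # b'caf\xe9', not valid UTF-8
}
results = {name: guess_charset(data) for name, data in samples.items()}

# Latin-1 never fails, so trying it first would "succeed" on UTF-8 bytes
# and silently produce the wrong text.
mojibake = samples['utf8'].decode('latin-1')
```

Because Latin-1 accepts the UTF-8 bytes too (yielding two wrong
characters in place of one accented letter), it has to come last.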

Also note that the Unicode character
	u'\ufffd'
(REPLACEMENT CHARACTER) is an error placeholder.  It decodes cleanly from
UTF8 input, so finding it in the decoded text means an earlier processing
step already hit undecodable bytes and replaced them.
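
A small sketch of how that placeholder can end up in an otherwise valid
UTF-8 file.  The sample bytes and variable names are assumptions chosen
for illustration.

```python
raw = b'caf\xe9'                          # Latin-1 bytes, not valid UTF-8

# An earlier, lossy step: undecodable bytes are replaced instead of raising.
lossy = raw.decode('utf-8', 'replace')
saved = lossy.encode('utf-8')             # what that step would write to disk

# A later strict decode succeeds, but the damage is still visible.
text = saved.decode('utf-8', 'strict')
damaged = u'\ufffd' in text
```

Checking the decoded text for u'\ufffd' is a cheap way to flag such
files for human inspection.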


DO NOT USE THIS CODE BLINDLY.  It simply offers a reasonable first cut
when those are the likely encodings.  It is impossible to distinguish
the various LATINx encodings by simply looking at the bytes.  All 8 bit
byte values are valid, but their meanings change based on the encoding used.
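
As a concrete illustration of that ambiguity, the sketch below decodes
one byte with two of Python's standard codecs.  The choice of byte and
of codecs is mine, purely as an example.

```python
data = b'\xa4'

# Both decodes succeed; nothing in the byte itself says which is right.
decoded = {name: data.decode(name) for name in ('latin-1', 'iso8859-15')}

as_latin1 = decoded['latin-1']      # U+00A4 CURRENCY SIGN
as_latin9 = decoded['iso8859-15']   # U+20AC EURO SIGN
```

Only knowledge of where the file came from, or a human reading the
result, can settle which one was intended.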

> 
> Thx
> 
> da
-- 
Lloyd Kvam
Venix Corp


