Determining the encoding of a text file

Skip Montanaro skip at pobox.com
Mon Mar 1 09:08:51 EST 2004


    rajorshi> How do I determine the encoding of a text file ? That is,
    rajorshi> given a text file I want to know the encoding it is in UTF8 or
    rajorshi> UTF16 or Latin etc. It would be very helpful if you could tell
    rajorshi> me how to do this in python on Linux. But just the method is
    rajorshi> acceptable.

In general this is not possible.  You can guess using heuristics, but there is
no predefined file attribute that indicates a file's encoding.
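
One heuristic that is cheap to try first is looking for a Unicode byte order
mark at the start of the data.  The following is only a sketch: the function
name sniff_bom and the table _BOMS are made up for illustration, and since
many files carry no BOM at all, a None result tells you nothing.

```python
import codecs

# (BOM bytes, codec name) pairs.  The UTF-32 marks are listed before the
# UTF-16 ones because the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

To use it on a file, read the first few bytes in binary mode, e.g.
sniff_bom(open(path, "rb").read(4)), where path is whatever file you are
inspecting.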

If you have a small set of candidate encodings you can generally do a decent
job guessing the encoding of a string by considering them in order.  I placed
an example on my Python Bits page: <http://www.musi-cal.com/~skip/python/>.  I
don't claim it's perfect, and it's really only concerned with distinguishing
utf-8 from a few encodings which are similar to iso-8859-1, but it does a
decent job for me given the types of inputs I see.
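
The idea of considering candidates in order can be sketched roughly as
follows.  This is not the code from the page above, just a minimal
illustration; the name guess_encoding and the default candidate list are
assumptions.  Order matters: iso-8859-1 maps every possible byte, so it never
fails to decode and has to go last.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return (encoding, text) for the first candidate that decodes data.

    data is a bytes object.  Candidates are tried strictly in order, so put
    the most restrictive encodings first and any catch-all encoding last.
    """
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")
```

A successful decode only means the bytes are valid in that encoding, not that
the encoding is the one the author used, so the result is still a guess.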

Skip



