Determining the encoding of a text file
skip at pobox.com
Mon Mar 1 15:08:51 CET 2004
rajorshi> How do I determine the encoding of a text file ? That is,
rajorshi> given a text file I want to know the encoding it is in UTF8 or
rajorshi> UTF16 or Latin etc. It would be very helpful if you could tell
rajorshi> me how to do this in python on Linux. But just the method is
In general this is not possible. You can guess using heuristics, but there is
no predefined file attribute that indicates a file's encoding.
If you have a small set of candidate encodings you can generally do a decent
job guessing the encoding of a string by considering them in order. I placed
an example on my Python Bits page: <http://www.musi-cal.com/~skip/python/>. I
don't claim it's perfect and it's really only concerned with distinguishing
utf-8 and a few encodings which are similar to iso-8859-1, but it does a
decent job for me given the types of inputs I see.
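
To make "considering them in order" concrete, here is a minimal sketch of the idea. It is not the code from the Python Bits page; the function name guess_encoding and the default candidate list are assumptions chosen for illustration, and it is written for Python 3, where the input is a bytes object. Strict candidates go first, and iso-8859-1 goes last because it accepts any byte sequence and so can never fail.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes data.

    data is a bytes object.  The candidate order matters: put the
    strictest encodings first and a catch-all such as iso-8859-1 last,
    since it maps every possible byte to some character.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

To use it on a file, open the file in binary mode ("rb") and pass the result of read() to the function. A successful decode only means the bytes are valid in that encoding, not that the guess is right, so this remains a heuristic.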