Determining the encoding of a text file
Skip Montanaro
skip at pobox.com
Mon Mar 1 09:08:51 EST 2004
rajorshi> How do I determine the encoding of a text file ? That is,
rajorshi> given a text file I want to know the encoding it is in UTF8 or
rajorshi> UTF16 or Latin etc. It would be very helpful if you could tell
rajorshi> me how to do this in python on Linux. But just the method is
rajorshi> acceptable.
In general this is not possible. You can guess using heuristics, but there is
no predefined file attribute that indicates a file's encoding.
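One heuristic that is cheap to try first is looking for a Unicode byte order mark at the start of the data. What follows is only a sketch: the name sniff_bom and the table of marks are made up for illustration, and many files (most UTF-8 files on Linux, for instance) carry no mark at all, so a None result tells you nothing.

```python
import codecs

# UTF-32 marks are checked before UTF-16 ones because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if the bytes start with a known BOM, else None.

    Hypothetical helper for illustration; a missing BOM proves nothing.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

The returned names are codecs that consume the mark themselves when decoding, so data.decode(sniff_bom(data)) should not leave a stray U+FEFF at the front of the text.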
If you have a small set of candidate encodings you can generally do a decent
job guessing the encoding of a string by trying them in order. I placed
an example on my Python Bits page: <http://www.musi-cal.com/~skip/python/>. I
don't claim it's perfect, and it's really only concerned with distinguishing
utf-8 from a few encodings which are similar to iso-8859-1, but it does a
decent job for me given the types of inputs I see.
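A minimal sketch of the try-them-in-order idea follows. It is not the code from the page linked above; the function name guess_encoding and the default candidate list are assumptions chosen for illustration. The order matters: strict encodings go first, and iso-8859-1 goes last because every byte sequence decodes under it, so it never fails and acts as a catch-all.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes the bytes cleanly.

    Hypothetical helper: a successful decode only means the bytes are
    valid in that encoding, not that the text was written in it.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None  # only reachable if no catch-all encoding is listed
```

To use it on a file, read the contents in binary mode, e.g. open(path, "rb").read(), and pass the resulting bytes in. A result of "iso-8859-1" here really means "not ascii and not valid utf-8", which is about as much as this kind of guessing can tell you.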
Skip