On 15/02/2012 6:51pm, Paul Moore wrote:
> Task: Process data in a file whose encoding I don't know
> Unicode Understanding Needed: Medium-Low
> Unicode Correctness: High
> Approach: Use external tools to identify the encoding, then simply
> specify it when opening the file. On Unix, "file -i FILENAME" will
> attempt to detect the encoding, on Windows, XXX. If, and only if,
> this approach doesn't identify the encoding clearly, then the other
> options allow you to do the best you can.

Don't recommend "file -i". I just tried it on the files in
/usr/share/libtextcat/ShortTexts/. Basically, everything is identified
as us-ascii, iso-8859-1 or unknown-8bit. Examples:

chinese-big5.txt: text/plain; charset=iso-8859-1
chinese-gb2312.txt: text/plain; charset=iso-8859-1
japanese-euc_jp.txt: text/plain; charset=iso-8859-1
korean.txt: text/plain; charset=iso-8859-1
arabic-windows1256.txt: text/plain; charset=iso-8859-1
georgian.txt: text/plain; charset=iso-8859-1
greek-iso8859-7.txt: text/plain; charset=iso-8859-1
hebrew-iso8859_8.txt: text/plain; charset=iso-8859-1
russian-windows1251.txt: text/plain; charset=iso-8859-1
ukrainian-koi8_r.txt: text/plain; charset=iso-8859-1

sbt
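
To illustrate why an "iso-8859-1" answer carries so little information, here is a minimal Python sketch. It does not reproduce how "file" reaches its guess (that is not checked here). It only shows that every possible byte value is valid in ISO-8859-1, so decoding with it never fails, whatever the real encoding is. The sample string and the helper name decodes_cleanly are made up for the example and are not taken from the libtextcat files.

```python
# Hypothetical sample: a short Chinese string encoded as Big5, standing in
# for a file like chinese-big5.txt (these are NOT the real file contents).
original = "中文測試"
data = original.encode("big5")


def decodes_cleanly(raw: bytes, encoding: str) -> bool:
    """Return True if raw can be decoded with encoding without an error."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# ISO-8859-1 assigns a character to all 256 byte values, so this succeeds
# for any input and yields mojibake rather than an error.
as_latin1 = data.decode("iso-8859-1")

# Decoding with the encoding the data was really written in recovers the text.
as_big5 = data.decode("big5")

print(decodes_cleanly(data, "iso-8859-1"))  # success tells us nothing
print(decodes_cleanly(data, "ascii"))       # fails: bytes above 0x7F present
print(as_latin1 == original)                # wrong text, but no exception
print(as_big5 == original)
```

So a clean decode as ISO-8859-1 is no evidence that the guess is right; the only real check is whether the decoded text makes sense to someone who can read it.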