[Python-ideas] Python 3000 TIOBE -3%

Wed Feb 15 20:53:16 CET 2012

On 15/02/2012 6:51pm, Paul Moore wrote:
> Task: Process data in a file whose encoding I don't know
> Unicode Understanding Needed: Medium-Low
> Unicode Correctness: High
> Approach: Use external tools to identify the encoding, then simply
> specify it when opening the file. On Unix, "file -i FILENAME" will
> attempt to detect the encoding, on Windows, XXX. If, and only if, this
> approach doesn't identify the encoding clearly, then the other options
> allow you to do the best you can.

Don't recommend "file -i".

I just tried it on the files in /usr/share/libtextcat/ShortTexts/.
Basically, everything is identified as us-ascii, iso-8859-1 or
unknown-8bit, whatever the real encoding is.

Examples:

chinese-big5.txt:        text/plain; charset=iso-8859-1
chinese-gb2312.txt:      text/plain; charset=iso-8859-1
japanese-euc_jp.txt:     text/plain; charset=iso-8859-1
korean.txt:              text/plain; charset=iso-8859-1

arabic-windows1256.txt:  text/plain; charset=iso-8859-1
georgian.txt:            text/plain; charset=iso-8859-1
greek-iso8859-7.txt:     text/plain; charset=iso-8859-1
hebrew-iso8859_8.txt:    text/plain; charset=iso-8859-1
russian-windows1251.txt: text/plain; charset=iso-8859-1
ukrainian-koi8_r.txt:    text/plain; charset=iso-8859-1
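
Part of why a guess of iso-8859-1 says so little: every byte value maps to
some ISO-8859-1 character, so any 8-bit file decodes "successfully" under it,
just to the wrong text. A minimal Python sketch of that point follows. The
sample strings and the check_latin1 helper are made up for illustration; they
are not the contents of the libtextcat files.

```python
# Illustrative sample strings (assumptions, not the libtextcat file contents).
SAMPLES = {
    "big5": "中文",
    "euc_jp": "日本語",
    "cp1256": "مرحبا",
    "iso8859_7": "ελληνικά",
    "cp1251": "привет",
    "koi8_r": "привет",
}


def check_latin1(text, encoding):
    """Encode text in its real encoding, then decode the bytes as ISO-8859-1.

    Returns (decoded_without_error, decoded_text_matches_original).
    """
    raw = text.encode(encoding)
    try:
        decoded = raw.decode("iso-8859-1")
    except UnicodeDecodeError:
        return (False, False)
    return (True, decoded == text)


results = {enc: check_latin1(text, enc) for enc, text in SAMPLES.items()}

for enc, (decoded_ok, matches) in results.items():
    print(f"{enc:10} decodes as iso-8859-1: {decoded_ok}, text preserved: {matches}")
```

Each sample decodes without an error and none comes back as the original
text, so a clean decode as iso-8859-1 is no evidence that the guess was right.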

sbt