[Python-ideas] Python 3000 TIOBE -3%
shibturn
shibturn at gmail.com
Wed Feb 15 20:53:16 CET 2012
On 15/02/2012 6:51pm, Paul Moore wrote:
> Task: Process data in a file whose encoding I don't know
> Unicode Understanding Needed: Medium-Low
> Unicode Correctness: High
> Approach: Use external tools to identify the encoding, then simply
> specify it when opening the file. On Unix, "file -i FILENAME" will
> attempt to detect the encoding, on Windows, XXX. If, and only if, this
> approach doesn't identify the encoding clearly, then the other options
> allow you to do the best you can.
Don't recommend "file -i".
I just tried it on the sample files in /usr/share/libtextcat/ShortTexts/.
Basically, every file is identified as us-ascii, iso-8859-1 or unknown-8bit,
whatever its real encoding.
Examples:
chinese-big5.txt: text/plain; charset=iso-8859-1
chinese-gb2312.txt: text/plain; charset=iso-8859-1
japanese-euc_jp.txt: text/plain; charset=iso-8859-1
korean.txt: text/plain; charset=iso-8859-1
arabic-windows1256.txt: text/plain; charset=iso-8859-1
georgian.txt: text/plain; charset=iso-8859-1
greek-iso8859-7.txt: text/plain; charset=iso-8859-1
hebrew-iso8859_8.txt: text/plain; charset=iso-8859-1
russian-windows1251.txt: text/plain; charset=iso-8859-1
ukrainian-koi8_r.txt: text/plain; charset=iso-8859-1
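
A rough sketch of why an "iso-8859-1" answer carries so little information:
Python's iso-8859-1 codec maps every byte value to a character, so any 8-bit
file decodes under it without error. The sample string and the helper name
decodes_cleanly below are made up for illustration; they are not the contents
of the libtextcat files, and this is not a claim about how "file" itself
reaches its verdict.

```python
# Hypothetical sample text (the two characters U+4E2D U+6587), standing in
# for a Big5-encoded file. Not taken from the libtextcat samples.
SAMPLE = "\u4e2d\u6587"


def decodes_cleanly(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes under encoding without raising."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


raw = SAMPLE.encode("big5")

# Every possible byte value decodes under iso-8859-1, so a successful
# decode there cannot confirm that the guess is right.
all_bytes_ok = decodes_cleanly(bytes(range(256)), "iso-8859-1")

as_latin1 = raw.decode("iso-8859-1")  # succeeds, but yields mojibake
as_big5 = raw.decode("big5")          # recovers the original text

print(all_bytes_ok)
print(repr(as_latin1))
print(repr(as_big5))
```

Decoding with the wrong single-byte codec fails silently: the result has one
character per input byte and no error is raised, which is why the encoding
has to come from somewhere more reliable than a trial decode.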
sbt