[Chicago] Chardet help
Martin Maney
maney at two14.net
Sun Mar 10 16:51:39 CET 2013
On Sun, Mar 10, 2013 at 10:00:43AM -0500, Tathagata Dasgupta wrote:
> def getEncoding(infile):
>     import chardet
>     rawdata = open(infile, "r").read()
>     result = chardet.detect(rawdata)
>     charenc = result['encoding']
>     print charenc
>
> That gives me ISO-8859-2.
That may be the problem. Why would Italian text be encoded in the
Central European character set? From a quick look at the raw data in
the browser, 8859-2 is obviously incorrect. 8859-1 looks better! In
fact, it looks better than 8859-3, the Southern European variant.
Guessing what encoding a text is in is always a pain. I don't know
what that chardet is, but from the results it appears to be less than
reliable.
Caveat: my guess is based on which encodings leave "unknown code point"
blobs and/or accent marks which I'm fairly sure Italian doesn't use.
But I have no Italian, myself.
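A rough sketch of that kind of eyeball check, done in code instead of
in the browser. It assumes Python 3; the helper names (count_suspects,
rank_encodings) and the set of accented letters are made up for
illustration and are not an authoritative list of what Italian uses.

```python
# Hypothetical helpers: decode the same bytes under several candidate
# encodings and count the characters that look wrong for Italian text.

# Assumption: accented letters one would expect in Italian.
ITALIAN_ACCENTS = set("àèéìíîòóùúÀÈÉÌÒÙ")


def count_suspects(raw, encoding, expected=ITALIAN_ACCENTS):
    """Return (unknown, odd) counts for raw bytes decoded as `encoding`.

    unknown: bytes the codec could not map (shown as U+FFFD)
    odd:     non-ASCII letters outside the expected set
    """
    text = raw.decode(encoding, errors="replace")
    unknown = text.count("\ufffd")
    odd = sum(
        1
        for ch in text
        if ord(ch) > 127 and ch.isalpha() and ch not in expected
    )
    return unknown, odd


def rank_encodings(raw, candidates=("iso-8859-1", "iso-8859-2", "iso-8859-3")):
    """Sort candidates from fewest to most suspicious characters."""
    return sorted(candidates, key=lambda enc: sum(count_suspects(raw, enc)))


# "Perché la città è più bella" as ISO-8859-1 bytes.
sample = b"Perch\xe9 la citt\xe0 \xe8 pi\xf9 bella"
for enc in rank_encodings(sample):
    print(enc, count_suspects(sample, enc))
```

A short sample like this may not separate 8859-1 from 8859-3, since the
two agree on the common Italian accented letters; a tie only says that
neither is ruled out.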
--
The dualist evades the frame problem - but only because
dualism draws the veil of mystery and obfuscation
over all the tough how-questions -- Daniel C. Dennett