[Tutor] Unknown encoded file types.

Sun Feb 7 17:24:51 EST 2021

On 08Feb2021 07:54, DL Neil <PyTutor at DancesWithMice.info> wrote:
>On 08/02/2021 02.07, Alan Gauld via Tutor wrote:
>> The good news is that there is a finite number of such encodings and
>> it's not impossible: given you have a relatively small set of files (in
>> computer terms), you could just try each decoding in turn and display, say,
>> the first 10 lines for all that don't give an error. That would let
>> you select what looks like the best for each file.
>>
>> encodings = ['ascii','utf8','utf16', 'latin8539', 'latin..... etc]
>>
>> for file in files:
>>     worked = False
>>     for encoding in encodings:
>>         try:
>>            decode file
>>            display filename + encoding + 10 lines
>>            worked = True
>>         except UnicodeError: continue
>>     if not worked: print("no encoding worked for " + file)
>>
>> Just a thought.
>
>
>Whilst @Mark follows the above post with a relatively-automated
>suggestion (will still require manual inspection), I was going to
>suggest something similar to this idea of a test-bed routine ranging
>across a set of 'likely' encodings until you found happiness.
>
>This because you must surely have some hint of an idea of the source of
>the files, eg if they have come from a German, Danish, ... user.
>Accordingly, running through a range of the ISO 8859 variants which were
>employed by MS-Win OpSys may yield one choice which works without 
>error.

Aye.

We've got an importer for some CSV data in a current project with 
exactly that problem, and exactly that heuristic:

    for encoding in 'utf-8', 'windows-1252', 'cp932':
        try:
            return line.decode(encoding)
        except UnicodeDecodeError:
            pass
    warning(
        '%r, line %d: cannot decode line %r, falling back to iso8859-1',
        self.filename, self.lineno, line
    )
    return line.decode('iso8859-1')

The choice of encodings above is entirely parochial to our source data, 
and even so some lines still hit the fallback.
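
For anyone wanting to try the same heuristic outside our code base, here is 
a minimal standalone sketch. The function name decode_line, its parameters, 
and the use of the logging module are assumptions made for illustration; 
the candidate list is simply the one from the snippet above and should be 
tuned to your own data.

```python
import logging

# Candidate encodings, tried in order. Copied from the snippet above;
# adjust for your own source data.
CANDIDATES = ('utf-8', 'windows-1252', 'cp932')


def decode_line(line, filename='<unknown>', lineno=0):
    """Decode a bytes line, returning (text, encoding_used).

    Hypothetical helper: tries each candidate in turn and falls back to
    iso8859-1, which accepts every byte value.
    """
    for encoding in CANDIDATES:
        try:
            return line.decode(encoding), encoding
        except UnicodeDecodeError:
            pass  # try the next candidate
    logging.warning(
        '%r, line %d: cannot decode line %r, falling back to iso8859-1',
        filename, lineno, line
    )
    return line.decode('iso8859-1'), 'iso8859-1'
```

Returning the encoding alongside the text is a design choice of this 
sketch, so that the caller can see which guess won for each line.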

Sean, note that in the above code:

Successful decoding DOES NOT mean the correct encoding was used, as 
various byte sequences can decode in multiple encodings, yielding 
different outcomes. See: https://en.wikipedia.org/wiki/Mojibake
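
A small illustration of that point, using the euro sign purely as an 
example of my own choosing: the same three bytes decode without error in 
two encodings, but only one result is what the writer meant.

```python
# The euro sign, encoded as UTF-8, is three bytes.
data = '\u20ac'.encode('utf-8')

as_utf8 = data.decode('utf-8')            # the intended character
as_cp1252 = data.decode('windows-1252')   # also "succeeds", but is mojibake

print(repr(data), repr(as_utf8), repr(as_cp1252))
```

Neither decode raises, so an exception-driven loop cannot tell them apart; 
only the order of the candidates (or a human looking at the output) decides.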

The fallback above ('iso8859-1', ISO Latin 1) is an 8-bit encoding where 
all bytes are 1-to-1 with the target ordinal. Like _any_ 8-bit character 
set that assigns a character to every byte value (several of the ISO8859 
family do, though not all), it will _always_ decode successfully because 
every byte is accepted. That doesn't mean it is correct.
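
If you want to convince yourself of the 1-to-1 property of 'iso8859-1', 
a quick check along these lines should do (the variable names are just 
for illustration):

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding never raises for iso8859-1 ...
text = all_bytes.decode('iso8859-1')

# ... and each byte maps to the code point with the same number.
ordinals = [ord(c) for c in text]
same_ordinals = ordinals == list(range(256))

# Encoding back gives the original bytes, so nothing is lost.
round_trips = text.encode('iso8859-1') == all_bytes

print(same_ordinals, round_trips)
```

That lossless round trip is what makes it a safe last resort: the text may 
look wrong, but the original bytes can still be recovered from it.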

Cheers,
Cameron Simpson <cs at cskk.id.au>