[Tutor] Unknown encoded file types.

Wed Feb 10 05:47:23 EST 2021

All,

Thank you for your assistance. After more investigation, I found some
unusual characters in the files which look like French or a similar language.
So I will play with your kind code samples and libraries to see which
encoding is being used.

Sean 
-----Original Message-----
From: Tutor <tutor-bounces+mhysnm1964=gmail.com at python.org> On Behalf Of
Cameron Simpson
Sent: Monday, 8 February 2021 9:25 AM
To: tutor at python.org
Subject: Re: [Tutor] Unknown encoded file types.

On 08Feb2021 07:54, DL Neil <PyTutor at DancesWithMice.info> wrote:
>On 08/02/2021 02.07, Alan Gauld via Tutor wrote:
>> The good news is that there is a finite number of such encodings, so
>> it's not impossible. Given you have a relatively small set of files (in
>> computer terms), you could just try each decoding in turn and display,
>> say, the first 10 lines for all that don't give an error. That would
>> let you select what looks like the best for each file.
>>
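>> encodings = ['ascii', 'utf-8', 'utf-16', 'latin-1']  # ... etc
>>
>> for file in files:  # files: your list of file paths
>>     with open(file, 'rb') as f:
>>         data = f.read()
>>     worked = False
>>     for encoding in encodings:
>>         try:
>>             text = data.decode(encoding)
>>         except UnicodeDecodeError:
>>             continue
>>         worked = True
>>         # display filename + encoding + first 10 lines
>>         print(file, encoding, *text.splitlines()[:10], sep='\n')
>>     if not worked:
>>         print('no encoding worked for ' + file)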
>>
>> Just a thought.
>
>
>Whilst @Mark follows the above post with a relatively-automated 
>suggestion (will still require manual inspection), I was going to 
>suggest something similar to this idea of a test-bed routine ranging 
>across a set of 'likely' encodings until you found happiness.
>
>This because you must surely have some hint of an idea of the source of 
>the files, eg if they have come from a German, Danish, ... user.
>Accordingly, running through a range of the ISO 8859 variants which 
>were employed by MS-Win OpSys may yield one choice which works without 
>error.

Aye.

We've got an importer for some CSV data in a current project with exactly
that problem, and exactly that heuristic:

    for encoding in 'utf-8', 'windows-1252', 'cp932':
        try:
            return line.decode(encoding)
        except UnicodeDecodeError:
            pass
    warning(
        '%r, line %d: cannot decode line %r, falling back to iso8859-1',
        self.filename, self.lineno, line
    )
    return line.decode('iso8859-1')

The choice of encodings above is entirely parochial to our source data, and
still hits the fallback.
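
The snippet above is lifted from inside a method, so it will not run on its
own. A minimal self-contained sketch of the same heuristic might look like
the following. The names decode_line and CANDIDATES, the default arguments,
and the use of the standard logging module are assumptions for illustration,
not the project's actual code.

```python
import logging

# Candidate encodings, tried in order. This list mirrors the snippet above
# and is an assumption: adapt it to whatever your own data is likely to be.
CANDIDATES = ('utf-8', 'windows-1252', 'cp932')


def decode_line(line, filename='<unknown>', lineno=0):
    """Decode a bytes line with the first candidate that does not raise.

    Falls back to iso8859-1, which accepts every byte, if all candidates fail.
    """
    for encoding in CANDIDATES:
        try:
            return line.decode(encoding)
        except UnicodeDecodeError:
            pass
    logging.warning(
        '%r, line %d: cannot decode line %r, falling back to iso8859-1',
        filename, lineno, line,
    )
    return line.decode('iso8859-1')
```

Order matters here: the strictest encoding (utf-8) goes first, because the
more permissive ones will happily accept bytes that were really UTF-8.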

Sean, note that in the above code:

Successful decoding DOES NOT mean the correct encoding was used, as various
byte sequences can decode in multiple encodings, yielding different
outcomes. See: https://en.wikipedia.org/wiki/Mojibake

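The fallback above ('iso8859-1', ISO Latin 1) is an 8-bit encoding where all
bytes are 1-to-1 with the target ordinal. It will _always_ decode
successfully because every byte is accepted. The same holds for any ISO8859
character set that assigns all 256 byte values, though some (ISO8859-6, for
example) leave bytes undefined and can still fail. Success there doesn't
mean the result is correct.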
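
Both points can be checked in a few lines. This is only an illustrative
sketch: the sample word and the variable names are made up for the example.

```python
# UTF-8 bytes for a short French word: b'caf\xc3\xa9'
raw = 'café'.encode('utf-8')

# Both decodes succeed without error, but only the first gives back
# what was written. The second is mojibake.
right = raw.decode('utf-8')
wrong = raw.decode('iso8859-1')
print(repr(right), repr(wrong))

# iso8859-1 accepts every possible byte value and maps byte N to
# code point N, so it can never raise UnicodeDecodeError.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('iso8859-1')
print([ord(c) for c in decoded] == list(range(256)))
```

So a clean decode only tells you the bytes were legal in that encoding; you
still have to look at the text to judge whether it is the right one.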

Cheers,
Cameron Simpson <cs at cskk.id.au>