[Tutor] Unknown encoded file types.
Cameron Simpson
cs at cskk.id.au
Sun Feb 7 17:24:51 EST 2021
On 08Feb2021 07:54, DL Neil <PyTutor at DancesWithMice.info> wrote:
>On 08/02/2021 02.07, Alan Gauld via Tutor wrote:
>> The good news is that there is a finite number of such encodings and
>> it's not impossible, given you have a relatively small set of files (in
>> computer terms) you could just try each decoding in turn and display say
>> the first 10 lines for all that don't give an error. That would let
>> you select what looks like the best for each file.
>>
>> encodings = ['ascii','utf8','utf16', 'latin8539', 'latin..... etc]
>>
>> for file in files:
>>     for encoding in encodings
>>         try:
>>             decode file
>>             display filename + encoding + 10 lines
>>         except: continue
>>     else: print " no encoding worked for ' + file
>>
>> Just a thought.
>
>
>Whilst @Mark follows the above post with a relatively-automated
>suggestion (will still require manual inspection), I was going to
>suggest something similar to this idea of a test-bed routine ranging
>across a set of 'likely' encodings until you found happiness.
>
>This because you must surely have some hint of an idea of the source of
>the files, eg if they have come from a German, Danish, ... user.
>Accordingly, running through a range of the ISO 8859 variants which were
>employed by MS-Win OpSys may yield one choice which works without
>error.

Aye.

We've got an importer for some CSV data in a current project with
exactly that problem, and exactly that heuristic:

    for encoding in 'utf-8', 'windows-1252', 'cp932':
        try:
            return line.decode(encoding)
        except UnicodeDecodeError:
            pass
    warning(
        '%r, line %d: cannot decode line %r, falling back to iso8859-1',
        self.filename, self.lineno, line
    )
    return line.decode('iso8859-1')

The choice of encodings above is entirely parochial to our source data,
and still hits the fallback.
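
For anyone who wants to try the heuristic outside our importer, here is a
minimal self-contained sketch. The names decode_line and
CANDIDATE_ENCODINGS are invented for illustration, and the stdlib logging
module stands in for our own warning() helper, so treat it as an
approximation rather than our actual code.

```python
import logging

# Candidate encodings, tried in order. This list is an assumption and
# should be tuned to wherever your files actually come from.
CANDIDATE_ENCODINGS = ('utf-8', 'windows-1252', 'cp932')


def decode_line(line, filename='<unknown>', lineno=0):
    """Decode a bytes line with the first candidate encoding that works."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return line.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Nothing worked: warn, then use an encoding that accepts every byte.
    logging.warning(
        '%r, line %d: cannot decode line %r, falling back to iso8859-1',
        filename, lineno, line
    )
    return line.decode('iso8859-1')
```

Order matters here: the strictest encoding (utf-8) goes first, since the
looser ones will accept far more byte sequences.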

Sean, note that in the above code:

Successful decoding DOES NOT mean the correct encoding was used, as
various byte sequences can decode in multiple encodings, yielding
different outcomes. See: https://en.wikipedia.org/wiki/Mojibake

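
To make that concrete, here is a small sketch using a made-up sample
(the Japanese word for "test", encoded as cp932) and a hypothetical
helper, first_successful_decode. With the same encoding order as above,
windows-1252 decodes the bytes without complaint, so cp932 is never
tried and the result is mojibake.

```python
# Sample data: the word "テスト" encoded as cp932 (Shift-JIS).
data = 'テスト'.encode('cp932')


def first_successful_decode(raw, encodings):
    """Return (encoding, text) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None, None


# Same order as the importer: windows-1252 succeeds before cp932 is tried.
guess = first_successful_decode(data, ('utf-8', 'windows-1252', 'cp932'))
# What we would get if we already knew the right encoding.
truth = first_successful_decode(data, ('cp932',))

print(guess)
print(truth)
```

No exception is raised in either case, which is exactly why a human
still has to look at the output.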
The fallback above ('iso8859-1', ISO Latin 1) is an 8-bit encoding where
every byte maps 1-to-1 to the code point with the same ordinal. It will
_always_ decode successfully because every byte is accepted (most of the
other ISO8859 character sets behave the same way, though a few leave some
byte values unassigned). That doesn't mean it is correct.

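
A quick way to convince yourself of the iso8859-1 behaviour is to decode
all 256 possible byte values at once. This is only an illustrative check;
the variable names are arbitrary.

```python
# Every possible byte value, 0x00 through 0xff.
every_byte = bytes(range(256))

# Decoding never raises: each byte becomes the code point with the
# same ordinal.
text = every_byte.decode('iso8859-1')
ordinals = [ord(c) for c in text]

# Encoding back gives the original bytes, so nothing was lost, but
# nothing was validated either.
round_trip = text.encode('iso8859-1')

print(ordinals == list(range(256)))
print(round_trip == every_byte)
```

That lossless round trip is what makes it a safe last resort: you can
always recover the original bytes later and decode them properly once
you know the real encoding.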
Cheers,
Cameron Simpson <cs at cskk.id.au>