[Tutor] Unknown encoded file types.
mhysnm1964 at gmail.com
Wed Feb 10 05:47:23 EST 2021
All,
Thank you for your assistance. After more investigation, I found some
unusual characters in the files which look like French or a similar language.
So I will play with your kind code samples and libraries to see what is
being used.
Sean
-----Original Message-----
From: Tutor <tutor-bounces+mhysnm1964=gmail.com at python.org> On Behalf Of
Cameron Simpson
Sent: Monday, 8 February 2021 9:25 AM
To: tutor at python.org
Subject: Re: [Tutor] Unknown encoded file types.
On 08Feb2021 07:54, DL Neil <PyTutor at DancesWithMice.info> wrote:
>On 08/02/2021 02.07, Alan Gauld via Tutor wrote:
>> The good news is that there is a finite number of such encodings and
>> its not impossible, given you have a relatively small set of files(in
>> computer terms) you could just try each decoding in turn and display
>> say the first 10 lines for all that don't give an error. That would
>> let you select what looks like the best for each file.
>>
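>> encodings = ['ascii', 'utf-8', 'utf-16', 'iso8859-1']  # ... etc
>>
>> for file in files:
>>     with open(file, 'rb') as f:
>>         data = f.read()
>>     worked = False
>>     for encoding in encodings:
>>         try:
>>             text = data.decode(encoding)
>>         except UnicodeDecodeError:
>>             continue
>>         worked = True
>>         # display filename + encoding + first 10 lines
>>         print(file, encoding)
>>         print('\n'.join(text.splitlines()[:10]))
>>     if not worked:
>>         print('no encoding worked for ' + file)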
>>
>> Just a thought.
>
>
>Whilst @Mark follows the above post with a relatively-automated
>suggestion (will still require manual inspection), I was going to
>suggest something similar to this idea of a test-bed routine ranging
>across a set of 'likely' encodings until you found happiness.
>
>This because you must surely have some hint of an idea of the source of
>the files, eg if they have come from a German, Danish, ... user.
>Accordingly, running through a range of the ISO 8859 variants which
>were employed by MS-Win OpSys may yield one choice which works without
>error.
Aye.
We've got an importer for some CSV data in a current project with exactly
that problem, and exactly that heuristic:
    for encoding in 'utf-8', 'windows-1252', 'cp932':
        try:
            return line.decode(encoding)
        except UnicodeDecodeError:
            pass
    warning(
        '%r, line %d: cannot decode line %r, falling back to iso8859-1',
        self.filename, self.lineno, line
    )
    return line.decode('iso8859-1')
The choice of encodings above is entirely parochial to our source data, and
still hits the fallback.
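For experimenting outside that importer, the same heuristic can be sketched as a standalone function. This is only an illustration: the name `decode_line`, its parameters, the returned (text, encoding) pair and the use of the standard `logging` module are assumptions, not the project's actual code.

```python
import logging

# Candidate encodings, tried in order. These are the ones from the snippet
# above and are specific to that data; substitute your own likely suspects.
CANDIDATES = ('utf-8', 'windows-1252', 'cp932')


def decode_line(line, filename='<unknown>', lineno=0, candidates=CANDIDATES):
    """Decode the bytes in `line`, returning (text, encoding_used).

    Hypothetical helper: tries each candidate in turn and falls back to
    iso8859-1, which accepts every byte value.
    """
    for encoding in candidates:
        try:
            return line.decode(encoding), encoding
        except UnicodeDecodeError:
            pass
    logging.warning(
        '%r, line %d: cannot decode line %r, falling back to iso8859-1',
        filename, lineno, line,
    )
    return line.decode('iso8859-1'), 'iso8859-1'
```

Returning the encoding alongside the text makes it easier to see, file by file, which guess was actually taken.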
Sean, note that in the above code:
Successful decoding DOES NOT mean the correct encoding was used, as various
byte sequences can decode in multiple encodings, yielding different
outcomes. See: https://en.wikipedia.org/wiki/Mojibake
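A small illustration of that point, using a made-up sample word (the string 'café' is an assumption chosen because the files look French): the same bytes decode without error under three encodings, but only one result is right.

```python
# UTF-8 bytes for a French-looking word: b'caf\xc3\xa9'
data = 'café'.encode('utf-8')

as_utf8 = data.decode('utf-8')            # the intended text
as_cp1252 = data.decode('windows-1252')   # no error, but mojibake
as_latin1 = data.decode('iso8859-1')      # no error, but mojibake

for name, text in [('utf-8', as_utf8),
                   ('windows-1252', as_cp1252),
                   ('iso8859-1', as_latin1)]:
    print(name, repr(text))
```

None of the three calls raises, so "it decoded" cannot be the only test; the output still has to be looked at.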
The fallback above ('iso8859-1', ISO Latin 1) is an 8-bit encoding where all
bytes are 1-to-1 with the target ordinal. It will _always_ decode
successfully because every byte is accepted. The same holds for any ISO8859
variant that assigns a character to all 256 byte values, though a few (such
as iso8859-6) leave some bytes undefined. That doesn't mean it is correct.
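This can be checked directly. The sketch below (variable names are illustrative only) decodes every possible byte value with iso8859-1 and confirms each one maps to the code point of the same number, while the same input is rejected by utf-8.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# iso8859-1 never raises: byte N becomes the character with ordinal N.
text = all_bytes.decode('iso8859-1')
ordinals = [ord(c) for c in text]

# utf-8 is stricter and refuses this input.
try:
    all_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(ordinals == list(range(256)), utf8_ok)
```

This is why iso8859-1 works as a last-resort fallback and also why it should be tried last: it can never signal that it was the wrong guess.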
Cheers,
Cameron Simpson <cs at cskk.id.au>