[Tutor] Unknown encoded file types.

dn PyTutor at DancesWithMice.info
Sun Feb 7 13:54:34 EST 2021


On 08/02/2021 02.07, Alan Gauld via Tutor wrote:
> On 07/02/2021 09:55, mhysnm1964 at gmail.com wrote:
> 
>> to work out how to clean the file to remove any text that doesn't fall within
>> the western language. As far as I am aware, only European / English should be
>> present. More English than anything else.
> 
> Most of the non-standard characters in Latin encodings for European
> languages are to cater for all the ornamentations that Europeans seem to
> like: umlaut, circumflex, cedilla, grave etc. Stripping out those
> characters will leave words that are hard to translate.
> 
> The good news is that there is a finite number of such encodings, so it's
> not impossible. Given you have a relatively small set of files (in
> computer terms), you could just try each decoding in turn and display, say,
> the first 10 lines for all that don't give an error. That would let
> you select what looks like the best for each file.
> 
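> import pathlib
>
> # Candidate encodings; extend with whichever others seem likely.
> encodings = ['ascii', 'utf8', 'utf16', 'latin1']  # ... etc
> files = pathlib.Path('.').glob('*.txt')  # assumption: adjust to suit
>
> for file in files:
>     worked = False
>     for encoding in encodings:
>         try:
>             lines = file.read_text(encoding=encoding).splitlines()[:10]
>         except UnicodeError:
>             continue  # this encoding cannot decode the file
>         worked = True
>         print(file, encoding, *lines, sep='\n')
>     if not worked:
>         print('no encoding worked for', file)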
> 
> Just a thought.


Whilst @Mark follows the above post with a relatively automated
suggestion (which will still require manual inspection), I was going to
suggest something similar to this idea: a test-bed routine ranging
across a set of 'likely' encodings until you find happiness.

This is because you must surely have some hint of the source of
the files, eg if they have come from a German, Danish, ... user.
Accordingly, running through a range of the ISO 8859 variants (and the
similar code pages, such as cp1252, employed by MS-Win OpSys) may yield
one choice which works without error.
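
As a rough sketch of that run-through (not a definitive routine), something
like the following could report which encodings decode a file's bytes without
error. The helper name candidate_encodings and the list LIKELY_ENCODINGS are
my own inventions for illustration; the list is only a guess at what might be
'likely' and should be adjusted to the suspected origin of the files.

```python
# Hypothetical helper: the names and the default list are illustrative only.
LIKELY_ENCODINGS = ("ascii", "utf_8", "cp1252", "iso8859_15", "iso8859_1")


def candidate_encodings(data, encodings=LIKELY_ENCODINGS):
    """Return the encodings that decode `data` (bytes) without error."""
    survivors = []
    for encoding in encodings:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not legal in this encoding
        survivors.append(encoding)
    return survivors


if __name__ == "__main__":
    # Sample bytes standing in for open(filename, "rb").read()
    sample = "Grüße aus Köln".encode("cp1252")
    print(candidate_encodings(sample))
```

Beware that iso8859_1 accepts every possible byte, so it always survives.
Decoding without error shows only that the bytes are legal in that encoding,
not that the resulting text is correct, hence the manual inspection.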

Failing that, start with the source (?Luke), and see if a visual
inspection of the file-contents, using NotePad/editor/word-processor
shows you any of those umlauts or other diacritical marks.
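
If the editor itself mangles the display, the visual inspection can be helped
along from Python. This is a minimal sketch, assuming the file has been read
in binary mode; the function names non_ascii_summary and show_contexts are
hypothetical. It counts the bytes outside the ASCII range and shows each one
with its neighbours, so that the surrounding (English) letters hint at which
accented character was intended.

```python
from collections import Counter


def non_ascii_summary(data):
    """Count each byte value >= 0x80 found in `data` (bytes)."""
    return Counter(byte for byte in data if byte >= 0x80)


def show_contexts(data, width=10):
    """Yield (offset, surrounding bytes) for each non-ASCII byte."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            start = max(0, offset - width)
            yield offset, data[start:offset + width + 1]


if __name__ == "__main__":
    # Sample bytes standing in for open(filename, "rb").read()
    sample = b"Gr\xfc\xdfe aus K\xf6ln"
    for value, count in sorted(non_ascii_summary(sample).items()):
        print(hex(value), count)
    for offset, context in show_contexts(sample, width=4):
        print(offset, context)
```

The byte values found can then be looked up in the code charts of the
candidate encodings to see which one yields sensible characters.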

Did you (offer to) publish the (pertinent) contents of a sample file
together with the full error-listing generated by Python?
-- 
Regards,
=dn

