[Tutor] Unknown encoded file types.

Sun Feb 7 08:07:23 EST 2021

On 07/02/2021 09:55, mhysnm1964 at gmail.com wrote:

> to work out how to clean the file to remove any text that don't fall within
> the western language. Far as I am aware, only European / English should be
> present. More English than anything else.

Most of the non-standard characters in the Latin encodings for European
languages are there to cater for all the ornamentation that Europeans
seem to like: umlaut, circumflex, cedilla, grave and so on. Stripping out
those characters will leave words that are hard to translate.
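
To see the effect, here is a small sketch of what throwing away
everything outside ASCII does to a few words. The helper name
strip_non_ascii and the sample words are only my own illustration, not
anything taken from your files.

```python
def strip_non_ascii(word):
    """Drop every character that the ascii codec cannot encode."""
    # errors="ignore" silently discards anything that is not plain ASCII
    return word.encode("ascii", errors="ignore").decode("ascii")


# Illustrative sample words only, not taken from the files in question.
for word in ["Mädchen", "garçon", "fête"]:
    print(word, "->", strip_non_ascii(word))
```

The accented letters are simply dropped, and what is left is unlikely to
be recognised by a dictionary or a translation tool.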

The good news is that there is a finite number of such encodings, so
it's not impossible to work through them. Given that you have a
relatively small set of files (in computer terms), you could just try
each decoding in turn and display, say, the first 10 lines for all that
don't give an error. That would let you select what looks like the best
for each file.

encodings = ['ascii', 'utf8', 'utf16', 'latin8539', 'latin.....']  # etc.

for file in files:
    worked = False
    for encoding in encodings:
        try:
           decode file
           display filename + encoding + 10 lines
           worked = True
        except: continue
    if not worked: print("no encoding worked for " + file)
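
In real Python that might look something like the sketch below. The
names try_encodings and report, the candidate list and the 10-line
preview are my own choices for illustration, so extend the list with
whichever codecs you suspect. One thing to watch: latin-1 maps every
possible byte to a character, so it never raises an error and has to be
judged by eye.

```python
from pathlib import Path

# Candidate codecs: an assumed starting list, not a complete one.
# latin-1 accepts any byte sequence, so it always "works".
ENCODINGS = ["ascii", "utf-8", "utf-16", "cp1252", "latin-1"]


def try_encodings(path, encodings=ENCODINGS, preview_lines=10):
    """Return {encoding: first lines} for each codec that decodes the whole file."""
    raw = Path(path).read_bytes()
    results = {}
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except (UnicodeError, LookupError):
            # UnicodeError: bytes are invalid for this codec
            # LookupError: Python does not know a codec by this name
            continue
        results[encoding] = text.splitlines()[:preview_lines]
    return results


def report(files):
    """Print a preview per file and encoding so you can pick the best by eye."""
    for name in files:
        results = try_encodings(name)
        if not results:
            print("no encoding worked for", name)
        for encoding, lines in results.items():
            print(f"--- {name} [{encoding}] ---")
            print("\n".join(lines))
```

Calling report(["a.txt", "b.txt"]) with your own file names prints one
preview for each encoding that decoded cleanly. Reading the bytes once
and decoding them in memory avoids reopening the file for every codec.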

Just a thought.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos