[Tutor] Unknown encoded file types.
Alan Gauld
alan.gauld at yahoo.co.uk
Sun Feb 7 08:07:23 EST 2021
On 07/02/2021 09:55, mhysnm1964 at gmail.com wrote:
> to work out how to clean the file to remove any text that don't fall within
> the western language. Far as I am aware, only European / English should be
> present. More English than anything else.
Most of the non-standard characters in the Latin encodings for European
languages are there to cater for all the ornamentation that Europeans
seem to like: umlaut, circumflex, cedilla, grave, etc. Stripping out
those characters will leave words that are hard to translate.
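To make that concrete, here is a minimal sketch of what blind stripping
does to accented text. The helper name strip_non_ascii is made up for
this illustration; it only shows the problem and is not a recommended fix.

```python
def strip_non_ascii(text):
    """Drop every character outside ASCII (hypothetical helper, for illustration)."""
    # errors="ignore" silently discards anything ASCII cannot represent.
    return text.encode("ascii", "ignore").decode("ascii")


print(strip_non_ascii("naïve café"))  # the accented letters simply vanish
```

The accented letters are dropped rather than replaced, so the remaining
words no longer match anything a dictionary or translator would know.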
The good news is that there is a finite number of such encodings, so
it's not impossible. Given that you have a relatively small set of files
(in computer terms), you could just try each decoding in turn and
display, say, the first 10 lines for every one that doesn't give an
error. That would let you select what looks like the best for each file.
encodings = ['ascii', 'utf8', 'utf16', 'latin_1', 'latin..... etc]
for file in files:
    worked = False
    for encoding in encodings:
        try:
            decode file
            display filename + encoding + 10 lines
            worked = True
        except: continue
    if not worked:
        print("no encoding worked for " + file)
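A runnable version of that idea might look something like the sketch
below. The function names (preview_decodings, report) and the candidate
list are my own assumptions for illustration, so swap in whichever
codecs you actually suspect. One caveat: latin_1 maps every possible
byte to a character, so it never raises an error; it is best kept last
and judged by eye rather than treated as proof of the right encoding.

```python
from pathlib import Path

# Candidate codecs to try, in order. This list is an assumption for
# illustration; extend it with whatever encodings you suspect.
CANDIDATES = ["ascii", "utf_8", "utf_16", "cp1252", "latin_1"]


def preview_decodings(raw, encodings=CANDIDATES, max_lines=10):
    """Return {encoding: first max_lines lines} for each codec that decodes raw."""
    previews = {}
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except (UnicodeError, LookupError):
            # Wrong codec for these bytes, or a codec name Python doesn't know.
            continue
        previews[encoding] = text.splitlines()[:max_lines]
    return previews


def report(paths, encodings=CANDIDATES):
    """Print a preview per file and codec so a human can pick the best one."""
    for path in paths:
        previews = preview_decodings(Path(path).read_bytes(), encodings)
        if not previews:
            print(f"no encoding worked for {path}")
            continue
        for encoding, lines in previews.items():
            print(f"--- {path} [{encoding}] ---")
            for line in lines:
                print(line)
```

Calling report(["a.txt", "b.txt"]) (file names made up here) prints one
preview per codec that decoded cleanly. A clean decode only means the
bytes were legal in that codec, not that the text is right, which is why
the final choice is still made by reading the output.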
Just a thought.
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos