[Tutor] Unknown encoded file types.
Alan Gauld
alan.gauld at yahoo.co.uk
Sun Feb 7 04:19:06 EST 2021
On 07/02/2021 08:07, mhysnm1964 at gmail.com wrote:
> I have hundreds of small plain text files that are under 5k each. I am
> concatenating them into one big text file. The issue I am having is getting
> encoding errors. I have tried to open them with the encoding parameter on the
> "with open" command. Some of the files are throwing UTF encoding errors,
> so it looks like they are not in that format. The only reliable way I have
> managed to open the files is in binary mode.
Yes, that's right, the only reliable way of opening a file,
if you don't know what is in it, is using binary mode and
treating it as a stream of bytes.
You can interrogate the bytes and see if you recognise any
of them, or a sequence of them, and from that infer an encoding.
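As a rough illustration of interrogating the bytes, here is a minimal
sketch that looks for a byte-order mark (BOM) at the start of the data
and, failing that, checks whether the bytes happen to be valid UTF-8.
The function name sniff_encoding and the BOMS table are made up for this
example; only the codecs constants and bytes.decode come from the
standard library. A result of None means "not recognised", not "not text".

```python
import codecs

# Byte-order marks, longest first, so that a UTF-32 LE BOM
# (FF FE 00 00) is not mistaken for a UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes):
    """Return a guessed codec name, or None if nothing is recognised."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No BOM: plain ASCII and BOM-less UTF-8 both decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

You would call it on bytes read in binary mode, for example
sniff_encoding(open(name, "rb").read()). Files in the old 8-bit
encodings carry no such marker, so they will usually come back as None.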
You say they are text files, but how do you know? Even if
they have a .txt extension, that's no guarantee that they
are really text. And if they are, how old are they? If they
are more than 20 years old, you are likely to be facing all
manner of weird encodings.
> Is there any way to identify a file's encoding before opening it, so that
> I can convert it? I have seen some info on the net and don't understand it.
Not with certainty. There are tools that can look at the first
few bytes and make an intelligent guess, but none is fully reliable.
Opening a file without knowing what is in it is always fraught
with issues.
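Given that no guess is certain, one pragmatic option for the
concatenation job is to read every file as bytes, try a short list of
likely encodings in order, and fall back to one that cannot fail. This
is only a sketch under assumptions: the candidate list (UTF-8, then
Windows cp1252), the "*.txt" pattern, and the names decode_best_effort
and concatenate are all guesses made for illustration, not something
taken from the original question.

```python
from pathlib import Path

# Assumed candidates, tried in order; adjust to suit your files.
CANDIDATES = ("utf-8-sig", "cp1252")


def decode_best_effort(data: bytes) -> str:
    """Decode bytes with the first candidate encoding that works."""
    for enc in CANDIDATES:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never raises,
    # but the text may be wrong if the file used some other encoding.
    return data.decode("latin-1")


def concatenate(src_dir, out_path):
    """Write all *.txt files in src_dir into one UTF-8 file."""
    out_path = Path(out_path)
    # newline="" stops Python translating line endings on output.
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in sorted(Path(src_dir).glob("*.txt")):
            if path.resolve() == out_path.resolve():
                continue  # do not read the output file back in
            out.write(decode_best_effort(path.read_bytes()))
            out.write("\n")
```

The output is always valid UTF-8, but a file that was really in some
other 8-bit encoding will come through with the wrong characters rather
than an error, so it is worth spot-checking the result.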
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos