[Tutor] Unknown encoded file types.

Sun Feb 7 04:54:48 EST 2021

On 07Feb2021 19:07, Sean Murphy <mhysnm1964 at gmail.com> wrote:
>Windows 10, python 3.8 is what I am using.
>
>I have 100's of small plain text files that are under 5k each. I am
>concatenating them into one big text file. The issue I am having is getting
>encoding errors. I have tried to open them with the encode parameter on the
>"with open" command. Some of the files are throwing encoding UTF errors.
>Looking like they are not in that format. The only reliable way I have
>managed to open the files  is in binary mode.
>
>with open(filename, 'rb') as fp:
>    content = fp.read()

That's a fast way to read the whole file, provided that you know it is 
small (you say above that you do, so good). But in other circumstances 
this is an invitation to consume an arbitrary amount of memory.
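
For files of unknown size, a bounded-memory copy is the safer habit. Here 
is a minimal sketch using the standard library; the function name 
append_file and the 64 KiB chunk size are my own choices for 
illustration, not anything from your code:

```python
import shutil

def append_file(src_path, out_fp, chunk_size=64 * 1024):
    """Append the raw bytes of src_path to out_fp, an open binary file."""
    with open(src_path, 'rb') as in_fp:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so memory use stays bounded however large the source is.
        shutil.copyfileobj(in_fp, out_fp, chunk_size)
```

You would open the output once in 'wb' mode and call append_file() for 
each input file in turn.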

>I don't need to process the content thus why I am not using 
>fp.readline()

Ah, but you do if there's scope for multiple encodings. If you know 
they're all the _same_ encoding you can just concatenate the bytes.  
Maybe forcing some newlines between them (but, hahaha, that requires 
knowing the encoding).
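
As a rough sketch of that same-encoding case: the function name 
concat_same_encoding and the 'utf-8' default are assumptions of mine, and 
it presumes an encoding such as UTF-8 or cp1252 where encoding '\n' does 
not add a byte order mark (the plain 'utf-16' codec does, so it would 
need more care):

```python
def concat_same_encoding(paths, out_path, encoding='utf-8'):
    """Concatenate files as raw bytes, all assumed to share one encoding."""
    # The newline separator has to be encoded the same way as the files.
    newline = '\n'.encode(encoding)
    with open(out_path, 'wb') as out_fp:
        for path in paths:
            with open(path, 'rb') as in_fp:
                data = in_fp.read()  # fine here: the files are known to be small
            out_fp.write(data)
            # Force a line break only if the file did not end with one.
            if data and not data.endswith(newline):
                out_fp.write(newline)
```

No decoding happens here, so a file in the wrong encoding will not raise 
an error; it will just leave mixed bytes in the output.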

But if they're mixed, putting them all in one file implies converting 
them all to the _same_ encoding.

Do you know if they're all the same encoding? Or might they be a mix?

>Is there any way to identify the encoded format before opening to 
>change the encoded format? I have seen some info on the net and don't understand it.

Reliably? No. There was recently quite a lot of discussion around this 
problem on python-ideas.

There _are_ libraries which try to identify the encoding of a file 
presumed to be text.

Regardless, the approach would be:
- open the file to learn its encoding (using some library)
- then open the file in the correct encoding
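
A minimal sketch of those two steps, using only the standard library. 
Note that this swaps the detection library for a simpler technique: try a 
short list of candidate encodings in order and take the first that 
decodes cleanly. The candidate list ('utf-8-sig', then 'cp1252', with 
'latin-1' as a last resort) is purely my assumption about what a Windows 
machine is likely to hold, and the function names are made up for 
illustration. A successful decode is still a guess, not proof:

```python
def decode_guess(data, candidates=('utf-8-sig', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes data."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value to a character, so it cannot fail,
    # but the resulting text may be wrong.
    return data.decode('latin-1'), 'latin-1'

def concat_as_utf8(paths, out_path):
    """Decode each file with a guessed encoding and write one UTF-8 file."""
    # newline='' stops Python translating line endings on output.
    with open(out_path, 'w', encoding='utf-8', newline='') as out_fp:
        for path in paths:
            with open(path, 'rb') as in_fp:
                text, _ = decode_guess(in_fp.read())
            out_fp.write(text)
            # Keep files separate if one lacks a trailing newline.
            if text and not text.endswith('\n'):
                out_fp.write('\n')
```

Strict UTF-8 goes first because arbitrary non-UTF-8 bytes rarely decode 
as valid UTF-8, whereas cp1252 accepts almost anything. A detection 
library could replace decode_guess() without changing the rest.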

Cheers,
Cameron Simpson <cs at cskk.id.au>