[Tutor] Unknown encoded file types.

Sun Feb 7 06:02:13 EST 2021

Cameron,

I know they are not all the same encoding. The files which throw the
encoding errors open fine in a text editor as plain English. I wouldn't be
surprised if some are in plain ASCII or some other type of UTF. Your comment
on using readline instead of read is interesting. From my understanding, the
first 2 or 4 bytes of a file are where the UTF (encoding) information is
stored. Is this correct?
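
A minimal sketch of checking those leading bytes, assuming the marker in
question is a byte order mark (BOM). A BOM is optional: plain ASCII files and
most UTF-8 files carry none, so a missing BOM says nothing about the encoding.
The function name sniff_bom is hypothetical.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones, because the
# UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

The returned codec names ("utf-8-sig", "utf-16", "utf-32") all consume the BOM
while decoding, so it does not end up in the decoded text.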

My understanding of the difference between readline and read is how the
information is stored: readline stores it in a list while read stores it as a
string. Can you read a single line from each file? I haven't looked into
this. If I take the first line of each file and test it against plain ASCII
and a few common UTF formats, this might give me more info.
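
For reference, readline() returns one line as a single str (or bytes in binary
mode), read() returns the whole file as one object, and it is readlines() that
returns a list. A minimal sketch of testing only the first line against a few
encodings follows. The function name and the candidate list are assumptions
chosen for illustration.

```python
import io


def encodings_that_decode_first_line(fp, candidates=("ascii", "utf-8", "cp1252")):
    """Read one line from a binary file object; list the candidates that decode it."""
    line = fp.readline()  # bytes up to and including the first b"\n"
    matches = []
    for enc in candidates:
        try:
            line.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent these bytes
        matches.append(enc)
    return matches


# In-memory stand-in for a file opened with open(filename, "rb").
sample = io.BytesIO("café au lait\nsecond line\n".encode("utf-8"))
first_line_matches = encodings_that_decode_first_line(sample)
```

Two caveats: a successful decode only shows the bytes are valid in that
encoding, not that it is the right one, and a first line that is pure ASCII
says nothing about later lines. Binary readline() also splits on the single
byte 0x0A, which suits ASCII-compatible encodings but not UTF-16 or UTF-32.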

Note: I forgot to mention this before. When I had about 10 files in a bytes
variable, I could not convert it to ASCII or UTF using the decode method. It
complained about a character that could not be decoded and gave an offset
in the error message. I will post the error tomorrow. It is too late here now.
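
A minimal sketch of pulling that offset out of the exception, assuming the
error is a UnicodeDecodeError raised by bytes.decode(). The function name
describe_decode_failure and the context parameter are hypothetical.

```python
def describe_decode_failure(data, encoding="utf-8", context=10):
    """Return (offset, bad_bytes, surrounding_bytes), or None if data decodes."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the undecodable bytes within data.
        lo = max(exc.start - context, 0)
        return exc.start, data[exc.start:exc.end], data[lo:exc.end + context]
    return None
```

Looking at the bytes around the offset often hints at the real encoding. If
losing a few characters is acceptable, data.decode("utf-8", errors="replace")
substitutes U+FFFD for each undecodable sequence instead of raising.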

Sean 

-----Original Message-----
From: Tutor <tutor-bounces+mhysnm1964=gmail.com at python.org> On Behalf Of
Cameron Simpson
Sent: Sunday, 7 February 2021 8:55 PM
To: tutor at python.org
Subject: Re: [Tutor] Unknown encoded file types.

On 07Feb2021 19:07, Sean Murphy <mhysnm1964 at gmail.com> wrote:
>Windows 10, python 3.8 is what I am using.
>
>I have 100's of small plain text files that are under 5k each. I am 
>concatenating them into one big text file. The issue I am having is 
>getting encoding errors. I have tried to open them with the encode 
>parameter on the "with open" command. Some of the files are throwing
>encoding UTF errors.
>Looking like they are not in that format. The only reliable way I have 
>managed to open the files  is in binary mode.
>
>with open(filename, 'rb') as fp:
>    content = fp.read()

That's a fast way to read the whole file, provided that you know it is small
(you say above that you do, so good). But in other circumstances this is an
invitation to consume an arbitrary amount of memory.
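
A minimal sketch of the bounded-memory alternative, assuming the bytes are
copied unchanged: shutil.copyfileobj() moves the data in fixed-size chunks.
The function name append_file and the chunk size are hypothetical.

```python
import shutil


def append_file(src_path, dst_fp, chunk_size=64 * 1024):
    """Copy src_path into the open binary file dst_fp, one chunk at a time."""
    with open(src_path, "rb") as src:
        # At most chunk_size bytes are held in memory at once.
        shutil.copyfileobj(src, dst_fp, chunk_size)
```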

>I don't need to process the content thus why I am not using
>fp.readline()

Ah, but you do if there's scope for multiple encodings. If you know they're
all the _same_ encoding you can just concatenate the bytes, maybe forcing
some newlines between them (but, hahaha, that requires knowing the
encoding).

But if they're mixed, putting them all in one file implies converting them
all to the _same_ encoding.

Do you know if they're all the same encoding? Or might they be a mix?

>Is there any way to identify the encoded format before opening to 
>change the encoded format? I have seen some info on the net and don't
>understand it.

Reliably? No. There was recently quite a lot of discussion around this
problem on python-ideas.

There _are_ libraries which try to identify the encoding of a file presumed
to be text.

Regardless, the approach would be:
- open the file to learn its encoding (using some library)
- then open the file in the correct encoding
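
The two steps above can be sketched with the standard library alone, using a
BOM check plus a try-each-candidate loop in place of a detection library.
Third-party packages such as chardet do the guessing more thoroughly. The
function names, the candidate order and the choice of UTF-8 for the output
are all assumptions.

```python
import codecs


def guess_encoding(data):
    """Best-effort guess at the encoding of data (bytes). Not reliable."""
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    for enc in ("utf-8-sig", "cp1252"):  # plain ASCII also passes as UTF-8
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    return "latin-1"  # decodes any bytes, possibly to the wrong characters


def concatenate(paths, out_path):
    """Decode each file with its guessed encoding and write one UTF-8 file."""
    # newline="" writes line endings exactly as they appear in the text.
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in paths:
            with open(path, "rb") as fp:
                data = fp.read()  # fine here: the files are known to be small
            text = data.decode(guess_encoding(data))
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep files from running together
```

Trying UTF-8 first is deliberate: random non-UTF-8 bytes rarely form valid
UTF-8, whereas cp1252 and latin-1 accept almost anything and so only make
sense as fallbacks.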

Cheers,
Cameron Simpson <cs at cskk.id.au>