[Tutor] Unknown encoded file types.

Cameron Simpson cs at cskk.id.au
Sun Feb 7 07:04:56 EST 2021


On 07Feb2021 22:02, Sean Murphy <mhysnm1964 at gmail.com> wrote:
>I know they are not all the same encoding. The files which  throw the
>encoding errors open fine in a text editor as plain English. I wouldn't be
>surprised some are in plain ASCII or using other type of UTF.

Plain English fits in ASCII. It will look exactly like ASCII in UTF8, 
and like 2-byte pairs of ASCII+NUL or NUL+ASCII in UTF16.
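
A quick way to see this for yourself, as a small sketch using only the
built-in str.encode method (the sample string "Hi" is just an
illustration):

```python
text = "Hi"

# Plain English text produces the same bytes in ASCII and UTF-8.
as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

# In UTF-16 each character becomes a 2-byte pair: the ASCII byte plus a NUL.
# The "-le"/"-be" codec names fix the byte order and do not write a BOM.
as_utf16le = text.encode("utf-16-le")
as_utf16be = text.encode("utf-16-be")

print(as_ascii, as_utf8, as_utf16le, as_utf16be)
```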

The summary of my verbiage below is that for your situation you may be 
able to infer (not deduce) a character set and encoding from sniffing 
the first 2 bytes.

>Your comment
>on using readline instead of read is interesting. As the first 2 or 4 bytes
>of a file from my understanding is where the UTF (encoding) information is
>stored. Is this correct?

_If_ it is UTF16. Possibly. Plain ASCII, for example, has nothing.

I understand that some common Windows programmes (Notepad?) write UTF16, 
likely UTF16LE (little endian). I also understand that often these text 
files start with a BOM (byte order marker): the bytes FE FF for UTF16BE 
and FF FE for UTF16LE.

If you scan the first 2 bytes and they are FF FE or FE FF and you're 
expecting text, then I think it very likely that they're UTF16. And open 
accepts, for example, encoding='utf-16le' as an option; plain 
encoding='utf-16' uses the BOM to pick the byte order and strips it.

See the codecs module in the standard library for information on 
available encodings. I noticed there that there are several byte order 
mark sequences. Be aware that ASCII, UTF-8 and the various ISO8859 8-bit 
charsets often don't have these markers at the start (the 8 bit ones 
certainly won't).
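
As a rough sketch of BOM sniffing with the constants from the codecs
module: the function name sniff_bom and the choice of which codec name
to return for each BOM are my own assumptions, not anything the module
prescribes. A None result only means "no BOM seen", not "this is ASCII".

```python
import codecs

# Check the longest BOMs first: the UTF-32LE BOM begins with the same
# 2 bytes as the UTF-16LE BOM, so order matters.
# The codec names chosen here all consume the BOM when decoding.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head):
    """Return a codec name inferred from a leading BOM, or None if no BOM."""
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return None
```

You would call it on the first few bytes of a file opened in binary
mode, e.g. sniff_bom(f.read(4)), and then reopen the file in text mode
with the encoding it returns.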

>My understanding of the difference between readline and read is how the
>information is stored. Readline stores it in a list while read stores as a
>string.

No, they both return a string in text mode. In binary mode read returns 
a bytes object, and readline returns a bytes object ending in code 10 
(newline, on the assumption that the bytes might be ASCII or ASCIIlike).  
readline is available in binary mode too, where it splits on byte 10 
only.
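
A small sketch to check the return types yourself. The file name
sample.txt and its contents are made up for the demonstration, and it
lives in a temporary directory so nothing is left behind:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "wb") as f:
        f.write(b"first line\nsecond line\n")

    # Text mode: both readline() and read() return str.
    with open(path, "r", encoding="ascii") as f:
        text_line = f.readline()
        text_rest = f.read()

    # Binary mode: both return bytes; readline() stops after byte 10.
    with open(path, "rb") as f:
        bytes_line = f.readline()
        bytes_rest = f.read()

    # readlines() (plural) is the method which returns a list.
    with open(path, "r", encoding="ascii") as f:
        all_lines = f.readlines()

print(type(text_line), type(bytes_line), type(all_lines))
```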

>Can you read a single line from each file?

Reading a line requires recognising line endings, which depends on the 
encoding. But many character sets use ASCII as a base and newlines at 
the end of lines. (Or carriage returns, eg MacOS9 and earlier).

>I haven't looked into
>this. If I look at each first file and test it against plain ascii and a few
>UTF common file formats. This might give me more info.

That will be guesswork, but within a given domain (eg yours) it may be 
reliable.

>Note: for got to mention before. When I had about 10 files in a byte
>variable. I could not convert into ASCII or UTF using the decode method.

There are various UTFs. Unicode is a mapping of "code points" (ordinals) 
to characters (and some other things). UTF8, UTF16 et al are different 
_encodings_ of those ordinals for storage in bytes in a file.

So UTF8 has a variable number of bytes per ordinal which among its 
features are (a) it is compact for Western alphabets and (b) identical 
to ASCII for the characters which are in the ASCII range. UTF16 uses 
2 bytes per ordinal for anything in the 16 bit range, less compact for 
Western text but close to fixed width.

There are ordinals in Unicode beyond the 16 bit range, BTW. UTF16 
stores those as a pair of 2-byte units (4 bytes).
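
To make the sizes concrete, here is a small sketch which counts the
bytes each encoding needs for a few sample characters. The characters
are my own picks: an ASCII letter, an accented Latin letter, the euro
sign, and an emoji from beyond the 16 bit range.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to (bytes in UTF-8, bytes in UTF-16).
sizes = {}
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))  # "-le" so no BOM is counted
    sizes[ch] = (utf8_len, utf16_len)
    print(hex(ord(ch)), utf8_len, utf16_len)
```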

>It
>complained about a character not being able to be decoded. Gave some offset
>in the error message. I will post the error tomorrow. To late here now.

A successful decode requires decoding all the bytes, so if your bytes 
end part way through a character the decode will fail even if you're 
using the right encoding.

But any UTF16 encoding will be an even number of bytes.
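
A sketch of both failure modes, using made up sample strings: a UTF8 
byte string cut off part way through a character, and a UTF16 byte 
string with an odd length. The errors="replace" line at the end is one 
option for seeing what did decode; it is not a fix for a wrong 
encoding.

```python
# "caf\u00e9" ends in an accented e, which takes 2 bytes in UTF-8.
data = "caf\u00e9".encode("utf-8")
truncated = data[:-1]  # chop the last byte, leaving half a character

utf8_error = None
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as e:
    utf8_error = e
    print("UTF-8 failed at offset", e.start, "-", e.reason)

# An odd number of bytes can never be complete UTF-16.
odd = "Hi".encode("utf-16-le")[:-1]

utf16_error = None
try:
    odd.decode("utf-16-le")
except UnicodeDecodeError as e:
    utf16_error = e
    print("UTF-16 failed at offset", e.start, "-", e.reason)

# errors="replace" substitutes U+FFFD for whatever cannot be decoded.
partial = truncated.decode("utf-8", errors="replace")
```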

If you're dealing with "common" Windows files I'd expect to be able to 
sniff the first 2 bytes for a 16 bit BOM - if present, presume UTF16 in 
whichever flavour. Otherwise try UTF8 and see how it goes.
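
That approach might look something like the sketch below. The function 
name decode_text is hypothetical, and it deliberately does no more than 
the two steps described: it raises UnicodeDecodeError if the data is 
neither, leaving any further guessing to you.

```python
import codecs


def decode_text(data):
    """Guess: UTF-16 if a 16 bit BOM is present, otherwise try UTF-8."""
    if data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        # The plain "utf-16" codec reads the BOM to pick the byte order
        # and does not include it in the result.
        return data.decode("utf-16")
    # "utf-8-sig" accepts UTF-8 with or without a leading UTF-8 BOM.
    return data.decode("utf-8-sig")
```

For a file you would read the whole thing in binary mode first, e.g. 
decode_text(open(path, "rb").read()), where path is whichever file you 
are testing.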

The various BOMs in the codecs module suggest there might be other BOM 
sequences worth trying (not all 2 bytes long). And let us not get into 
other character sets and encodings (not ASCII, not UTF, eg EBCDIC or 
Shift JIS).

Cheers,
Cameron Simpson <cs at cskk.id.au>

