[Tutor] Unknown encoded file types.

Wed Feb 10 22:26:38 EST 2021

On 2/10/21 5:05 PM, dn via Tutor wrote:
> On 10/02/2021 23.47, mhysnm1964 at gmail.com wrote:
>> All,
>>
>> Thank you for your assistance. After doing more investigation, there are some
>> unusual characters in the files which look like French or similar languages.
>> So I will play with your kind code samples and libraries to see what is
>> being used.
>
> Which leads this non-MS-Win user to ask:
>
> if UTF-8/standard Python has 'trouble' understanding files originally
> created in ISO 8859-n locale, eg MS-Win's Latin-1 etc; given MSFT's
> assumptions of eco-system/world-domination, does today's version of
> Win10 enjoy suitable backwards-compatibility and thus have no
> difficulties, eg a FileManager listing file-names or a NotePad
> displaying text-content?
>
> (is a/the solution to 'upgrade' to UTF-8 'there', and thereafter
> Python's I/O will perform without incident)

One note about Windows file system names: they come in two 'flavors'.
Short file names have a fixed 8.3 format and are always stored in the
system default 8-bit code page. 'Long File Names' can be much longer
(up to 255 characters) and are always stored in UTF-16 (originally
UCS-2). If a file has both a long name and a short name, the short name
is hidden from the user.

Note that because short file names use the system encoding, moving
removable media from one machine to another that uses a very different
code page can give some strange short file names.
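
As a rough illustration of that effect, the same stored bytes can be
decoded under several code pages. This is only a sketch: the file name
and the list of code pages are made up for the example, and it decodes
a byte string directly rather than reading a real directory entry.

```python
# Hypothetical 8.3 name as raw bytes; 0x82 is "é" in code page 437.
raw_name = b"R\x82SUM\x82.TXT"

# Decode the same bytes as machines with different code pages would.
views = {cp: raw_name.decode(cp) for cp in ("cp437", "cp866", "cp1252")}

for codepage, text in views.items():
    print(codepage, text)
```

The bytes never change; only the table used to interpret them does, so
a Western European DOS code page, a Cyrillic one, and the Windows
"ANSI" one each show a different name.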

The place you have the issue is with the CONTENTS of the file: unless
the file format somehow records its own encoding, you have to guess
which 8-bit code page the file was written with (or whether it used one
at all).
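
One way to do that guessing in Python is to try a short list of
candidate encodings in order and keep the first one that decodes
without error. This is a minimal sketch, not a reliable detector: the
function name guess_decode and the candidate list are assumptions for
illustration, and a clean decode does not prove the guess is right.

```python
def guess_decode(data: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    # A UTF-8 byte order mark is one of the few in-band hints available.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig", data.decode("utf-8-sig")
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("none of the candidate encodings fit the data")


# Example: the same word saved by two differently configured editors.
print(guess_decode("café".encode("utf-8")))
print(guess_decode("café".encode("cp1252")))
```

UTF-8 goes first because text in an 8-bit code page rarely happens to
be valid UTF-8. latin-1 goes last because it accepts every byte value,
so it never fails and only serves as a fallback. The result still needs
a human look to confirm the characters make sense.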

This issue is more of a problem on Windows because Windows' roots go
back significantly farther (especially in international markets), and
thus it has more legacy issues. The *nix world had the advantage of
going international later, at which point UTF-8 was a good option that
got around the code page encoding issues.

-- 
Richard Damon