tail
Barry Scott
barry at barrys-emacs.org
Sun May 8 14:15:09 EDT 2022
> On 7 May 2022, at 22:31, Chris Angelico <rosuav at gmail.com> wrote:
>
> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>>
>> MRAB <python at mrabarnett.plus.com> writes:
>>> On 2022-05-07 19:47, Stefan Ram wrote:
>> ...
>>>> def encoding( name ):
>>>>     path = pathlib.Path( name )
>>>>     for encoding in ( "utf_8", "latin_1", "cp1252" ):
>>>>         try:
>>>>             with path.open( encoding=encoding, errors="strict" ) as file:
>>>>                 text = file.read()
>>>>             return encoding
>>>>         except UnicodeDecodeError:
>>>>             pass
>>>>     return "ascii"
>>>> Yes, it's potentially slow and might be wrong.
>>>> The result "ascii" might mean it's a binary file.
>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>> by that encoding.
>>
>> Thank you! It's working for my specific application where
>> I'm reading from a collection of text files that should be
>> encoded in either utf_8, latin_1, or ascii.
>>
>
> In that case, I'd exclude ASCII from the check, and just check UTF-8,
> and if that fails, decode as Latin-1. Any ASCII files will decode
> correctly as UTF-8, and any file will decode as Latin-1.
>
> I've used this exact fallback system when decoding raw data from
> Unicode-naive servers - they accept and share bytes, so it's entirely
> possible to have a mix of encodings in a single stream. As long as you
> can define the span of a single "unit" (say, a line, or a chunk in
> some form), you can read as bytes and do the exact same "decode as
> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> perfectly ideal, but it's about as good as you'll get with a lot of
> US-based servers. (Depending on context, you might use CP-1252 instead
> of Latin-1, but you might need errors="replace" there, since
> Windows-1252 has some undefined byte values.)
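For what it's worth, here is a minimal sketch of the per-unit "UTF-8 if possible, otherwise Latin-1" dance described above, taking a line as the unit. The names decode_unit and decode_lines are made up for illustration and are not from anyone's actual code.

```python
def decode_unit(raw: bytes) -> str:
    """Decode one unit (here, a line) as UTF-8, falling back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00..0xFF to a code point,
        # so this fallback cannot raise.
        return raw.decode("latin-1")


def decode_lines(data: bytes) -> list:
    # Split on the newline byte first, so each line is judged on its own
    # and one Latin-1 line does not force the whole stream to Latin-1.
    return [decode_unit(line) for line in data.split(b"\n")]


# A stream with one UTF-8 line and one Latin-1 line.
mixed = "caf\u00e9".encode("utf-8") + b"\n" + "caf\u00e9".encode("latin-1")
print(decode_lines(mixed))
```

The catch is that a line that happens to be valid UTF-8 is always taken as UTF-8, even if the sender meant Latin-1, so this is a best guess rather than a detection.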
A very common error on Windows is that files, and especially web pages, that
claim to be UTF-8 are in fact CP-1252.
The HTML standards include logic to try UTF-8 and, if that fails, fall back to CP-1252.
It's usually the left and right "smart" quote chars that cause the issue, as their
CP-1252 byte values are not valid UTF-8.
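A rough sketch of that fallback in Python, assuming the input is a bytes object. The name decode_web_text is made up for illustration. errors="replace" is there because Python's cp1252 codec leaves a few byte values (0x81, for example) undefined.

```python
def decode_web_text(raw: bytes) -> str:
    """Try UTF-8 first; if that fails, fall back to CP-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Undefined CP-1252 bytes become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


# Smart quotes encoded as CP-1252 are the single bytes 0x93 and 0x94.
quoted = "\u201chello\u201d".encode("cp1252")

try:
    quoted.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    # 0x93 is a UTF-8 continuation byte and cannot start a character.
    strict_utf8_ok = False

print(strict_utf8_ok)
print(decode_web_text(quoted))
```

This only helps when the CP-1252 text contains at least one byte sequence that is invalid UTF-8; pure ASCII decodes the same either way.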
Barry
>
> ChrisA