tail

Sun May 8 14:27:10 EDT 2022

On Mon, 9 May 2022 at 04:15, Barry Scott <barry at barrys-emacs.org> wrote:
>
>
>
> > On 7 May 2022, at 22:31, Chris Angelico <rosuav at gmail.com> wrote:
> >
> > On Sun, 8 May 2022 at 07:19, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> >>
> >> MRAB <python at mrabarnett.plus.com> writes:
> >>> On 2022-05-07 19:47, Stefan Ram wrote:
> >> ...
> >>>> def encoding( name ):
> >>>>   path = pathlib.Path( name )
> >>>>   for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>>>       try:
> >>>>           with path.open( encoding=encoding, errors="strict" )as file:
> >>>>               text = file.read()
> >>>>           return encoding
> >>>>       except UnicodeDecodeError:
> >>>>           pass
> >>>>   return "ascii"
> >>>> Yes, it's potentially slow and might be wrong.
> >>>> The result "ascii" might mean it's a binary file.
> >>> "latin-1" will decode any sequence of bytes, so it'll never try
> >>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >>> anyway because the file could contain 0x80..0xFF, which aren't supported
> >>> by that encoding.
> >>
> >>  Thank you! It's working for my specific application where
> >>  I'm reading from a collection of text files that should be
> >>  encoded in either utf_8, latin_1, or ascii.
> >>
> >
> > In that case, I'd exclude ASCII from the check, and just check UTF-8,
> > and if that fails, decode as Latin-1. Any ASCII files will decode
> > correctly as UTF-8, and any file will decode as Latin-1.
> >
> > I've used this exact fallback system when decoding raw data from
> > Unicode-naive servers - they accept and share bytes, so it's entirely
> > possible to have a mix of encodings in a single stream. As long as you
> > can define the span of a single "unit" (say, a line, or a chunk in
> > some form), you can read as bytes and do the exact same "decode as
> > UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> > perfectly ideal, but it's about as good as you'll get with a lot of
> > US-based servers. (Depending on context, you might use CP-1252 instead
> > of Latin-1, but you might need errors="replace" there, since
> > Windows-1252 has some undefined byte values.)
>
> A very common error on Windows is that files, and especially web pages, that
> claim to be UTF-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try UTF-8 and, if that fails, fall back to CP-1252.
>
> It's usually the left and right "smart" quote chars that cause the issue, as
> their CP-1252 bytes are invalid UTF-8.
>

Yeah, or sometimes, there isn't *anything* in UTF-8, and it has some
sort of straight-up lie in the form of a meta tag. It's annoying. But
the same logic still applies: attempt one decode (UTF-8) and if it
fails, there's one fallback. Fairly simple.
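
A minimal sketch of that fallback, for illustration only. The function
names decode_with_fallback and decode_lines are made up for this example,
and treating each line as the "unit" is an assumption; use whatever chunk
boundary your data actually has.

```python
def decode_with_fallback(data: bytes, fallback: str = "latin_1") -> str:
    """Decode as UTF-8 if possible, otherwise use the single fallback codec."""
    try:
        return data.decode("utf_8", errors="strict")
    except UnicodeDecodeError:
        # latin_1 maps every byte, so it never fails. cp1252 leaves a few
        # byte values undefined, hence errors="replace" for that case.
        return data.decode(fallback, errors="replace")


def decode_lines(raw: bytes, fallback: str = "latin_1") -> list:
    """Decode each line separately, so a mixed-encoding stream still works."""
    return [decode_with_fallback(line, fallback) for line in raw.splitlines()]
```

Pass fallback="cp1252" for the mislabelled-web-page case: the smart quote
bytes 0x93 and 0x94 fail as UTF-8 and then decode as the curly quotes.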

ChrisA