tail
MRAB
python at mrabarnett.plus.com
Sat May 7 15:26:01 EDT 2022
On 2022-05-07 19:47, Stefan Ram wrote:
> Marco Sulla <Marco.Sulla.Python at gmail.com> writes:
>>Well, ok, but I need a generic method to get LF and CR for any
>>encoding a user can input.
>
> "LF" and "CR" come from US-ASCII. It is theoretically
> possible that there might be some encodings out there
> (not for Unicode) that are not based on US-ASCII and
> have no LF or no CR.
>
>>is good for any encoding? Furthermore, is there a way to get the
>>encoding of an opened file object?
>
> I have written a function that might be able to detect one
> of a few encodings based on a heuristic algorithm.
>
> def encoding( name ):
>     path = pathlib.Path( name )
>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" )as file:
>                 text = file.read()
>                 return encoding
>         except UnicodeDecodeError:
>             pass
>     return "ascii"
>
> Yes, it's potentially slow and might be wrong.
> The result "ascii" might mean it's a binary file.
>
"latin-1" will decode any sequence of bytes, so the loop will never try
"cp1252", nor fall back to "ascii". Falling back to "ascii" is wrong
anyway, because the file could contain 0x80..0xFF, which aren't supported
by that encoding.
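
To illustrate the point above, here is a minimal sketch of how the
candidates might be reordered so that each one gets a chance. The names
guess_encoding, guess_file_encoding and CANDIDATES are hypothetical, and
the order (strictest codec first, "latin_1" last as the catch-all) is an
assumption rather than anything from the original function. It is still
only a heuristic: a successful decode doesn't prove that the guess is right.

```python
from pathlib import Path

# Assumed order: strictest codec first. latin_1 goes last because it
# accepts every byte value and therefore always succeeds.
CANDIDATES = ("utf_8", "cp1252", "latin_1")


def guess_encoding(data: bytes, candidates=CANDIDATES) -> str:
    """Return the first candidate codec that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue
        return name
    # Reachable only if the caller passes a list without a catch-all codec.
    raise ValueError("no candidate encoding could decode the data")


def guess_file_encoding(path) -> str:
    """Read the file as bytes and guess from its content."""
    return guess_encoding(Path(path).read_bytes())


# latin_1 maps each of the 256 byte values to one code point, so decoding
# can't fail, whereas ascii rejects anything from 0x80 upwards.
all_bytes = bytes(range(256))
assert len(all_bytes.decode("latin_1")) == 256
try:
    b"\x80".decode("ascii")
except UnicodeDecodeError:
    pass
else:
    raise AssertionError("ascii unexpectedly decoded 0x80")
```

With this order, a lone byte 0xE9 is rejected by "utf_8" and accepted by
"cp1252", while 0x81, which Python's "cp1252" codec leaves undefined, falls
through to "latin_1".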