[Python-ideas] os.path.isbinary

Thu Aug 1 08:45:32 CEST 2013

Andrew Barnert writes:

 > Plenty of files in popular charsets are actually perfectly valid
 > UTF-8,

For values of "popular charset" in {ASCII}, or of "plenty of files"
with len(file) < 1KB, yes.  Otherwise, see below.
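
A minimal sketch of that point, assuming Python 3; the helper name
is_valid_utf8 and the sample strings are made up for illustration.
Pure-ASCII bytes are also valid UTF-8, but Latin-1 bytes for text with
an accented letter typically are not:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Hypothetical sample data: one pure-ASCII, one Latin-1 with an accent.
ascii_bytes = "plain text".encode("ascii")
latin1_bytes = "café au lait".encode("latin-1")

print(is_valid_utf8(ascii_bytes))
print(is_valid_utf8(latin1_bytes))
```

The second check fails because the Latin-1 byte for "é" (0xE9) looks
like the start of a multi-byte UTF-8 sequence, and the space that
follows is not a valid continuation byte.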

 > but garbage when read that way. This and its converse are

The converse *is* a problem, because the ISO 8859 family (and even
more so the Windows 125x family) assigns a character to nearly every
byte value, so UTF-8 data decodes in those charsets without any error.
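
A rough illustration, assuming Python 3 and its bundled "latin-1" and
"cp1252" codecs; undecodable_bytes is a hypothetical helper, not an
existing API.  It counts the byte values each codec rejects, then reads
UTF-8 bytes as Latin-1:

```python
def undecodable_bytes(codec):
    """Return the single-byte values that `codec` refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad

# Latin-1 defines all 256 byte values; Python's cp1252 rejects only a few.
print(len(undecodable_bytes("latin-1")))
print(len(undecodable_bytes("cp1252")))

# So UTF-8 bytes read as Latin-1 never raise; they just come out wrong.
utf8_bytes = "é".encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

No exception is raised in the last step, so a try/except around the
decode cannot catch this direction of the mistake.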

 > probably the most common cause of mojibake problems people have
 > today.

Actually the most common cause in my experience is Apache or MUA
configuration of a default charset and/or fallback to Latin-1 for
files actually written in UTF-8, combined with conformant browsers and
MUAs that respect transport-level defaults or protocol defaults rather
than try to detect the charset.  Viz:

 > (I don't know if you can search Stack Overflow for problems with
 > "Ã" in the description, but if you can, it'll be illuminating.)

But:

 > … you're probably better off following EAFP and just doing this:
 > 
 >     try:
 >         dotextstuff(b)
 >     except UnicodeDecodeError:
 >         dobinstuff(b)

Yes, indeedy!  Just because encoding-guessing algorithms exist doesn't
mean it's a good idea to use them (outside of some interactive
applications like text editors, where the user can look at the mojibake
and tell the editor either the right encoding or to try another guess).
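
For concreteness, a self-contained sketch of the EAFP pattern quoted
above, assuming UTF-8 is the expected text encoding.  The function
handle and its return values are placeholders standing in for
dotextstuff and dobinstuff:

```python
def handle(data: bytes, encoding: str = "utf-8"):
    """Try the text path first; fall back to the binary path on failure."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        # Placeholder for the binary branch: just report the size.
        return ("binary", len(data))
    # Placeholder for the text branch: any str operation would do here.
    return ("text", text.upper())

print(handle(b"hello"))
print(handle(b"\xff\xfe\x00"))
```

No guessing is involved: the data is either valid in the one encoding
the program expects, or it is treated as binary.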