[Python-ideas] Python 3000 TIOBE -3%

Mon Feb 13 06:04:40 CET 2012

Masklinn writes:

 > Except it's not processed as text, it's processed as "stuff with ascii
 > characters in it". Might just as well be cp-1252, or UTF-8, or Shift JIS
 > (which is kinda-sorta-extended-ascii but not exactly), and while using
 > an ISO-8859 will yield unicode data that's about the only thing you can
 > say about it and the actual result will probably be mojibake either
 > way.

That's the coding pedant's way to look at it.  However, people who
speak only ASCII or Latin 1 are in general not going to see it that
way.

The ASCII speakers are a pretty clear-cut case.  With 'latin-1' as
the codec, almost anything they can do with a 100% ASCII program on
sanely-encoded text (which leaves out Shift JIS, Big 5, and maybe
some obsolete Vietnamese encodings, but not much else AFAIK) will
either pass the non-ASCII bytes through verbatim or delete them.

Latin 1 speakers are harder, because they might do things like convert
accented characters to their base, which would break multibyte
characters in Asian languages.  Still, one suspects that they mostly
won't care terribly much about that (if they did, they'd be interested
in using Unicode properly, and it would be worth investing the small
amount of time required to learn a couple of recipes).
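
To make the failure mode concrete, here is a sketch (again assuming
Python 3) of an accent-stripping step applied to UTF-8 data that was
decoded as 'latin-1'.  The helper strip_accents is hypothetical, and
NFD decomposition is just one common way to write such a step:

```python
import unicodedata


def strip_accents(text: str) -> str:
    # Decompose each character canonically, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


original = "語".encode("utf-8")          # three bytes: E8 AA 9E
as_latin1 = original.decode("latin-1")   # the lead byte now looks like 'è'
damaged = strip_accents(as_latin1).encode("latin-1")

# The lead byte 0xE8 has been turned into ASCII 'e'.
assert damaged == b"e\xaa\x9e"

try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

assert not still_valid                   # the character is destroyed
```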

 > By processing it as bytes, it's made explicit that this is not
 > known and decoded text (which is what unicode strings imply) but
 > that it's some semi-arbitrary ascii-compatible encoding and that's
 > the extent of the developer's knowledge and interest in it.

No, decoding with 'latin-1' is a far better approach for this purpose.
If the name bothers you, give it an alias like
'semi-arbitrary-ascii-compatible'.

The problem is that for many operations, a bytes literal like b'*'
and a str like 'trailing text' are incompatible.  Try concatenating
them, or testing one against the other with .startswith(), or
whatever.  Such str literals are buried in many modules, and you will
lose if you're using bytes because those modules generally assume
you're working with str.
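
For the record, this is what the mismatch looks like in Python 3.  The
variable names are made up for illustration; the exact wording of the
error messages varies between versions, so only the exception type is
recorded here:

```python
marker = b"*"                 # bytes, e.g. read from a file in binary mode
line = "* trailing text"      # str, e.g. a literal inside some module

failures = []

try:
    marker + line             # mixing bytes and str in concatenation
except TypeError:
    failures.append("concatenation")

try:
    line.startswith(marker)   # str method given a bytes argument
except TypeError:
    failures.append("startswith")

assert failures == ["concatenation", "startswith"]

# After decoding with 'latin-1' both sides are str and everything works.
marker_text = marker.decode("latin-1")
assert line.startswith(marker_text)
assert marker_text + " more" == "* more"
```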