On Nov 17, 2016, at 6:43 AM, Mark Williams <markrwilliams@gmail.com> wrote:

On Wed, Nov 16, 2016 at 11:22:49PM -0800, Glyph Lefkowitz wrote:
However, is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).

Falling back to latin-1 will address the most obvious issue exposed by
the client in the re-opened ticket. It will not fix the general issue.
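
For concreteness, the fallback under discussion might look roughly like the sketch below. It uses only the standard library; decode_irc_line is a hypothetical helper name, not an existing Twisted API. The third sample shows why this addresses the obvious case but not the general one.

```python
def decode_irc_line(raw: bytes) -> str:
    """Decode as UTF-8, falling back to latin-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this never raises,
        # but the result is only right if the sender really used latin-1.
        return raw.decode("latin-1")


utf8_line = "caf\u00e9".encode("utf-8")
latin1_line = "caf\u00e9".encode("latin-1")
# "privet" in Cyrillic, encoded as windows-1251 (cp1251).
cp1251_line = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")

print(decode_irc_line(utf8_line))    # decoded as UTF-8
print(decode_irc_line(latin1_line))  # decoded via the latin-1 fallback
print(decode_irc_line(cp1251_line))  # decodes without error, but as mojibake
```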

This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.

The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.

More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans. If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility. You can't write any code to process the data.

Note that my sample was heavily biased towards European servers.
Other IRC servers in other regions might prefer a different 8-bit
encoding, like windows-1251 or Big5. And often a single server will
see a long tail (or at least a tail) of different 8-bit encodings.
Listing all channels on a server, as the example script does, cannot
be done with an implementation that decodes input as text prior to
parsing it. It's even possible to use chardet to detect encodings.

If chardet is installed, can it be specified as an encoding itself? Like, b"garbage garbage".decode("chardet")? This would make it possible to use without binding to the library; you just specify an encoding. (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
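
To illustrate what I mean by "specified as an encoding itself": Python's codecs.register lets any code plug a new name into bytes.decode. The sketch below is an assumption about how that could look, not something chardet or Twisted ships. The codec name "ircguess" is made up, and the detection step is just a UTF-8-then-latin-1 guess standing in for whatever detector is actually installed.

```python
import codecs


def _guess_decode(data, errors="strict"):
    # Stand-in detection step: try UTF-8, then fall back to latin-1.
    # A real detector could be substituted here.
    raw = bytes(data)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")
    return text, len(raw)


def _guess_encode(text, errors="strict"):
    # Always emit UTF-8 when encoding.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    # The codec machinery passes a normalized (lowercased) name.
    if name == "ircguess":
        return codecs.CodecInfo(
            encode=_guess_encode, decode=_guess_decode, name="ircguess"
        )
    return None


codecs.register(_search)

print(b"caf\xe9".decode("ircguess"))
```

Callers then only name an encoding; they never import the detection library directly.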

IRC's encoding situation mirrors that of file systems on POSIX. A given
path's components can be in multiple encodings. I believe at least
part of the reason FilePath's paths are bytes, even when
surrogateescape exists, is that Unicode paths on POSIX systems would
make FilePath unusable for perfectly valid use cases. We can pretend
that IRC has a defined encoding, but doing so will make it unusable for
perfectly valid use cases.
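
As a reminder of what surrogateescape does and does not provide, here is a small standard-library sketch (the sample bytes are made up). The original bytes survive a round trip, but the undecodable ones become lone surrogates that cannot be strictly re-encoded.

```python
raw = b"caf\xe9 \xff"  # not valid UTF-8

# Undecodable bytes are smuggled through as lone surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Round-tripping with the same error handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# The escaped characters carry no textual meaning of their own.
escaped = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# A strict encode, as most text-handling code would do, fails.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(round_tripped == raw, len(escaped), strict_ok)
```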

POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.

First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.

This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.

There's POSIX metadata which allows you to select an encoding: locale. But locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-UTF-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.

While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure. They're a data structure because they must be in an encoding with no NULs, which encodes SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.

We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.
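
Independent of Twisted, the per-component idea can be sketched with plain bytes. decode_components is a hypothetical helper, and UTF-8 is assumed as the target encoding; the point is only that one bad component need not poison the others.

```python
def decode_components(path: bytes, encoding: str = "utf-8"):
    """Return (raw, text_or_None) for each component of a POSIX path."""
    result = []
    for part in path.split(b"/"):  # SOLIDUS is always the octet 0x2F
        if not part:
            continue  # skip empty parts from leading or doubled slashes
        try:
            result.append((part, part.decode(encoding)))
        except UnicodeDecodeError:
            # Only this component fails; the raw bytes are kept.
            result.append((part, None))
    return result


parts = decode_components(b"/srv/\xff/valid")
basename = parts[-1][1]
print(parts)
print(basename)
```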

To bring all this back to IRC though:

Mis-encoded IRC messages are not data structures; they're just strings. There's no opportunity for partial recovery beyond chardet and mojibake. In most cases, partial recovery requires configuration. Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.

Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.