
On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:
This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.
It's not a shipped feature so it can't be a regression. But if the feature doesn't work it shouldn't be shipped. I did consult the policy manual before opening the revert PR. Here's what seemed most relevant: https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange This, and the other revert documents, focus on test regressions. But I opened the PR because of the above link's mention of "undesirable." Is there a better resource that explains when a revert is appropriate?
The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.
IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain protocol elements (usernames and metadata). But it needs to be backwards compatible, so it can't mandate it for all messages. And it is not IRC as specified by RFC1459. So no, no defined encoding.
More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans. If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility. You can't write any code to process the data.
I can write code that uses the encoding that makes sense for my use case. I can't if we mandate utf-8, even when I receive perfectly valid IRC messages.
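To make that concrete, here is a rough sketch of what I mean by a bytes-first parse. It is not Twisted's API; parse_irc_line is a hypothetical helper, and it only assumes the RFC 1459 line layout (optional ":prefix", a command, space-separated parameters, and an optional " :trailing" parameter). Nothing is decoded until the caller chooses an encoding.

```python
def parse_irc_line(line: bytes):
    """Split one raw IRC line into (prefix, command, params), all bytes.

    Hypothetical helper for illustration; no decoding happens here.
    """
    line = line.rstrip(b"\r\n")
    prefix = None
    if line.startswith(b":"):
        # The prefix runs up to the first space.
        prefix, _, line = line[1:].partition(b" ")
    # Everything after the first " :" is a single trailing parameter.
    head, sep, trailing = line.partition(b" :")
    parts = [p for p in head.split(b" ") if p]
    if not parts:
        raise ValueError("IRC line has no command")
    command, params = parts[0], parts[1:]
    if sep:
        params.append(trailing)
    return prefix, command, params


# A message whose text is CP1256, not UTF-8.
text = "\u0645\u0631\u062d\u0628\u0627"
raw = b":nick!user@host PRIVMSG #chan :" + text.encode("cp1256") + b"\r\n"

prefix, command, params = parse_irc_line(raw)
# The caller, not the library, decides how to interpret the payload.
message = params[-1].decode("cp1256")
```

The protocol elements are recoverable without knowing the encoding of the message text, and a caller who knows the channel uses CP1256 can decode the payload itself.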
If chardet is installed, can it be specified as an encoding itself? Like, b"garbage garbage".decode("chardet")? This would make it possible to use without binding to the library; you just specify an encoding. (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
It does not, but if that makes it more generally usable you've given a great idea for my next PyPI package :)
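For what it's worth, the mechanism for such a package would be the standard library's codecs.register. The sketch below is only an illustration under assumptions: the codec name "guess" and the helper names are made up, and instead of calling chardet it tries a fixed list of candidate encodings, so it runs without any third-party dependency. A real package could swap chardet's detect() in for the candidate loop.

```python
import codecs

# Candidate encodings tried in order; an assumption for this sketch.
_CANDIDATES = ("utf-8", "cp1252")


def _guess_decode(data, errors="strict"):
    raw = bytes(data)
    for name in _CANDIDATES:
        try:
            return raw.decode(name), len(raw)
        except UnicodeDecodeError:
            pass
    # latin-1 maps every byte to a code point, so it cannot fail.
    return raw.decode("latin-1"), len(raw)


def _guess_encode(text, errors="strict"):
    # Encoding side is arbitrary here; this sketch always emits UTF-8.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    # Called by the codec machinery with the normalized codec name.
    if name == "guess":
        return codecs.CodecInfo(
            encode=_guess_encode, decode=_guess_decode, name="guess"
        )
    return None


codecs.register(_search)

decoded_utf8 = b"caf\xc3\xa9".decode("guess")
decoded_cp1252 = b"caf\xe9".decode("guess")
```

Once registered, callers only name an encoding and never import the detection library directly, which is the decoupling asked about above.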
POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.
First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.
This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.
There's POSIX metadata which allows you to select an encoding: the locale. But locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.
When I received Arabic PDFs on a FAT16 USB drive with filenames in CP1256, I had to switch mlterm to that particular code page to read the directory listings so I could use convmv to convert them to UTF-8. I'll note that this was impossible to do with a GTK-based tool.
Opinionated software is fine when it operates at the point of user interpretation. mlterm had to decode the stuff as unicode so X could display the graphemes. But if Linux's FAT16 implementation decided that we should all quit whining and use UTF-8, even though no other FAT16 implementation requires this, it wouldn't have mattered what mlterm could or couldn't do and I would have lost those files. And it would have been incredibly confounding to me, because everything would have agreed that I had a FAT16 partition, but only Linux would have mysteriously failed to read it.
Similarly, Twisted provides an IRC *library*. It's a Python API, not irssi or Textual. The ultimate consumer of what passes through it may be a human, but the next consumer might not be. What if I want to write a bot that bridges two IRC networks? What if I want to dump the raw IRC data to a file so I can train a tensorflow version of chardet? There's nothing in the IRC specification that prevents me from doing this, but there will be something in Twisted's implementation that does.
While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure. They're a data structure because they must be in an encoding with no NULLs, which encode SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.
We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.
https://twistedmatrix.com/trac/ticket/8908
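The per-component idea can be sketched without Twisted at all. This is not how FilePath is implemented, and the function names are hypothetical; it only assumes what was stated above, that the separator is always the octet 0x2F, so each component can be decoded (and can fail) independently.

```python
def decode_components(path: bytes, encoding: str = "utf-8"):
    """Decode each path component on its own.

    Components that are not valid in `encoding` become None instead of
    failing the whole path. Hypothetical helper for illustration.
    """
    decoded = []
    for part in path.split(b"/"):
        try:
            decoded.append(part.decode(encoding))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded


def text_basename(path: bytes, encoding: str = "utf-8") -> str:
    """Decode only the final component, ignoring the parents."""
    return path.rsplit(b"/", 1)[-1].decode(encoding)


components = decode_components(b"/srv/\xff/valid")
name = text_basename(b"\xff/valid")
```

Here the mis-encoded parent directory does not prevent reading the basename as text, which is the partial failure described above.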
To bring all this back to IRC though:
Mis-encoded IRC messages are not data structures; they're just strings. There's no opportunity for partial recovery beyond chardet and mojibake. In most cases, partial recovery requires configuration. Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.
It would also have to be per server, since any two channels might disagree on the encoding of their topics. And the welcome message might be in its own encoding. And, and, and... But none of this is actually true. What seems to be true is that non-utf-8 encodings are rarely if ever seen on Freenode, but are seen anywhere from occasionally to regularly on many other IRC servers. These encodings are certainly used.
Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.
Doing this ensures Twisted's IRC implementation will be unable to communicate with a significant minority of users, and will be a less useful programming tool. It makes more sense to have an implementation that parses protocol elements as bytes and provides a bytes API. It's fine to also provide a decoded text API, but not to the exclusion of bytes.
-Mark