On Nov 18, 2016, at 12:13 AM, Mark Williams <markrwilliams@gmail.com> wrote:

On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:

This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.

It's not a shipped feature so it can't be a regression. But if the
feature doesn't work it shouldn't be shipped.

"doesn't work" is a pretty black-and-white assessment. Are you anticipating a problem with the way the interface is specified, such that it can't be easily changed?

I should say up front here that I think I was being too emphatic in my support for UTF-8. We absolutely must support the ability to decode other encodings. I don't think that means we need support for access to raw bytes.

I did consult the policy manual before opening the revert PR. Here's what
seemed most relevant:

https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange

This, and the other revert documents, focus on test regressions. But
I opened the PR because of the above link's mention of "undesirable."
Is there a better resource that explains when a revert is appropriate?

Test regressions are listed because they're unambiguously cause for a revert; "undesirable" is intentionally vague because we might decide to revert a thing for no reason. I guess opening a PR for a discussion like this is reasonable.

This could be considered an incompatible interface change; I'm honestly not sure about the exact type signatures of various methods to say whether it is or not.

The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.

IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain
protocol elements (usernames and metadata). But it needs to be
backwards compatible, so it can't mandate it for all messages. And it
is not IRC as specified by RFC1459. So no, no defined encoding.

Not only "no defined encoding" but also no mechanism like HTTP headers to say what the encoding is.

More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans. If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility. You can't write any code to process the data.

I can write code that uses the encoding that makes sense for my use
case. I can't if we mandate utf-8, even when I receive perfectly
valid IRC messages.

Sorry, I haven't been separating out my lines of reasoning clearly enough here.

My points are, separately:

1. IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".

2. UTF-8 is good. There should be gradual social pressure to use UTF-8 everywhere (I'm a fan of http://utf8everywhere.org). This is especially true in protocols like IRC and filenames where there's no mechanism to specify an encoding so that it can be correctly decoded. Therefore:

   2.1. an initial release which features UTF-8 only is fine; therefore there's no need to do a revert.
   2.2. defaulting to UTF-8 is reasonable for the foreseeable future; users should only change this if they know that they want something unusual.

3. IRC is an incompatible and broken wasteland; thanks to your quantitative research we know exactly how broken. Therefore:

   3.1. "support alternate encodings" is a valuable feature. Supporting point 2.1, this feature can be added on at any later point, making a revert of the present implementation unnecessary.
   3.2. We can, and should, just go ahead and add support for alternate (per-server, per-channel, per-user) default and fallback encodings.
   3.3. We should always have a fallback encoding, since blowing up on "invalid" data on a protocol where there's no standard to say what is or isn't valid doesn't seem very helpful.
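
The "always have a fallback encoding" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name decode_with_fallback and its parameters are hypothetical, not part of Twisted's IRC API.

```python
def decode_with_fallback(data, preferred="utf-8", fallback="latin-1"):
    """Decode bytes as `preferred`; if that fails, decode as `fallback`.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(preferred)
    except UnicodeDecodeError:
        # With the default latin-1 fallback this cannot raise, because
        # latin-1 maps every possible byte to a code point.
        return data.decode(fallback)


print(decode_with_fallback(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_with_fallback(b"caf\xe9"))      # not valid UTF-8, decoded as latin-1
```

With latin-1 as the fallback, the result may be mojibake if the sender used some other encoding, but the call never blows up on "invalid" input.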

If chardet is installed, can it be specified as an encoding itself? Like, b"garbage garbage".decode("chardet")? This would make it possible to use without binding to the library; you just specify an encoding. (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)

It does not, but if that makes it more generally usable you've given a
great idea for my next PyPI package :)

Let me know :).
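
For reference, here is a rough sketch of how a guessing strategy could be exposed under an encoding name using the standard library's codecs.register. It deliberately does not use chardet: the codec name "xguess" and the try-UTF-8-then-latin-1 guess are assumptions made up for this example, and a real package would call an actual detector at that point.

```python
import codecs


def _guess_decode(data, errors="strict"):
    # Stand-in for real detection: try UTF-8, then fall back to latin-1.
    data = bytes(data)
    try:
        return data.decode("utf-8"), len(data)
    except UnicodeDecodeError:
        return data.decode("latin-1"), len(data)


def _search(name):
    # Search functions receive the normalized (lowercased) codec name.
    if name == "xguess":
        return codecs.CodecInfo(
            name="xguess",
            encode=codecs.utf_8_encode,  # always encode as UTF-8
            decode=_guess_decode,
        )
    return None


codecs.register(_search)

print(b"caf\xc3\xa9".decode("xguess"))
print(b"caf\xe9".decode("xguess"))
```

Once registered, callers only need to pass the encoding name as a string, so nothing has to import the detection library directly.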

POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.

First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.

This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.

There's POSIX metadata which allows you to select an encoding: locale. But locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.

When I received Arabic PDFs on a FAT16 USB drive with filenames in
CP1256, I had to switch mlterm to that particular code page to read
the directory listings so I could use convmv to convert them to UTF-8.

There is no question that your life has been hard, and that a wide array of people have made bad decisions that contribute to your difficulties. :-)

I'll note that this was impossible to do with a GTK-based tool.

In other words, the thing those pathnames are encoding is text; the way they're being encoded is codepage 1256 on the platter. However, the interface between the OS and the application can still be "text" (i.e. UTF-8) without breaking the on-disk "bytes" (cp1256).
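
As a small illustration of the text-versus-on-disk-bytes distinction, the sketch below uses a made-up Arabic filename (an assumption for the example, not one of the actual files) and shows the same text carried in two different byte encodings.

```python
# Hypothetical filename, chosen only for illustration.
name = "تقرير.pdf"

on_disk = name.encode("cp1256")   # bytes as a cp1256 filesystem would store them
as_utf8 = name.encode("utf-8")    # bytes a UTF-8 interface would expose

# The byte strings differ, but they encode the same text.
print(on_disk != as_utf8)
print(on_disk.decode("cp1256") == as_utf8.decode("utf-8"))

# Converting one to the other is a decode followed by an encode,
# provided the source encoding is known.
converted = on_disk.decode("cp1256").encode("utf-8")
print(converted == as_utf8)
```

The conversion only works because the source encoding is known out of band, which is exactly the piece of metadata the filesystem does not record.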

Similarly, Twisted provides an IRC *library*. It's a Python API, not
irssi or Textual. The ultimate consumer of what passes through it may
be a human, but the next consumer might not be. What if I want to
write a bot that bridges two IRC networks? What if I want to
dump the raw IRC data to a file so I can train a tensorflow version of
chardet? There's nothing in the IRC specification that prevents me
from doing this, but there will be something in Twisted's
implementation that does.

In the current release, yes. But in a future release: no, you can't just bridge arbitrary bytes between two networks and expect them to work. Those networks (or channels, or users) might have different implicit encoding rules; which, by default and only by default, should be utf-8. In a multi-encoding world, you may need to transcode between them to properly bridge; this is a consequence of the fact that eventually you're presenting this data as text to human eyeballs.
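
A bridge in a multi-encoding world might transcode along these lines. This is a sketch under assumptions: the transcode function and the idea that each network has a single known encoding are hypothetical, not an existing Twisted API.

```python
def transcode(line, source_encoding, target_encoding):
    """Re-encode a message from one network's encoding to another's.

    Hypothetical helper; characters the target cannot represent are
    replaced with "?" rather than raising.
    """
    text = line.decode(source_encoding, errors="replace")
    return text.encode(target_encoding, errors="replace")


incoming = "naïve ☃".encode("utf-8")
print(transcode(incoming, "utf-8", "latin-1"))  # b'na\xefve ?'
```

The snowman has no latin-1 representation, so it degrades to "?"; passing the original UTF-8 bytes through untouched would instead show up as mojibake on the latin-1 side.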

While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure. They're a data structure because they must be in an encoding with no NULs, which encodes SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.

We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.
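
The per-component idea can be sketched without Twisted at all. The function below is a hypothetical illustration in plain Python, not FilePath's behaviour: it splits on the 0x2F octet and decodes each component independently, so one bad component does not poison the rest.

```python
def decode_components(path, encoding="utf-8"):
    """Decode each component of a bytes path on its own.

    Hypothetical helper: undecodable components become None instead of
    failing the whole path.
    """
    decoded = []
    for component in path.split(b"/"):
        try:
            decoded.append(component.decode(encoding))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded


print(decode_components(b"\xff/valid"))  # [None, 'valid']
```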

https://twistedmatrix.com/trac/ticket/8908

Thanks for filing that!

To bring all this back to IRC though:

Mis-encoded IRC messages are not data structures; they're just strings. There's no opportunity for partial recovery beyond chardet and mojibake. In most cases, partial recovery requires configuration. Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.

It would also have to be per server, since any two channels might
disagree on the encoding of their topics. And the welcome message
might be in its own encoding. And, and, and...

Right. Per-server default, and then per-channel and per-(privmsg)-user is about as precise as you can get though. In principle, it's possible that different segments of the same topic could be in different encodings, different words in the same sentence! In practice though that just means somebody screwed up and the topic is now unreadable garbage in all clients.
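
One possible shape for that configuration is a lookup that goes from most specific to least specific. Everything here (the EncodingPolicy class, its attribute names, the example server names) is a hypothetical sketch, not a proposal for the actual interface.

```python
class EncodingPolicy:
    """Resolve an encoding: per-target, then per-server, then global default."""

    def __init__(self, default="utf-8"):
        self.default = default
        self.per_server = {}   # server name -> encoding
        self.per_target = {}   # (server name, channel or nick) -> encoding

    def encoding_for(self, server, target=None):
        if (server, target) in self.per_target:
            return self.per_target[(server, target)]
        return self.per_server.get(server, self.default)


policy = EncodingPolicy()
policy.per_server["irc.example.ru"] = "koi8-r"
policy.per_target[("irc.example.ru", "#unicode")] = "utf-8"

print(policy.encoding_for("irc.example.org", "#twisted"))  # utf-8
print(policy.encoding_for("irc.example.ru", "#chat"))      # koi8-r
print(policy.encoding_for("irc.example.ru", "#unicode"))   # utf-8
```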

But none of this is actually true. What seems to be true is that
non-utf-8 encodings are rarely if ever seen on Freenode, and sometimes
to regularly seen on many other IRC servers. These encodings are
certainly used.

I can't really parse you here - are you saying that each network more or less sticks to one encoding?

Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.

Doing this ensures Twisted's IRC implementation will be unable to
communicate with a significant minority of users, and will be a less
useful programming tool.

Sorry, my statement you were responding to here was way too strong. What I meant to say here is that long term there is no way to get a "right" answer in this ecosystem, so "UTF-8 is the only correct answer" is the only direction we can push in to actually make things work reasonably by default an increasing proportion of the time. For the foreseeable future, adding the ability to cope with other encodings (encoding a fallback to latin-1 so that you can at least do demojibakefication manually after copy/pasting) is something a general-purpose IRC library absolutely needs. This is why every client has an "encoding" selection menu, too.
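
To illustrate why a latin-1 fallback still allows manual demojibakefication: latin-1 maps every byte to exactly one code point, so the original bytes can always be recovered from the garbled text. The KOI8-R message below is an assumed example.

```python
# Assumed example: a message actually sent in KOI8-R.
wire_bytes = "привет".encode("koi8-r")

# The library falls back to latin-1 and hands the user mojibake.
mojibake = wire_bytes.decode("latin-1")
print(mojibake != "привет")

# Nothing was lost: re-encoding as latin-1 restores the exact bytes,
# which can then be decoded with the encoding the sender really used.
recovered = mojibake.encode("latin-1").decode("koi8-r")
print(recovered == "привет")
```

The same recovery is not possible after a lossy decode such as errors="replace", which is one argument for latin-1 as the last-resort fallback.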

It makes more sense to have an implementation that parses protocol
elements as bytes and provides a bytes API. It's fine to also provide
a decoded text API, but not to the exclusion of bytes.

This is the point where I think we diverge. I don't think adding a bytes API actually adds any value. Trying to process the contents of IRC as bytes in any way leads to inevitable failures ("line" truncation midway through a UTF-8 multi-byte sequence, for example).
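
The truncation failure is easy to reproduce. In this sketch the byte limit of 5 is arbitrary and truncate_text is a hypothetical helper; the point is only that cutting at the byte level can split a character, while cutting at the text level cannot.

```python
text = "ééé"                      # each "é" is two bytes in UTF-8
data = text.encode("utf-8")       # six bytes in total

# Byte-level truncation cuts the third character in half.
chopped = data[:5]
try:
    chopped.decode("utf-8")
    print("decoded fine")
except UnicodeDecodeError:
    print("truncated mid-character")


def truncate_text(text, max_bytes, encoding="utf-8"):
    """Drop whole characters until the encoded form fits in max_bytes."""
    encoded = text.encode(encoding)
    while len(encoded) > max_bytes:
        text = text[:-1]
        encoded = text.encode(encoding)
    return encoded


print(truncate_text(text, 5).decode("utf-8"))  # éé
```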

So, the thing IRC is transmitting is text. The way it's transmitting it is poorly specified and will need manual configurability hooks to specify encoding information, probably forever, and perhaps even to guess it (although "encoding=chardet" would be nice). I agree that just saying "UTF-8 or GTFO" is not a sustainable approach at all. "UTF-8 or have a bad time with this fiddly customization API and config file" is fine, because anyone wanting something else is probably already having a bad time.

If you are engaging in a real abuse of the IRC protocol and you're treating it as an 8-bit clean stream to send some escaped binary data through (like a video stream, something like that), well, that's what the 'charmap' alias of 'latin-1' is for :-).

So... have I sold you?

-glyph