Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

18 Nov 2016

      On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:
...
This doesn't appear to be an answer to the "is it a regression" question though ;-).  I'm still curious what you think there.
It's not a shipped feature so it can't be a regression.  But if the
feature doesn't work it shouldn't be shipped.

I did consult the policy manual before opening the revert PR.  Here's what
seemed most relevant:

https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange

This, and the other revert documents, focus on test regressions.  But
I opened the PR because of the above link's mention of "undesirable."
Is there a better resource that explains when a revert is appropriate?
...
The _general_ issue is unfixable, except to use chardet upon encoding errors.  As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.
IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain
protocol elements (usernames and metadata).  But it needs to be
backwards compatible, so it can't mandate it for all messages.  And it
is not IRC as specified by RFC1459.  So no, no defined encoding.
...
More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans.  If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility.  You can't write any code to process the data.
I can write code that uses the encoding that makes sense for my use
case.  I can't if we mandate utf-8, even when I receive perfectly
valid IRC messages.
...
If chardet is installed, can it be specified as an encoding itself?  Like, b"garbage garbage".decode("chardet")?  This would make it possible to use without binding to the library; you just specify an encoding.  (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
It does not, but if that makes it more generally usable you've given a
great idea for my next PyPI package :)
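
For illustration, here is a minimal sketch of how such a codec could be
registered with the standard codecs module.  The detect_encoding function
below is a hypothetical stand-in that only tries UTF-8 and then falls back
to Latin-1; a real package would presumably delegate to chardet's detector
instead.  The codec name "chardet" is likewise just an assumption taken
from the suggestion above.

```python
import codecs


def detect_encoding(data):
    """Hypothetical stand-in for a real detector such as chardet."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so it never fails.
        return "latin-1"
    return "utf-8"


def _decode(data, errors="strict"):
    data = bytes(data)
    text = data.decode(detect_encoding(data), errors)
    # Codec decoders return (decoded text, number of bytes consumed).
    return text, len(data)


def _encode(text, errors="strict"):
    raise UnicodeError("this sketch of a detection codec is decode-only")


def _search(name):
    # Called by the codecs machinery for names it has not seen yet.
    if name == "chardet":
        return codecs.CodecInfo(name="chardet", encode=_encode, decode=_decode)
    return None


codecs.register(_search)
```

With that registered, b"garbage garbage".decode("chardet") goes through
the detector without the caller importing anything beyond the module that
performs the registration.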
...
POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.
First off, let me put to rest the lie that paths are "really" bytes.  Paths are text.  They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs.  Users must be able to visually identify and select them, as text.
This is significant because certain operations on paths-as-bytes will inevitably fail.  You can't type an invalidly-encoded pathname in your shell.  If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents.  This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.
There's POSIX metadata which allows you to select an encoding; locale.  But, locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path.  The standard metadata is insufficient.  This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶".  As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selec...>.
When I received Arabic PDFs on a FAT16 USB drive with filenames in
CP1256, I had to switch mlterm to that particular code page to read
the directory listings so I could use convmv to convert them to UTF-8.
I'll note that this was impossible to do with a GTK-based tool.

Opinionated software is fine when it operates at the point of user
interpretation.

mlterm had to decode the stuff as unicode so X could display the
graphemes.  But if Linux's FAT16 implementation decided that we should
all quit whining and use UTF-8, even though no other FAT16
implementation requires this, it wouldn't have mattered what mlterm
could or couldn't do and I would have lost those files.  And it would
have been incredibly confounding to me, because everything would have
agreed that I had a FAT16 partition, but only Linux would have
mysteriously failed to read it.

Similarly, Twisted provides an IRC *library*.  It's a Python API, not
irssi or Textual.  The ultimate consumer of what passes through it may
be a human, but the next consumer might not be.  What if I want to
write a bot that bridges two IRC networks?  What if I want to
dump the raw IRC data to a file so I can train a tensorflow version of
chardet?  There's nothing in the IRC specification that prevents me
from doing this, but there will be something in Twisted's
implementation that does.
...
While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure.  They're a data structure because they must be in an encoding with no NULLs, which encode SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.
We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.
https://twistedmatrix.com/trac/ticket/8908
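
To illustrate the per-component idea in the quoted text, here is a
minimal sketch that does not use FilePath at all.  The helper name
decode_components is hypothetical, and it assumes POSIX-style paths where
the separator is the single octet 0x2F.

```python
def decode_components(path, encoding="utf-8"):
    """Decode each segment of a bytes path independently.

    Segments that cannot be decoded are reported as None instead of
    failing the whole path.
    """
    results = []
    for segment in path.split(b"/"):
        try:
            results.append(segment.decode(encoding))
        except UnicodeDecodeError:
            results.append(None)
    return results


# The undecodable parent does not prevent reading the child's name.
components = decode_components(b"\xff/valid")
basename = components[-1]
```

Here only the first segment fails, so code that needs just the basename
can still get text for it.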
...
To bring all this back to IRC though:
Mis-encoded IRC messages are not data structures; they're just strings.  There's no opportunity for partial recovery beyond chardet and mojibake.  In most cases, partial recovery requires configuration.  Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.
It would also have to be per server, since any two channels might
disagree on the encoding of their topics.  And the welcome message
might be in its own encoding. And, and, and...

But none of this is actually true.  What seems to be true is that
non-utf-8 encodings are rarely if ever seen on Freenode, but show up
occasionally to regularly on many other IRC servers.  These encodings
are certainly used.
...
Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.
Doing this ensures Twisted's IRC implementation will be unable to
communicate with a significant minority of users, and will be a less
useful programming tool.

It makes more sense to have an implementation that parses protocol
elements as bytes and provides a bytes API.  It's fine to also provide
a decoded text API, but not to the exclusion of bytes.
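
As a rough sketch of what I mean (this is not Twisted's API; parse_line
and decode_params are hypothetical names, and the parsing is a simplified
reading of the RFC 1459 message format that assumes a well-formed,
non-empty line with the trailing CR-LF already stripped):

```python
def parse_line(line):
    """Split one IRC message into (prefix, command, params), all bytes."""
    prefix = None
    if line.startswith(b":"):
        prefix, _, line = line[1:].partition(b" ")
    # Everything after the first " :" is a single trailing parameter.
    head, sep, trailing = line.partition(b" :")
    params = head.split()
    if sep:
        params.append(trailing)
    command = params.pop(0)
    return prefix, command, params


def decode_params(params, encoding, errors="strict"):
    """Optional text layer: the caller chooses the encoding."""
    return [param.decode(encoding, errors) for param in params]


prefix, command, params = parse_line(
    b":nick!user@host PRIVMSG #chan :caf\xe9 time"
)
text_params = decode_params(params, "latin-1")
```

The protocol elements are found without decoding anything, so a bridge
or a logger can pass the bytes along untouched, and only code that wants
text has to pick an encoding.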

-Mark

Mark Williams