[Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement
Hi everyone,
here's a Twisted release to hopefully lift your spirits a little. It's not a big one, but it's got some goodies regardless. It features:
- The ability to use "python -m twisted" to call the new `twist` runner,
- More reliable tests from a more reliable implementation of some things, like IOCP,
- Fixes for async/await & twisted.internet.defer.ensureDeferred, meaning it's getting closer to prime time!
- ECDSA support in Conch & ckeygen (which has also been ported to Python 3),
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,
- Some HTTP/2 server optimisations,
- and a few bugfixes to boot!
You can get the tarball and the NEWS file at https://twistedmatrix.com/Releases/rc/16.6.0rc1/, or you can try it out from PyPI:
    python -m pip install Twisted==16.6.0rc1
Please test it, and let me know how your applications fare, good or bad! If nothing comes up, I will release 16.6.0 next week.
- Amber
On Thu, Nov 10, 2016 at 07:56:52PM +1100, Amber "Hawkie" Brown wrote:
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,
I have opened a PR to revert this: https://github.com/twisted/twisted/pull/593 A full explanation is here: https://twistedmatrix.com/trac/ticket/6320#comment:16 In summary: a valid IRC message will cause a UnicodeDecodeError within the event loop that a user cannot handle or avoid, and all length checks on line sizes are wrong because they occur prior to encoding to utf-8.
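The two problems summarized above can be reproduced without Twisted at all. The following is a minimal sketch in plain Python; the sample message and the variable names are made up for illustration and are not taken from the ticket. It relies on the 512-byte line limit from RFC 1459, which is counted in bytes on the wire, not in characters.

```python
# Hypothetical illustration; this is not code from Twisted or from the ticket.

# A PRIVMSG whose text is "café" encoded as latin-1 (0xE9 is "é" there).
line = b"PRIVMSG #chan :caf\xe9"

try:
    line.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # 0xE9 announces a multi-byte UTF-8 sequence that never arrives, so the
    # decode fails even though the line is a perfectly valid IRC message.
    decode_failed = True

# A length check done on text undercounts what goes on the wire:
# each "é" is one character but two bytes once encoded as UTF-8.
text = "é" * 300
chars = len(text)                    # what a pre-encoding check sees
octets = len(text.encode("utf-8"))   # what is actually transmitted
```

Here `chars` is below 512 while `octets` is above it, so a check performed before encoding would let an over-long line through.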
On Nov 16, 2016, at 11:15 PM, Mark Williams <markrwilliams@gmail.com> wrote:
On Thu, Nov 10, 2016 at 07:56:52PM +1100, Amber "Hawkie" Brown wrote:
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,
I have opened a PR to revert this:
https://github.com/twisted/twisted/pull/593
A full explanation is here:
https://twistedmatrix.com/trac/ticket/6320#comment:16
In summary: a valid IRC message will cause a UnicodeDecodeError within the event loop that a user cannot handle or avoid, and all length checks on line sizes are wrong because they occur prior to encoding to utf-8.
Reverts should be commits that go straight to trunk and reopen tickets, per the current process.
However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).
-glyph
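For reference, the fallback being proposed is small. A minimal sketch follows, assuming a free-standing helper; `decode_with_fallback` is a hypothetical name, not an existing Twisted function.

```python
def decode_with_fallback(data, primary="utf-8", fallback="latin-1"):
    """Decode bytes as `primary`, falling back to `fallback` on error."""
    try:
        return data.decode(primary)
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to a code point, so with the
        # default fallback this second decode cannot raise.
        return data.decode(fallback)
```

With the defaults, `decode_with_fallback(b"caf\xe9")` yields `"café"` instead of raising, while valid UTF-8 input is decoded as before. Text in any other 8-bit encoding still comes out as mojibake, only without an exception.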
On 17 Nov. 2016, at 18:22, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
On Nov 16, 2016, at 11:15 PM, Mark Williams <markrwilliams@gmail.com> wrote:
On Thu, Nov 10, 2016 at 07:56:52PM +1100, Amber "Hawkie" Brown wrote:
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,
I have opened a PR to revert this:
https://github.com/twisted/twisted/pull/593
A full explanation is here:
https://twistedmatrix.com/trac/ticket/6320#comment:16
In summary: a valid IRC message will cause a UnicodeDecodeError within the event loop that a user cannot handle or avoid, and all length checks on line sizes are wrong because they occur prior to encoding to utf-8.
Reverts should be commits that go straight to trunk and reopen tickets, per the current process.
However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).
-glyph
Yeah, this is just a plain old bug. Bugs in new features (where a module being on Python 3 counts as one to me) aren't regressions; we sometimes fix them in pre if there's time/other stuff is getting fixed, but this one will just be a known bug until 16.7 in December. - Amber
On 17 Nov. 2016, at 18:50, Amber Hawkie Brown <hawkowl@atleastfornow.net> wrote:
On 17 Nov. 2016, at 18:22, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
On Nov 16, 2016, at 11:15 PM, Mark Williams <markrwilliams@gmail.com> wrote:
On Thu, Nov 10, 2016 at 07:56:52PM +1100, Amber "Hawkie" Brown wrote:
- Python 3 support for Words' IRC support and twisted.protocols.sip among some smaller modules,
I have opened a PR to revert this:
https://github.com/twisted/twisted/pull/593
A full explanation is here:
https://twistedmatrix.com/trac/ticket/6320#comment:16
In summary: a valid IRC message will cause a UnicodeDecodeError within the event loop that a user cannot handle or avoid, and all length checks on line sizes are wrong because they occur prior to encoding to utf-8.
Reverts should be commits that go straight to trunk and reopen tickets, per the current process.
However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).
-glyph
Yeah, this is just a plain old bug. Bugs in new features (where a module being on Python 3 counts as one to me) aren't regressions; we sometimes fix them in pre if there's time/other stuff is getting fixed, but this one will just be a known bug until 16.7 in December.
- Amber
(or a 16.6.1)
On Wed, Nov 16, 2016 at 11:22:49PM -0800, Glyph Lefkowitz wrote:
However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).
Falling back to latin-1 will address the most obvious issue exposed by the client in the re-opened ticket. It will not fix the general issue.
Note that my sample was heavily biased towards European servers. Other IRC servers in other regions might prefer a different 8-bit encoding, like windows-1251 or Big5. And often a single server will see a long tail (or at least a tail) of different 8-bit encodings. Listing all channels on a server, as the example script does, cannot be done with an implementation that decodes input as text prior to parsing it. It's even possible to use chardet to detect encodings.
IRC's encoding situation mirrors file systems' one on POSIX. A given path's components can be in multiple encodings. I believe at least part of the reason FilePath's paths are bytes, even when surrogateescape exists, is that Unicode paths on POSIX systems would make FilePath unusable for perfectly valid use cases. We can pretend that IRC has a defined encoding, but doing so will make it unusable for perfectly valid use cases.
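To make the channel-listing point concrete, here is a rough sketch of parsing at the bytes level and deferring the decoding decision to the caller. `parse_line` is a hypothetical helper written for this example, not Twisted's parser, and the two LIST replies (numeric 322) are invented sample data.

```python
def parse_line(line):
    """Split one raw IRC line (bytes, CRLF removed) into (prefix, command, params)."""
    prefix = None
    if line.startswith(b":"):
        prefix, _, line = line[1:].partition(b" ")
    # Everything after the first " :" is a single trailing parameter.
    head, sep, trailing = line.partition(b" :")
    parts = head.split()
    if sep:
        parts.append(trailing)
    return prefix, parts[0], parts[1:]


# Two channel-list replies from one server, with topics in different encodings.
replies = [
    b":srv 322 me #moscow 12 :" + "привет".encode("cp1251"),
    b":srv 322 me #paris 7 :" + "café".encode("utf-8"),
]

# Channel names are recovered without ever decoding the topics.
channels = [parse_line(reply)[2][1] for reply in replies]
```

Because the structure is extracted from bytes, a topic that is not valid UTF-8 no longer prevents the rest of the listing from being read; each topic can be decoded later with whatever encoding the caller considers appropriate.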
-glyph
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
On Nov 17, 2016, at 6:43 AM, Mark Williams <markrwilliams@gmail.com> wrote:
On Wed, Nov 16, 2016 at 11:22:49PM -0800, Glyph Lefkowitz wrote:
However; is it really a regression to have py3 support for Words that just doesn't support other encodings yet? It strikes me that this is just a bug, and that we should just fall back from UTF-8 to latin-1 in this scenario. But adding that fallback is a small additional fix (perhaps one that should be slated for 16.6.0 if you want to make it).
Falling back to latin-1 will address the most obvious issue exposed by the client in the re-opened ticket. It will not fix the general issue.
This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.
The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.
More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans. If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility. You can't write any code to process the data.
Note that my sample was heavily biased towards European servers. Other IRC servers in other regions might prefer a different 8-bit encoding, like windows-1251 or Big5. And often a single server will see a long tail (or at least a tail) of different 8-bit encodings. Listing all channels on a server, as the example script does, cannot be done with an implementation that decodes input as text prior to parsing it. It's even possible to use chardet to detect encodings.
If chardet is installed, can it be specified as an encoding itself? Like, b"garbage garbage".decode("chardet")? This would make it possible to use without binding to the library; you just specify an encoding. (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
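Python's codec registry does allow a detection step to be exposed under an encoding name. The sketch below registers a hypothetical codec called "guess" through `codecs.register`; to stay self-contained it tries a fixed list of candidate encodings rather than calling chardet, so the candidate list and the codec name are assumptions made for this example only.

```python
import codecs

# Hypothetical preference order; a real detector would inspect the data.
CANDIDATES = ("utf-8", "cp1251", "latin-1")


def _guess_decode(data, errors="strict"):
    raw = bytes(data)  # bytes.decode() may hand a custom codec a memoryview
    for name in CANDIDATES:
        try:
            return raw.decode(name), len(raw)
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is among the candidates; kept as a safety net.
    return raw.decode("latin-1", errors), len(raw)


def _guess_encode(text, errors="strict"):
    # Encoding has nothing to detect, so this sketch always emits UTF-8.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    if name == "guess":
        return codecs.CodecInfo(_guess_encode, _guess_decode, name="guess")
    return None


codecs.register(_search)
```

After registration, `some_bytes.decode("guess")` works anywhere an encoding name is accepted, which is the decoupling being asked about: the calling code names an encoding and never imports the detection library itself.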
IRC's encoding situation mirrors file systems' one on POSIX. A given path's components can be in multiple encodings. I believe at least part of the reason FilePath's paths are bytes, even when surrogateescape exists, is that Unicode paths on POSIX systems would make FilePath unusable for perfectly valid use cases. We can pretend that IRC has a defined encoding, but doing so will make it unusable for perfectly valid use cases.
Here we go :-).
POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.
First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.
This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.
There's POSIX metadata which allows you to select an encoding; locale. But, locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.
While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure. They're a data structure because they must be in an encoding with no NULLs, which encode SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.
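The per-component failure described above could be sketched as follows. This uses plain bytes rather than FilePath, and `decode_components` is a hypothetical helper invented for illustration.

```python
def decode_components(path, encoding="utf-8"):
    """Decode each component of a bytes path; None marks an undecodable one."""
    decoded = []
    # 0x2F is the separator in any encoding a path may legally use,
    # so splitting on it is safe before any decoding happens.
    for component in path.split(b"/"):
        try:
            decoded.append(component.decode(encoding))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded
```

For example, `decode_components(b"\xff/valid")` gives `[None, "valid"]`: the bad component fails on its own and the readable one survives.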
We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.
To bring all this back to IRC though:
Mis-encoded IRC messages are not data structures; they're just strings. There's no opportunity for partial recovery beyond chardet and mojibake. In most cases, partial recovery requires configuration. Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.
Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.
-glyph
On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:
This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.
It's not a shipped feature so it can't be a regression. But if the feature doesn't work it shouldn't be shipped.
I did consult the policy manual before opening the revert PR. Here's what seemed most relevant:
https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange
This, and the other revert documents, focus on test regressions. But I opened the PR because of the above link's mention of "undesirable." Is there a better resource that explains when a revert is appropriate?
The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.
IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain protocol elements (usernames and metadata). But it needs to be backwards compatible, so it can't mandate it for all messages. And it is not IRC as specified by RFC1459. So no, no defined encoding.
More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans. If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility. You can't write any code to process the data.
I can write code that uses the encoding that makes sense for my use case. I can't if we mandate utf-8, even when I receive perfectly valid IRC messages.
If chardet is installed, can it be specified as an encoding itself? Like, b"garbage garbage".decode("chardet")? This would make it possible to use without binding to the library; you just specify an encoding. (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
It does not, but if that makes it more generally usable you've given a great idea for my next PyPI package :)
POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.
First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.
This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.
There's POSIX metadata which allows you to select an encoding; locale. But, locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.
When I received Arabic PDFs on a FAT16 USB drive with filenames in CP1256, I had to switch mlterm to that particular code page to read the directory listings so I could use convmv to convert them to UTF-8.
I'll note that this was impossible to do with a GTK-based tool.
Opinionated software is fine when it operates at the point of user interpretation.
mlterm had to decode the stuff as unicode so X could display the graphemes. But if Linux's FAT16 implementation decided that we should all quit whining and use UTF-8, even though no other FAT16 implementation requires this, it wouldn't have mattered what mlterm could or couldn't do and I would have lost those files. And it would have been incredibly confounding to me, because everything would have agreed that I had a FAT16 partition, but only Linux would have mysteriously failed to read it.
Similarly, Twisted provides an IRC *library*. It's a Python API, not irssi or Textual. The ultimate consumer of what passes through it may be a human, but the next consumer might not be. What if I want to write a bot that bridges two IRC networks? What if I want to dump the raw IRC data to a file so I can train a tensorflow version of chardet? There's nothing in the IRC specification that prevents me from doing this, but there will be something in Twisted's implementation that does.
While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure. They're a data structure because they must be in an encoding with no NULLs, which encode SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.
We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.
https://twistedmatrix.com/trac/ticket/8908
To bring all this back to IRC though:
Mis-encoded IRC messages are not data structures; they're just strings. There's no opportunity for partial recovery beyond chardet and mojibake. In most cases, partial recovery requires configuration. Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.
It would also have to be per server, since any two channels might disagree on the encoding of their topics. And the welcome message might be in its own encoding. And, and, and... But none of this is actually true. What seems to be true is that non-utf-8 encodings are rarely if ever seen on Freenode, and sometimes to regularly seen on many other IRC servers. These encodings are certainly used.
Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.
Doing this ensures Twisted's IRC implementation will be unable to communicate with a significant minority of users, and will be a less useful programming tool. It makes more sense to have an implementation that parses protocol elements as bytes and provides a bytes API. It's fine to also provide a decoded text API, but not to the exclusion of bytes. -Mark
On Nov 18, 2016, at 12:13 AM, Mark Williams <markrwilliams@gmail.com> wrote:
On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:
This doesn't appear to be an answer to the "is it a regression" question though ;-). I'm still curious what you think there.
It's not a shipped feature so it can't be a regression. But if the feature doesn't work it shouldn't be shipped.
"doesn't work" is a pretty black-and-white assessment. Are you anticipating a problem with the way the interface is specified that it can't be easily changed? I should say up front here that I think I was being too emphatic in my support for UTF-8. We absolutely must support the ability to decode other encodings. I don't think that means we need support for access to raw bytes.
I did consult the policy manual before opening the revert PR. Here's what seemed most relevant:
https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange
This, and the other revert documents, focus on test regressions. But I opened the PR because of the above link's mention of "undesirable." Is there a better resource that explains when a revert is appropriate?
Test regressions are listed because they're unambiguously cause for a revert; "undesirable" is intentionally vague because we might decide to revert a thing for no reason. I guess opening a PR for a discussion like this is reasonable. This could be considered an incompatible interface change; I'm honestly not sure about the exact type signatures of various methods to say whether it is or not.
The _general_ issue is unfixable, except to use chardet upon encoding errors. As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.
IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain protocol elements (usernames and metadata). But it needs to be backwards compatible, so it can't mandate it for all messages. And it is not IRC as specified by RFC1459. So no, no defined encoding.
Not only "no defined encoding" but also no mechanism like HTTP headers to say what the encoding is.
More importantly, IRC doesn't specify an encoding and it is also responsible for transmitting textual data intended to be input and consumed by humans. If you can't decode it, faithfully replicating the on-the-wire encoding is of limited utility. You can't write any code to process the data.
I can write code that uses the encoding that makes sense for my use case. I can't if we mandate utf-8, even when I receive perfectly valid IRC messages.
Sorry, I haven't been separating out my lines of reasoning clearly enough here. My points are, separately:
1. IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".
2. UTF-8 is good. There should be gradual social pressure to use UTF-8 everywhere (I'm a fan of http://utf8everywhere.org/). This is especially true in protocols like IRC and filenames where there's no mechanism to specify an encoding so that it can be correctly decoded. Therefore:
   1. an initial release which features UTF-8 only is fine; therefore there's no need to do a revert.
   2. defaulting to UTF-8 is reasonable for the foreseeable future; users should only change this if they know that they want something unusual.
3. IRC is an incompatible and broken wasteland; thanks to your quantitative research we know exactly how broken. Therefore:
   1. "support alternate encodings" is a valuable feature. Supporting point 2.1, this feature can be added on at any later point, making a revert of the present implementation unnecessary.
   2. We can, and should, just go ahead and add support for alternate (per-server, per-channel, per-user) default and fallback encodings.
   3. We should always have a fallback encoding, since blowing up on "invalid" data on a protocol where there's no standard to say what is or isn't valid doesn't seem very helpful.
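One possible shape for the default-plus-fallback configuration in the list above is sketched here. `EncodingPolicy` and its attribute names are hypothetical and do not correspond to anything in Twisted; the precedence of per-user over per-channel over the default is an assumption chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EncodingPolicy:
    """Hypothetical per-connection (i.e. per-server) encoding configuration."""
    default: str = "utf-8"
    fallback: str = "latin-1"
    per_channel: dict = field(default_factory=dict)
    per_user: dict = field(default_factory=dict)

    def encoding_for(self, channel=None, user=None):
        # Most specific setting wins: user, then channel, then the default.
        if user in self.per_user:
            return self.per_user[user]
        if channel in self.per_channel:
            return self.per_channel[channel]
        return self.default

    def decode(self, data, channel=None, user=None):
        try:
            return data.decode(self.encoding_for(channel, user))
        except UnicodeDecodeError:
            # With the latin-1 default this cannot raise; a different
            # fallback could, and would need its own handling.
            return data.decode(self.fallback)


policy = EncodingPolicy(per_channel={"#moscow": "cp1251"})
greeting = policy.decode("привет".encode("cp1251"), channel="#moscow")
```

A message from an unconfigured channel that is not valid UTF-8 falls through to latin-1 and produces mojibake rather than an exception, which matches the "always have a fallback" point.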
If chardet is installed, can it be specified as an encoding itself? Like, b"garbage garbage".decode("chardet")? This would make it possible to use without binding to the library; you just specify an encoding. (The library is LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
It does not, but if that makes it more generally usable you've given a great idea for my next PyPI package :)
Let me know :).
POSIX has an internally inconsistent model of how encodings work; they cannot possibly function correctly.
First off, let me put to rest the lie that paths are "really" bytes. Paths are text. They must be text because they have to transit through text-processing systems, such as windowing systems and terminal programs. Users must be able to visually identify and select them, as text.
This is significant because certain operations on paths-as-bytes will inevitably fail. You can't type an invalidly-encoded pathname in your shell. If two paths differ by an incorrectly-encoded character you won't be able to visually distinguish between them without inspecting their contents. This is why OS X forces all paths to be UTF-8, and why paths are "really" unicode (UCS-2, precisely) on Windows.
There's POSIX metadata which allows you to select an encoding; locale. But, locale is per-process state, and, due to the fact that you can have multiple filesystems mounted simultaneously, it's impossible for this metadata to fully describe the state of any arbitrary path. The standard metadata is insufficient. This is why UI toolkits like GTK+ have adopted the policy of "ignore the locale, paths are UTF-8, deal with it 🕶". As far back as GTK2, non-utf-8 path selection has been deprecated: <https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.
When I received Arabic PDFs on a FAT16 USB drive with filenames in CP1256, I had to switch mlterm to that particular code page to read the directory listings so I could use convmv to convert them to UTF-8.
There is no question that your life has been hard, and that a wide array of people have made bad decisions that contribute to your difficulties. :-)
I'll note that this was impossible to do with a GTK-based tool.
Opinionated software is fine when it operates at the point of user interpretation.
mlterm had to decode the stuff as unicode so X could display the graphemes. But if Linux's FAT16 implementation decided that we should all quit whining and use UTF-8, even though no other FAT16 implementation requires this, it wouldn't have mattered what mlterm could or couldn't do and I would have lost those files. And it would have been incredibly confounding to me, because everything would have agreed that I had a FAT16 partition, but only Linux would have mysteriously failed to read it.
But, Linux's FAT16 driver has decided that. The correct way to solve your problem with current Linux (I don't know if this was possible at the time) would be to address it with mount, not special user-space software. Specifically, I think it would be something like:
    mount -t fat -o fat=16,iocharset=utf-8,codepage=1256 /dev/disk/by-label/arabic.msdos /media/arabic.msdos
Now all your GTK+ software works, too, because you're not trying to reconcile your legacy format support at the application level.
In other words, the thing those pathnames are encoding is text; the way they're being encoded is codepage 1256 on the platter. However, the interface between the OS and the application can still be "text" (i.e. UTF-8) without breaking the on-disk "bytes" (cp1256).
Similarly, Twisted provides an IRC *library*. It's a Python API, not irssi or Textual. The ultimate consumer of what passes through it may be a human, but the next consumer might not be. What if I want to write a bot that bridges two IRC networks? What if I want to dump the raw IRC data to a file so I can train a tensorflow version of chardet? There's nothing in the IRC specification that prevents me from doing this, but there will be something in Twisted's implementation that does.
In the current release, yes. But in a future release: no, you can't just bridge arbitrary bytes between two networks and expect them to work. Those networks (or channels, or users) might have different implicit encoding rules; which, by default and only by default, should be utf-8. In a multi-encoding world, you may need to transcode between them to properly bridge; this is a consequence of the fact that eventually you're presenting this data as text to human eyeballs.
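The transcoding step for a bridge is a decode followed by an encode. A minimal sketch, assuming both networks' encodings are already known from configuration; `transcode` is a hypothetical helper, and replacing unencodable characters is one arbitrary policy among several.

```python
def transcode(payload, source, target, errors="replace"):
    """Re-encode message text from one network's encoding to another's."""
    # Decoding is strict: if `source` is wrong, the caller should find out.
    # Encoding is lossy by default: characters the target cannot represent
    # are replaced with "?" rather than raising.
    return payload.decode(source).encode(target, errors)
```

Going from cp1251 to UTF-8 is lossless, while going from UTF-8 to a single-byte encoding may not be, which is one reason a bridge cannot simply pass bytes through unchanged.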
While a mis-encoded path is a failure, there are ways to treat paths as a data structure to allow for only partial failure. They're a data structure because they must be in an encoding with no NULLs, which encode SOLIDUS as the octet 0x2F, and so you can fail on each individual path component; if you're lucky you don't need to present all the components in the path to manipulate it.
We don't do this in Twisted right now (as I was somewhat disappointed to discover while writing this), but we should, and more importantly we could; FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" rather than returning an encoding error.
https://twistedmatrix.com/trac/ticket/8908
Thanks for filing that!
To bring all this back to IRC though:
Mis-encoded IRC messages are not data structures; they're just strings. There's no opportunity for partial recovery beyond chardet and mojibake. In most cases, partial recovery requires configuration. Per-channel encodings, for example, or per-user, which have to be agreed upon out of band, in ways that IRC does not expose as metadata.
It would also have to be per server, since any two channels might disagree on the encoding of their topics. And the welcome message might be in its own encoding. And, and, and...
Right. Per-server default, and then per-channel and per-(privmsg)-user is about as precise as you can get though. In principle, it's possible that different segments of the same topic could be in different encodings, different words in the same sentence! In practice though that just means somebody screwed up and the topic is now unreadable garbage in all clients.
But none of this is actually true. What seems to be true is that non-utf-8 encodings are rarely if ever seen on Freenode, and sometimes to regularly seen on many other IRC servers. These encodings are certainly used.
I can't really parse you here - are you saying that each network more or less sticks to one encoding?
Given this situation, the only reasonable way forward as a community is to tell users that using anything other than UTF-8 is a misconfiguration and we need to be getting all those out-of-band agreements to switch to it.
Doing this ensures Twisted's IRC implementation will be unable to communicate with a significant minority of users, and will be a less useful programming tool.
Sorry, my statement you were responding to here was way too strong. What I meant to say here is that long term there is no way to get a "right" answer in this ecosystem, so "UTF-8 is the only correct answer" is the only direction we can push in to actually make things work reasonably by default an increasing proportion of the time. For the foreseeable future, adding the ability to cope with other encodings (encoding a fallback to latin-1 so that you can at least do demojibakefication manually after copy/pasting) is something a general-purpose IRC library absolutely needs. This is why every client has an "encoding" selection menu, too.
It makes more sense to have an implementation that parses protocol elements as bytes and provides a bytes API. It's fine to also provide a decoded text API, but not to the exclusion of bytes.
This is the point where I think we diverge. I don't think adding a bytes API actually adds any value. Trying to process the contents of IRC as bytes in any way leads to inevitable failures ("line" truncation midway through a UTF-8 escape sequence for example).
So, the thing IRC is transmitting is text. The way it's transmitting it is poorly specified and will need manual configurability hooks to specify encoding information, probably forever, and perhaps even to guess it (although "encoding=chardet" would be nice).
I agree that just saying "UTF-8 or GTFO" is not a sustainable approach at all. "UTF-8 or have a bad time with this fiddly customization API and config file" is fine, because anyone wanting something else is probably already having a bad time.
If you are engaging in a real abuse of the IRC protocol and you're treating it as an 8-bit clean stream to send some escaped binary data through (like a video stream, something like that), well, that's what the 'charmap' alias of 'latin-1' is for :-).
So... have I sold you?
-glyph
On Fri, Nov 18, 2016 at 05:36:16PM -0800, Glyph Lefkowitz wrote:
"doesn't work" is a pretty black-and-white assessment. Are you anticipating a problem with the way the interface is specified that it can't be easily changed?
Yes. Here's the lede: IRCClient should deal in bytes and we should introduce a ProtocolWrapper-like thing that encodes and decodes command prefixes and parameters. It should implement an interface, and we can start with an implementation that only knows about UTF-8. The obvious advantage of this is that you can more easily write IRCClients that work on both Python 2 and 3. I'll attempt to explain others in the rest of this email.
I should say up front here that I think I was being too emphatic in my support for UTF-8.
Phew!
Test regressions are listed because they're unambiguously cause for a revert; "undesirable" is intentionally vague because we might decide to revert a thing for no reason. I guess opening a PR for a discussion like this is reasonable.
Good to know!
This could be considered an incompatible interface change; I'm honestly not sure about the exact type signatures of various methods to say whether it is or not.
I'm also not entirely sure of the consequences of this interface change. I think it deserves more thought before it becomes an API that we have to support. This is the primary reason I opened the revert PR. I'm more precisely worried about the fact that the implementation raises a decoding exception that cannot be handled in user code when it receives non-UTF-8 messages, and the fact that the line length checks occur prior to encoding, ensuring mid-codepoint truncation. These issues also contributed to my revert.
My points are, separately:
IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".
It's nonsensical that it be finally presented to a human as raw bytes. I'm advocating for the decision to be made as late as possible. That doesn't mean we can't provide an easy-to-use recoding client that we encourage people to turn to first.
2. UTF-8 is good. There should be gradual social pressure to use UTF-8 everywhere (I'm a fan of http://utf8everywhere.org). This is especially true in protocols like IRC and filenames where there's no mechanism to specify an encoding so that it can be correctly decoded. Therefore:
   2.1. an initial release which features UTF-8 only is fine; therefore there's no need to do a revert.
   2.2. defaulting to UTF-8 is reasonable for the foreseeable future; users should only change this if they know that they want something unusual.
3. IRC is an incompatible and broken wasteland; thanks to your quantitative research we know exactly how broken. Therefore:
   3.1. "support alternate encodings" is a valuable feature.
   3.2. Supporting point 2.1, this feature can be added on at any later point, making a revert of the present implementation unnecessary.
   3.3. We can, and should, just go ahead and add support for alternate (per-server, per-channel, per-user) default and fallback encodings.
   3.4. We should always have a fallback encoding, since blowing up on "invalid" data on a protocol where there's no standard to say what is or isn't valid doesn't seem very helpful.
I appreciate the consistency of this, and agree the documented preference should be a client implementation that assumes UTF-8. But we can't have *a* fallback encoding. My encoding detector program indicates that latin-1 is the second most popular encoding for European IRC servers, but Russian servers I sampled (not in netsplit.de's top 10) used a variety of Cyrillic encodings. I also want to enable arbitrary recovery strategies for bad encodings. For instance, in the case that an IRC client or server truncates a code point at a line boundary, it might be the right idea to binary search until the invalid byte sequence is found, and then exclude it. It might be the right idea to buffer the message for a time in the hopes that the codepoint got split over two lines. And what if somebody wants to run another encoding survey? I don't expect most users to do any of that, but *I* certainly want to without having to copy and paste a bunch of code.
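As a rough illustration of the "buffer it in the hope that the code point was split over two lines" strategy, here is a minimal sketch using Python 3's incremental decoders. The function name decode_across_lines is hypothetical, and it assumes the caller has already stripped the IRC framing so that only the payload fragments are passed in; it is not how IRCClient works today.

```python
import codecs


def decode_across_lines(raw_lines, encoding="utf-8"):
    """Decode payload fragments, carrying a dangling partial code point
    from the end of one fragment over to the start of the next."""
    # An incremental decoder keeps incomplete trailing bytes in its
    # internal buffer instead of raising immediately.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    pieces = [decoder.decode(raw) for raw in raw_lines]
    # final=True flushes the buffer; a code point that never completed
    # becomes U+FFFD because of errors="replace".
    tail = decoder.decode(b"", final=True)
    if tail:
        pieces.append(tail)
    return pieces
```

With a bytes-level hook like this, a client could choose between waiting for the rest of the sequence, dropping it, or logging it; a text-only callback has already lost that choice.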
When I received Arabic PDFs on a FAT16 USB drive with filenames in CP1256, I had to switch mlterm to that particular code page to read the directory listings so I could use convmv to convert them to UTF-8.
There is no question that your life has been hard, and that a wide array of people have made bad decisions that contribute to your difficulties. :-)
My real point was that dealing with bad encodings is not theoretical. Nobody knew the encoding, by the way; they just knew the USB drive worked for some of them and not others, and were resorting to printing things out or taking screen shots. That's the situation opinionated software with monolithic abstractions creates. People *will* find workarounds that are terrible for a bunch of reasons. I can vouch for the utility of tools that decide on encodings as late as possible. Note that I'm not asking that we be everything to all people, but rather that we allow people the option of dealing with the IRC encoding disaster the way they see fit.
But, Linux's FAT16 driver has decided that.
The correct way to solve your problem with current Linux (I don't know if this was possible at the time) would be to address it with mount, not special user-space software. Specifically, I think it would be something like:
mount -t fat -o fat=16,iocharset=utf-8,codepage=1256 /dev/disk/by-label/arabic.msdos /media/arabic.msdos
Now all your GTK+ software works, too, because you're not trying to reconcile your legacy format support at the application level.
I don't remember either. But, now the driver *allows* me to do that without requiring it, and also allows me to mount the file system so that the paths are exposed as bytes. Since nobody knew the encoding, that was essential to letting me use mlterm to determine it. Nowadays I'd probably use chardet but would still need the raw bytes. And as far as I know code point sequence truncation can also occur on FAT16/32 partitions. In the event of such truncation the automatic decoding would only prevent me from mounting the partition. I'm thankful that the implementation allows me to choose a recovery strategy in a very real edge case. If it didn't, I'd have to look up the file system's on disk format and reimplement 99% of a FAT16 driver to get at the data. So it's the case that raw bytes weren't useful to me when I tried to actually read the paths, but they were super useful to me when a perfectly reasonable assumption was wrong. And when no encoding is mandated, perfectly reasonable assumptions do fail and fail often.
What if I want to write a bot that bridges two IRC networks?
In the current release, yes. But in a future release: no, you can't just bridge arbitrary bytes between two networks and expect them to work. Those networks (or channels, or users) might have different implicit encoding rules; which, by default and only by default, should be utf-8. In a multi-encoding world, you may need to transcode between them to properly bridge; this is a consequence of the fact that eventually you're presenting this data as text to human eyeballs.
It's true that if one channel is latin-1 and the other is MacCyrillic that a text-only IRCClient implementation could handle this just by allowing the user to choose an encoding. The recoding API I'm talking about wouldn't give you anything. But it would help with truncation issues and channels' topics using different encodings.
But none of this is actually true. What seems to be true is that non-utf-8 encodings are rarely if ever seen on Freenode, and sometimes to regularly seen on many other IRC servers. These encodings are certainly used.
I can't really parse you here - are you saying that each network more or less sticks to one encoding?
Not quite - I meant that in my survey, I saw no latin-1 on Freenode, but that may be because they decided I was abusing the network early on in my attempt to list and join channels. But on other networks I saw a lot of different encodings, used across different channels, so that the channel list contained topics encoded in many different 8-bit encodings.
Sorry, my statement you were responding to here was way too strong. What I meant to say here is that long term there is no way to get a "right" answer in this ecosystem, so "UTF-8 is the only correct answer" is the only direction we can push in to actually make things work reasonably by default an increasing proportion of the time. For the foreseeable future, adding the ability to cope with other encodings (encoding a fallback to latin-1 so that you can at least do demojibakefication manually after copy/pasting) is something a general-purpose IRC library absolutely needs. This is why every client has an "encoding" selection menu, too.
For what it's worth, I want to make it easy to use UTF-8. I just don't want to make it hard to use an encoding that's *not* UTF-8.
It makes more sense to have an implementation that parses protocol elements as bytes and provides a bytes API. It's fine to also provide a decoded text API, but not to the exclusion of bytes.
This is the point where I think we diverge. I don't think adding a bytes API actually adds any value. Trying to process the contents of IRC as bytes in any way leads to inevitable failures ("line" truncation midway through a UTF-8 escape sequence for example).
This is precisely where we disagree. As I described above, I can think of a couple ways to handle mid-codepoint truncation. A Twisted-based IRC client should have the option to implement its own. The end result would still be text (or at least an informative log message.) I think the best way to handle this is to have a bytes-only IRC client that can then be wrapped with something that decodes prefixes and parameters. We can provide a UTF-8 recoder that people are encouraged to use, and an interface that allows implementers to choose their own encoding strategy. I don't think it can be a ProtocolWrapper, because it'll need to know about the particulars of IRCClient. That means I don't have a clear idea of the interface yet. Until I do, I'd prefer we ship something that implements the RFC and allows people to handle encoding the way they see fit. I will say I'm happy to take a stab at a recoder. But it can't be written with IRCClient as it stands now and would certainly be done in a separate PR. Shipping what we have now will mean we're putting bugs out there (see the line length issues called out in the ticket) and an interface I think we haven't thought through, but that certainly limits what IRC protocol messages you can receive. (Also - I don't think any multibyte UTF-8 sequence can contain a byte <= 127, so that it can't be truncated by ASCII-only code. This of course isn't true for fixed-width encodings. '\n\n' is a totally valid UTF-16 sequence.)
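The parenthetical about UTF-8 byte values can be checked with a few lines of Python 3. This is only a spot check on sample strings, not a proof, and the helper names are made up for illustration.

```python
def multibyte_bytes_are_all_high(text):
    """True if every byte of every multibyte UTF-8 character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and min(encoded) < 0x80:
            return False
    return True


def truncation_breaks_decoding(text, keep_bytes):
    """True if cutting the UTF-8 encoding at keep_bytes leaves bytes
    that no longer decode, i.e. the cut fell inside a code point."""
    try:
        text.encode("utf-8")[:keep_bytes].decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def newlines_as_utf16():
    """Two ASCII newlines read as one 16-bit code unit (little-endian)."""
    return b"\n\n".decode("utf-16-le")
```

So code that splits on ASCII delimiters such as CR, LF or space cannot land inside a UTF-8 sequence, while a length limit counted in bytes can.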
So, the thing IRC is transmitting is text. The way it's transmitting it is poorly specified and will need manual configurability hooks to specify encoding information, probably forever, and perhaps even to guess it (although "encoding=chardet" would be nice). I agree that just saying "UTF-8 or GTFO" is not a sustainable approach at all. "UTF-8 or have a bad time with this fiddly customization API and config file" is fine, because anyone wanting something else is probably already having a bad time.
If you are engaging in a real abuse of the IRC protocol and you're treating it as an 8-bit clean stream to send some escaped binary data through (like a video stream, something like that), well, that's what the 'charmap' alias of 'latin-1' is for :-).
I guess charmap could be used to implement the recovery scheme I keep talking about, but then we'd be telling people to work out the recoding interaction between IRCClient and their own implementation. I'd like to provide a defined way of doing so eventually.
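For reference, the latin-1 trick relies on latin-1 mapping each of the 256 byte values to exactly one code point, so the round trip is lossless. A minimal Python 3 sketch follows; the function names are hypothetical and nothing here is part of IRCClient.

```python
def bytes_as_text(raw):
    """Carry arbitrary bytes through a text-only API without loss."""
    return raw.decode("latin-1")


def redecode(text, real_encoding):
    """Recover the original bytes from latin-1 'text' and decode them
    with the encoding the sender actually used."""
    return text.encode("latin-1").decode(real_encoding)
```

This is the interaction that callers would have to work out for themselves: the text they receive is mojibake until they re-decode it.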
So... have I sold you?
On default UTF-8? Absolutely! But I don't know exactly the way to do it, so I'd rather provide a Python 3 port that actually implements the protocol, and then work out a nice recoding API. Thanks for taking the time to talk through this. I appreciate it! -Mark
On Nov 20, 2016, at 19:35, Mark Williams <markrwilliams@gmail.com> wrote:
On Fri, Nov 18, 2016 at 05:36:16PM -0800, Glyph Lefkowitz wrote:
"doesn't work" is a pretty black-and-white assessment. Are you anticipating a problem with the way the interface is specified that it can't be easily changed?
Yes. Here's the lede:
Thank you for summarizing! Point by point, here's my position:
IRCClient should deal in bytes and we should introduce a ProtocolWrapper-like thing that encodes and decodes command prefixes and parameters.
I disagree. Any user-facing API should deal in unicode objects. (There is one caveat here; there really should be a separate layer for dealing with text; IRCClient being a subclassing-based API pollutes the whole issue. But that API shouldn't be public, so this is largely minutiae; the "right" answer here has nothing to do with bytes or text and everything to do with adopting .)
It should implement an interface, and we can start with an implementation that only knows about UTF-8.
We should have the implementation initially know about UTF-8, yes.
The obvious advantage of this is that you can more easily write IRCClients that work on both Python 2 and 3.
This is the part that I'm worried about. It kinda seems like we're moving toward "native string" being the type used in IRCClient, and that is capital-W Wrong. Native strings are for Python-native types only, i.e. docstrings and method names.
I'm also not entirely sure of the consequences of this interface change. I think it deserves more thought before it becomes an API that we have to support. This is the primary reason I opened the revert PR.
One of the things that's informing my decision is that IRCClient is already an incredibly ill-defined API that probably needs to be deprecated and overhauled at some point. However, in the intervening (what will almost certainly be a) decade, I'd like it to work on Python 3.
I'm more precisely worried about the fact that the implementation raises a decoding exception that cannot be handled in user code when it receives non-UTF-8 messages,
The right way to deal with this is twofold:
- Add the ability to specify both the "encoding" and the "errors" of the relevant codec <https://docs.python.org/2.7/library/codecs.html#codecs.decode>, so that we can choose error handling strategies.
- (potentially, if you have very nuanced requirements for dealing with weird encodings) write a codec that logs and handles its own errors.
(We probably shouldn't be logging a traceback for encoding problems regardless, if it's UnicodeDecodeError. But that's something that can easily be fixed in subsequent releases as well)
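To make the "encoding" plus "errors" idea concrete, here is a small Python 3 sketch. The handler name "latin1-fallback" and the function decode_line are invented for this example; only codecs.register_error and the errors argument of bytes.decode come from the standard library.

```python
import codecs


def latin1_fallback(exc):
    """Error handler: decode just the offending bytes as latin-1."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position to resume decoding at.
    return bad.decode("latin-1"), exc.end


# Register under a hypothetical name so it can be passed as errors=...
codecs.register_error("latin1-fallback", latin1_fallback)


def decode_line(raw, encoding="utf-8", errors="latin1-fallback"):
    """Decode one received line with a caller-chosen error strategy."""
    return raw.decode(encoding, errors)
```

A caller who prefers the built-in behaviour can pass errors="replace" or errors="strict" instead; the point is only that the choice is made by the application.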
and the fact that the line length checks occur prior to encoding, ensuring mid-codepoint truncation. These issues also contributed to my revert.
Line length checks are a super interesting example because I think they also illustrate my concerns as well. To properly do message-splitting (which is why we're checking line length), you have to:
- check the length in octets (because it's actually a message-length limit in octets, not a line-length limit in characters)
- split the textual representation - ideally somewhere relevant like a word break, which you can only detect in text!
- try encoding again and ensure that the encoded representation is the correct length, repeating if necessary.
This is an implementation-level bug though, not an interface-level one, so I'm also comfortable fixing this bug in the future.
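The three steps above can be sketched roughly as follows in Python 3. This is an illustration only, not the code in twisted.words: split_message and max_bytes are hypothetical names, and the sketch splits between code points, not grapheme clusters.

```python
def split_message(text, max_bytes, encoding="utf-8"):
    """Split text into chunks whose *encoded* size fits in max_bytes,
    preferring to break at a space and never inside a character."""
    chunks = []
    remaining = text
    while remaining:
        # Step 1: measure the length in octets, not characters.
        if len(remaining.encode(encoding)) <= max_bytes:
            chunks.append(remaining)
            break
        # Step 3 (repeated): shrink the prefix until its encoding fits.
        cut = len(remaining)
        while cut > 0 and len(remaining[:cut].encode(encoding)) > max_bytes:
            cut -= 1
        if cut == 0:
            raise ValueError("max_bytes is too small for one character")
        # Step 2: split the text at a word break when one is available.
        space = remaining.rfind(" ", 0, cut + 1)
        if space > 0:
            chunks.append(remaining[:space])
            remaining = remaining[space + 1:]
        else:
            chunks.append(remaining[:cut])
            remaining = remaining[cut:]
    return chunks
```

In a real client max_bytes would have to account for the command, target and prefix that share the same message-length budget.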
My points are, separately:
IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".
It's nonsensical that it be finally presented to a human as raw bytes. I'm advocating for the decision to be made as late as possible. That doesn't mean we can't provide an easy-to-use recoding client that we encourage people to turn to first.
You can't process it as bytes either, though. In some cases you think you can, but then you get mid-codepoint truncation :-).
But we can't have *a* fallback encoding. My encoding detector program indicates that latin-1 is the second most popular encoding for European IRC servers, but Russian servers I sampled (not in netsplit.de's top 10) used a variety of Cyrillic encodings.
If you really want to do something this sophisticated (and, I should note: no other IRC clients or bots I'm aware of do, so I think you've got an unrealistically tight set of requirements) then you can just write your own single codec that composes a bunch of others, and install it. Python's encoding system is extensible for exactly this reason :).
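A sketch of what such a composed codec might look like in Python 3, using codecs.register. The codec name "ircfallback" and the candidate list are arbitrary choices for illustration; a real one would pick candidates based on the networks involved.

```python
import codecs

# Hypothetical candidate list, tried in order; latin-1 is the last
# resort because it can decode any byte sequence.
_CANDIDATES = ("utf-8", "cp1251")


def _decode(data, errors="strict"):
    data = bytes(data)  # bytes.decode() may hand us a memoryview
    for encoding in _CANDIDATES:
        try:
            return data.decode(encoding), len(data)
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), len(data)


def _encode(text, errors="strict"):
    # Always send UTF-8, whatever we had to accept on input.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    if name == "ircfallback":
        return codecs.CodecInfo(name="ircfallback",
                                encode=_encode, decode=_decode)
    return None


codecs.register(_search)
```

Once registered, the name can be used anywhere an encoding name is accepted, for example raw.decode("ircfallback").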
I also want to enable arbitrary recovery strategies for bad encodings.
This is totally not an IRC-specific thing though :-).
For instance, in the case that an IRC client or server truncates a code point at a line boundary, it might be the right idea to binary search until the invalid byte sequence is found, and then exclude it. It might be the right idea to buffer the message for a time in the hopes that the codepoint got split over two lines.
And what if somebody wants to run another encoding survey?
Decode as charmap, which is what we call latin-1 when we want to do this :). That's a super edge-case, and should not be easy by default.
I don't expect most users to do any of that, but *I* certainly want to without having to copy and paste a bunch of code.
You can totally do all of these things once we can specify an encoding.
<arabic USB drive>
My real point was that dealing with bad encodings is not theoretical. Nobody knew the encoding, by the way; they just knew the USB drive worked for some of them and not others, and were resorting to printing things out or taking screen shots.
Sure, sorry for my sarcastic retort. The example is totally germane; I didn't mean to say it wasn't.
That's the situation opinionated software with monolithic abstractions creates. People *will* find workarounds that are terrible for a bunch of reasons. I can vouch for the utility of tools that decide on encodings as late as possible.
Wouldn't it have been great if you couldn't create this mess in the first place, though? The ability to recover is good (and being able to specify the encoding, and write your own custom codec, for IRC is certainly important).
I don't remember either. But, now the driver *allows* me to do that without requiring it, and also allows me to mount the file system so that the paths are exposed as bytes. Since nobody knew the encoding, that was essential to letting me use mlterm to determine it. Nowadays I'd probably use chardet but would still need the raw bytes.
Using latin-1 in this scenario would have worked as well, though.
And as far as I know code point sequence truncation can also occur on FAT16/32 partitions. In the event of such truncation the automatic decoding would only prevent me from mounting the partition. I'm thankful that the implementation allows me to choose a recovery strategy in a very real edge case. If it didn't, I'd have to look up the file system's on disk format and reimplement 99% of a FAT16 driver to get at the data.
OK, now we're getting into some real filesystem esoterica which I'm not sure applies any more :-).
Sorry, my statement you were responding to here was way too strong. What I meant to say here is that long term there is no way to get a "right" answer in this ecosystem, so "UTF-8 is the only correct answer" is the only direction we can push in to actually make things work reasonably by default an increasing proportion of the time. For the foreseeable future, adding the ability to cope with other encodings (encoding a fallback to latin-1 so that you can at least do demojibakefication manually after copy/pasting) is something a general-purpose IRC library absolutely needs. This is why every client has an "encoding" selection menu, too.
For what it's worth, I want to make it easy to use UTF-8. I just don't want to make it hard to use an encoding that's *not* UTF-8.
I want to make it a little hard. Having a version floating around for a few releases that only supports UTF-8 creates gentle social pressure for everyone to fix their encodings. Later releasing the version that supports arbitrary stuff including chardet addresses the long tail of brokenness that can't be fixed by a nudge.
It makes more sense to have an implementation that parses protocol elements as bytes and provides a bytes API. It's fine to also provide a decoded text API, but not to the exclusion of bytes.
This is the point where I think we diverge. I don't think adding a bytes API actually adds any value. Trying to process the contents of IRC as bytes in any way leads to inevitable failures ("line" truncation midway through a UTF-8 escape sequence for example).
This is precisely where we disagree. As I described above, I can think of a couple ways to handle mid-codepoint truncation. A Twisted-based IRC client should have the option to implement its own. The end result would still be text (or at least an informative log message.)
OK, this is definitely the part where we diverge. If you care so much about the hairsplitting specifics of IRC byte handling that you want to change the line-splitting algorithm to do something specific, you should be maintaining Twisted, not writing applications with it. I suppose I should reveal my bias here: IRC is a garbage protocol, and its implementations' main utility should be upward compatibility with something more modern, maybe a line-delimited JSON thing, since XMPP doesn't seem to have taken off. That thing hasn't arrived yet, whatever it will be, but when we present an application-level interface to it, we should strip away as much IRC-specific junk as we can, while still maintaining enough specificity that consumers of the API can provoke specific desired user-facing behaviors in user interfaces (for example, preserving the distinction between "notice" and "message"). Twisted's IRC support's job, in my mind, is to support applications that want to interact with users and servers, and possibly process messages in between. You can't process messages as bytes (see mid-codepoint truncation above), so presenting a bytes-oriented interface is useless for this whole class of application, not just for the final step where the message is presented to a human. Presenting this low-level interface to give users the ability to customize line-splitting is just bonkers.
I think the best way to handle this is to have a bytes-only IRC client that can then be wrapped with something that decodes prefixes and parameters. We can provide a UTF-8 recoder that people are encouraged to use, and an interface that allows implementers to choose their own encoding strategy.
At the risk of repeating myself, the way you select an encoding strategy in Python is selecting an encoding :).
I will say I'm happy to take a stab at a recoder.
You've used this word a few times - what is a "recoder"?
Shipping what we have now will mean we're putting bugs out there (see the line length issues called out in the ticket) and an interface I think we haven't thought through, but that certainly limits what IRC protocol messages you can receive.
I'm OK with there being edge-case bugs like this: we should fix them one at a time. Smaller PRs are better, even if it means not everything works perfectly in every release.
I guess charmap could be used to implement the recovery scheme I keep talking about, but then we'd be telling people to work out the recoding interaction between IRCClient and their own implementation. I'd like to provide a defined way of doing so eventually.
As Kay put it, simple things should be easy, and complex things should be possible. I am happy with this tradeoff - writing this weird transcoding nexus IRC proxy application _should_ be kind of hard ;). Writing a bot that spits out emoji in response to jokes should be easy. (And you can't even encode emoji in KOI-8, so.)
So... have I sold you?
On default UTF-8? Absolutely! But I don't know exactly the way to do it, so I'd rather provide a Python 3 port that actually implements the protocol, and then work out a nice recoding API.
Thanks for taking the time to talk through this. I appreciate it!
Sorry to say my final call (as backed up by Amber, apropos of our earlier IRC conversation (WHICH I SHOULD NOTE WAS CONDUCTED USING UTF-8 TEXT!!!)) is not to act on the revert. But you raise many valid issues and I hope that we can get those nailed by the next release as just regular old bugfixes :). This has been a great conversation though, I hope we can have more like it on the mailing list :). -glyph
On Tue, 22 Nov 2016 at 23:37 Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
This is the part that I'm worried about. It kinda seems like we're moving toward "native string" being the type used in IRCClient, and *that* is capital-W Wrong. Native strings are for Python-native types only, i.e. docstrings and method names.
Unless I'm misunderstanding, we're not "moving towards" it, we have *already arrived*: IRCClient deals in str (bytes) on Python 2, and str (unicode) on Python 3. Even if we want a unicode API, having it only exist on Python 3 seems incredibly confusing from a user standpoint, and would appear to require some absurd contortions to write client code that behaves approximately the same on both Python 2 and 3.
On Wed, 23 Nov 2016 at 01:14 Tristan Seligmann <mithrandi@mithrandi.net> wrote:
On Tue, 22 Nov 2016 at 23:37 Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
This is the part that I'm worried about. It kinda seems like we're moving toward "native string" being the type used in IRCClient, and *that* is capital-W Wrong. Native strings are for Python-native types only, i.e. docstrings and method names.
Unless I'm misunderstanding, we're not "moving towards" it, we have *already arrived*: IRCClient deals in str (bytes) on Python 2, and str (unicode) on Python 3. Even if we want a unicode API, having it only exist on Python 3 seems incredibly confusing from a user standpoint, and would appear to require some absurd contortions to write client code that behaves approximately the same on both Python 2 and 3.
For example, as far as I can tell, the only way to write code to join a channel named #tëst (UTF-8 encoded) is:

    channel = u'#tëst'
    if PY3:
        channel = channel.encode('utf-8')
    client.join(channel)

On Python 3, client.join(b'#t\xc3\xab') will try to send JOIN b'#t\xc3\xab', which is garbage, whereas on Python 2, client.join(u'#t\xebst') will produce a UnicodeEncodeError.
On Nov 22, 2016, at 18:27, Tristan Seligmann <mithrandi@mithrandi.net> wrote:
On Wed, 23 Nov 2016 at 01:26 Tristan Seligmann <mithrandi@mithrandi.net> wrote:
    if PY3:
Argh, the above should be if PY2 of course.
OK, this whole time I thought we were talking about a sensible application of text_type to the API, perhaps with some leniency for bytes-ish-ness on python 2. I haven't reviewed the PR, I was just responding to the concerns as raised on the list. If it's just randomly encoding on one version and not the other, and correct usage of the API depends on *users* doing 'if PY2:' in their own code, then perhaps Mark's concern is indeed well-founded and we should roll it back before 16.6. -glyph
On Tue, Nov 22, 2016 at 06:31:45PM -0500, Glyph Lefkowitz wrote:
OK, this whole time I thought we were talking about a sensible application of text_type to the API, perhaps with some leniency for bytes-ish-ness on python 2. I haven't reviewed the PR, I was just responding to the concerns as raised on the list.
Sorry - I didn't mean to steer this towards API bike shedding.
If it's just randomly encoding on one version and not the other, and correct usage of the API depends on *users* doing 'if PY2:' in their own code, then perhaps Mark's concern is indeed well-founded and we should roll it back before 16.6.
Tristan's exactly right. Furthermore, if we decide to make IRCClient call its various command methods with unicode strings on Python 2, we'll be breaking backwards compatibility. This is what I meant when I wrote:
On Nov 20, 2016, at 19:35, Mark Williams <markrwilliams@gmail.com> wrote:
Yes. Here's the lede: IRCClient should deal in bytes and we should introduce a ProtocolWrapper-like thing that encodes and decodes command prefixes and parameters. It should implement an interface, and we can start with an implementation that only knows about UTF-8. The obvious advantage of this is that you can more easily write IRCClients that work on both Python 2 and 3.
But it totally wasn't clear - sorry! Of course, I also want an IRC client implementation that lets me get at bytes, but that's a discussion I'll move to a new thread. Given the inconsistency between Python 2 and Python 3, do we proceed with the revert? -Mark
On Nov 22, 2016, at 20:27, Mark Williams <markrwilliams@gmail.com> wrote:
On Tue, Nov 22, 2016 at 06:31:45PM -0500, Glyph Lefkowitz wrote:
OK, this whole time I thought we were talking about a sensible application of text_type to the API, perhaps with some leniency for bytes-ish-ness on python 2. I haven't reviewed the PR, I was just responding to the concerns as raised on the list.
Sorry - I didn't mean to steer this towards API bike shedding.
If it's just randomly encoding on one version and not the other, and correct usage of the API depends on *users* doing 'if PY2:' in their own code, then perhaps Mark's concern is indeed well-founded and we should roll it back before 16.6.
Tristan's exactly right. Furthermore, if we decide to make IRCClient call its various command methods with unicode strings on Python 2, we'll be breaking backwards compatibility. This is what I meant when I wrote:
On Nov 20, 2016, at 19:35, Mark Williams <markrwilliams@gmail.com> wrote:
Yes. Here's the lede: IRCClient should deal in bytes and we should introduce a ProtocolWrapper-like thing that encodes and decodes command prefixes and parameters. It should implement an interface, and we can start with an implementation that only knows about UTF-8. The obvious advantage of this is that you can more easily write IRCClients that work on both Python 2 and 3.
But it totally wasn't clear - sorry!
Of course, I also want an IRC client implementation that lets me get at bytes, but that's a discussion I'll move to a new thread.
Given the inconsistency between Python 2 and Python 3, do we proceed with the revert?
Okay. So.

The rule for reverts like this is: if you do something today, which is correct usage of the API and produces an observably correct result, will that be broken in the future if we fix it? If so, then we need to revert because the interface as released is unsupportable.

As it stands, we have a matrix of 4 behaviors:

         bytes     text(ascii)   text(nonascii)
    py2  works     works         UnicodeDecodeError
    py3  garbage   works         works

This... is actually... fine, surprisingly.

The right thing to do is to write code that passes text all the time. If you do that right now, it'll work on py3 and raise an exception on py2, unless it happens to be ASCII, in which case it'll work.

If you write code that passes bytes on py3, it'll just be garbage. But, we want to deprecate that anyway, and you can't get correct, usable behavior out of it, no matter what workarounds you stuff in; so it's a bug, and can be fixed like any bug.

Similarly if you pass non-ascii text on py3, you'll get a UnicodeDecodeError.

This is not a good situation, but it's totally fixable without breaking the interface. We just fix the py2 version to accept text_type as well, and if Mark sneaks in a patch that makes py3 do the right thing with bytes, well, I don't know that I can stop him. More importantly, it would probably be a smaller change to fix the methods (we could even fix them one at a time; say, action, join, etc) than to un-port and re-port the whole thing.

So: yes, it's broken, and in a worse way than I thought. To get it to the point where we can actually implement logic consistently between two versions, we need to add a flag to IRCClient's constructor which is default-false on py2 and default-true on py3 which says "give me text", so that callbacks like privmsg and joined can start receiving text_type on py2 as well as py3; right now it has to receive str because they've previously received str. But that's a separate issue.
I am open to the idea that I have evaluated this incorrectly though, since this has been possibly the most confusing change since https://twistedmatrix.com/trac/ticket/411. But as of right now I still think we shouldn't revert.

-glyph
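The "give me text" constructor flag described above could look roughly like the sketch below. Everything in it is hypothetical: `TextModeClient`, `deliver_text`, and `line_received` are illustrative names, not Twisted's `IRCClient` API, and the UTF-8 default with `errors="replace"` is an assumption made only so the example cannot raise inside the dispatch path.

```python
import sys


class TextModeClient(object):
    """Hypothetical stand-in for an IRC client; not Twisted's IRCClient."""

    def __init__(self, deliver_text=None, encoding="utf-8"):
        # Assumed defaults: bytes on Python 2 (what callbacks have always
        # received there), text on Python 3.
        if deliver_text is None:
            deliver_text = sys.version_info[0] >= 3
        self.deliver_text = deliver_text
        self.encoding = encoding
        self.received = []

    def line_received(self, user, channel, message):
        """Take raw bytes off the wire and dispatch them to privmsg."""
        if self.deliver_text:
            # "replace" substitutes U+FFFD for undecodable bytes instead of
            # raising in the middle of event handling.
            user, channel, message = [
                part.decode(self.encoding, "replace")
                for part in (user, channel, message)
            ]
        self.privmsg(user, channel, message)

    def privmsg(self, user, channel, message):
        # A real client would override this callback; here we just record it.
        self.received.append((user, channel, message))
```

With a flag like this, a caller that opts in gets text on both Python versions, while existing Python 2 code that never passes the flag keeps receiving bytes.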
Been lurking here, no cows in the fire, no irons in the race, or whatever, except wanting Twisted to be perfect and easy to use and being perennially confused by text encoding, but I did notice this:

On 11/22/2016 9:03 PM, Glyph Lefkowitz wrote: [...]
Okay. So.
The rule for reverts like this is: if you do something today, which is correct usage of the API and produces an observably correct result, will that be broken in the future if we fix it? If so, then we need to revert because the interface as released is unsupportable.
As it stands, we have a matrix of 4 behaviors:
        bytes     text(ascii)   text(nonascii)
py2     works     works         UnicodeDecodeError
py3     garbage   works         works
This... is actually... fine, surprisingly.
The /right/ thing to do is to write code that passes text all the time. If you do that right now, it'll work on py3 and raise an exception on py2, unless it /happens/ to be ASCII, in which case it'll work.
If you write code that passes bytes on py3, it'll just be garbage. But, we want to deprecate that anyway, and you can't get correct, usable behavior out of it, no matter what workarounds you stuff in; so it's a bug, and can be fixed like any bug.
Similarly if you pass non-ascii text on py3, you'll get a UnicodeDecodeError.
Shouldn't this be "if you pass non-ascii text on *py2*, you'll get ..."? [...]
-glyph
Pedantically yours,
--
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539
On Tuesday, November 22, 2016, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
Okay. So.
The rule for reverts like this is: if you do something today, which is correct usage of the API and produces an observably correct result, will that be broken in the future if we fix it? If so, then we need to revert because the interface as released is unsupportable.
As it stands, we have a matrix of 4 behaviors:
        bytes     text(ascii)   text(nonascii)
py2     works     works         UnicodeDecodeError
py3     garbage   works         works
This... is actually... fine, surprisingly.
Given that matrix, how would this work on Python 2 and 3:

https://github.com/buildbot/buildbot/blob/40d5dd3d101704aa8db582e306b3c6cf79...

And how would that code not have to change if a future release accommodates Unicode on Python 2 or bytes on Python 3?
On Nov 22, 2016, at 21:36, Mark Williams <markrwilliams@gmail.com> wrote:
On Tuesday, November 22, 2016, Glyph Lefkowitz <glyph@twistedmatrix.com> wrote:
Okay. So.
The rule for reverts like this is: if you do something today, which is correct usage of the API and produces an observably correct result, will that be broken in the future if we fix it? If so, then we need to revert because the interface as released is unsupportable.
As it stands, we have a matrix of 4 behaviors:
        bytes     text(ascii)   text(nonascii)
py2     works     works         UnicodeDecodeError
py3     garbage   works         works
This... is actually... fine, surprisingly.
Given that matrix, how would this work on Python 2 and 3:
https://github.com/buildbot/buildbot/blob/40d5dd3d101704aa8db582e306b3c6cf7921c23c/master/buildbot/reporters/irc.py#L67-L68

It wouldn't work on Python 3 yet. But that's fine: the point is that it wouldn't work! Buildbot can just block porting on that bug.
And how would that code not have to change if a future release accommodates Unicode on Python 2 or bytes on Python 3?
Because it will get broken / undefined behavior on the current implementation. We can always fix broken behavior! What we can't do is fix broken behavior that also breaks other correct behavior or workarounds. But in this case, there's a broken behavior (which we have on trunk) and a correct behavior (which we can implement in the future) and no way to coerce the broken behavior to do something valid via public API.

-glyph
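For readers following along, here is a small Python 3 sketch of the caller-side picture. It assumes one plausible way the "garbage" cell could arise, namely bytes being interpolated into a text template; that mechanism is an illustration, not something confirmed from Twisted's source. `format_privmsg`, `ensure_text`, and `encoded_length` are hypothetical helpers, not part of any library.

```python
def format_privmsg(channel, message):
    """Build a command line with text formatting (illustrative only)."""
    return "PRIVMSG %s :%s" % (channel, message)


def ensure_text(value, encoding="utf-8"):
    """Return text, decoding bytes first if that is what we were given."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def encoded_length(text, encoding="utf-8"):
    """Length in bytes once encoded, which is what a line-size limit sees."""
    return len(text.encode(encoding))


# On Python 3, bytes formatted with %s show up as their repr.
garbled = format_privmsg("#chan", b"hello")       # "PRIVMSG #chan :b'hello'"
clean = format_privmsg("#chan", ensure_text(b"hello"))

# Character count and encoded byte count differ for non-ASCII text.
word = "caf\u00e9"
chars, wire_bytes = len(word), encoded_length(word)  # 4 characters, 5 bytes
```

Normalizing to text at the call site is the "pass text all the time" approach from earlier in the thread, and the last two lines show why a length check done before encoding to UTF-8 can undercount.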
participants (5)
- Amber "Hawkie" Brown
- Glyph Lefkowitz
- John Santos
- Mark Williams
- Tristan Seligmann