I'm also not entirely sure of the consequences of this interface
change. I think it deserves more thought before it becomes an API
that we have to support. This is the primary reason I opened the
One of the things that's informing my decision is that IRCClient is already an incredibly ill-defined API that probably needs to be deprecated and overhauled at some point. However, in the intervening (what will almost certainly be a) decade, I'd like it to work on Python 3.
I'm more precisely worried about the fact that the implementation
raises a decoding exception that cannot be handled in user code when
it receives non-UTF-8 messages,
The right way to deal with this is twofold:
- Add the ability to specify both the "encoding" and the "errors" of the relevant codec <https://docs.python.org/2.7/library/codecs.html#codecs.decode>, so that we can choose error handling strategies.
- (potentially, if you have very nuanced requirements for dealing with weird encodings) write a codec that logs and handles its own errors. (We probably shouldn't be logging a traceback for encoding problems regardless, if it's UnicodeDecodeError. But that's something that can easily be fixed in subsequent releases as well)
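As a rough illustration of the first point, here is a minimal sketch of what passing both "encoding" and "errors" through to the codec buys you. The helper name decode_line is hypothetical and is not part of IRCClient; it only wraps the standard bytes.decode call.

```python
def decode_line(raw, encoding="utf-8", errors="replace"):
    """Decode one received line with a configurable codec and error handler.

    Hypothetical helper for illustration only; not an existing Twisted API.
    """
    return raw.decode(encoding, errors)


bad_utf8 = b"caf\xe9"  # "café" encoded as latin-1, so it is not valid UTF-8

replaced = decode_line(bad_utf8)                         # U+FFFD marks the bad byte
ignored = decode_line(bad_utf8, errors="ignore")         # the bad byte is dropped
as_latin1 = decode_line(bad_utf8, "latin-1", "strict")   # decodes cleanly
```

With errors="strict" and the default UTF-8 encoding, the same input raises UnicodeDecodeError, which is the case the caller currently has no way to handle.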
and the fact that the line length checks occur prior to encoding, ensuring mid-codepoint truncation. These issues also contributed to my revert.
Line length checks are a super interesting example because I think they illustrate my concerns as well.
To properly do message-splitting (which is why we're checking line length), you have to:
- check the length in octets (because it's actually a message-length limit in octets, not a line-length limit in characters)
- split the textual representation - ideally somewhere relevant like a word break, which you can only detect in text!
- try encoding again and ensure that the encoded representation is the correct length, repeating if necessary.
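The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not what IRCClient does today: split_message is a hypothetical name, and it assumes an encoding where every character encodes to at least one octet and a prefix never encodes longer than the whole (true of UTF-8 and latin-1).

```python
def split_message(text, max_octets, encoding="utf-8"):
    """Split text into chunks whose *encoded* form fits in max_octets.

    Hypothetical helper: measures in octets, cuts in text (preferring a
    space), and re-encodes to confirm each chunk fits.
    """
    chunks = []
    remaining = text
    while remaining:
        if len(remaining.encode(encoding)) <= max_octets:
            chunks.append(remaining)
            break
        # At most max_octets characters can fit; shrink until the encoded
        # prefix really is within the octet limit.
        cut = min(len(remaining), max_octets)
        while cut > 0 and len(remaining[:cut].encode(encoding)) > max_octets:
            cut -= 1
        if cut == 0:
            raise ValueError("max_octets is too small for a single character")
        # Prefer a word break at or before the cut; fall back to a hard cut
        # between characters (never inside one).
        space = remaining.rfind(" ", 0, cut + 1)
        if space > 0:
            chunk, remaining = remaining[:space], remaining[space + 1:]
        else:
            chunk, remaining = remaining[:cut], remaining[cut:]
        chunks.append(chunk)
    return chunks


parts = split_message("h\xe9llo w\xf6rld", 6)
```

Because the cut is always made on the text and then checked against the encoded length, a chunk can never end in the middle of a code point.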
This is an implementation-level bug though, not an interface-level one, so I'm also comfortable fixing this bug in the future.
My points are, separately:
IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".
It's nonsensical that it be finally presented to a human as raw bytes.
I'm advocating for the decision to be made as late as possible. That
doesn't mean we can't provide an easy-to-use recoding client that we
encourage people to turn to first.
You can't process it as bytes either, though. In some cases you think you can, but then you get mid-codepoint truncation :-).
But we can't have *a* fallback encoding. My encoding detector program
indicates that latin-1 is the second most popular encoding for
European IRC servers, but Russian servers I sampled (not in
netsplit.de's top 10) used a variety of Cyrillic encodings.
If you really want to do something this sophisticated (and, I should note: no other IRC clients or bots I'm aware of do, so I think you've got an unrealistically tight set of requirements) then you can just write your own single codec that composes a bunch of others, and install it. Python's encoding system is extensible for exactly this reason :).
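For what it's worth, a composed codec of that kind might look something like the sketch below. The codec name "irc_fallback" and the choice of fallbacks are assumptions for illustration; the only real APIs used are codecs.register and codecs.CodecInfo.

```python
import codecs


def _fallback_decode(data, errors="strict"):
    # Try UTF-8 first; if that fails, fall back to latin-1, which accepts
    # every byte value and therefore never raises.
    raw = bytes(data)
    try:
        return raw.decode("utf-8"), len(raw)
    except UnicodeDecodeError:
        return raw.decode("latin-1"), len(raw)


def _fallback_encode(text, errors="strict"):
    # Always send UTF-8, regardless of what was received.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    # Codec search function: answer only for our hypothetical codec name.
    if name == "irc_fallback":
        return codecs.CodecInfo(
            encode=_fallback_encode,
            decode=_fallback_decode,
            name="irc_fallback",
        )
    return None


codecs.register(_search)

from_utf8 = b"caf\xc3\xa9".decode("irc_fallback")
from_latin1 = b"caf\xe9".decode("irc_fallback")
```

Once registered, the name can be passed anywhere an encoding name is accepted, so an "encoding" parameter is enough to plug this in.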
I also want to enable arbitrary recovery strategies for bad encodings.
This is totally not an IRC-specific thing though :-).
For instance, in the case that an IRC client or server truncates a
code point at a line boundary, it might be the right idea to binary
search until the invalid byte sequence is found, and then exclude it.
It might be the right idea to buffer the message for a time in the
hopes that the codepoint got split over two lines.
And what if somebody wants to run another encoding survey?
Decode as charmap, which is what we call latin-1 when we want to do this :). That's a super edge-case, and should not be easy by default.
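To make the latin-1 trick concrete, here is a small sketch. It relies only on the fact that latin-1 maps each of the 256 byte values to exactly one code point, so decoding and re-encoding is lossless; the variable names are just for illustration.

```python
# Every possible byte value survives a latin-1 round trip unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")

# So a survey tool can carry "text" around and still get the raw bytes back
# to try another encoding later.
raw = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")  # Cyrillic sample
carried_as_text = raw.decode("latin-1")        # mojibake, but lossless
recovered = carried_as_text.encode("latin-1")  # original octets again
redecoded = recovered.decode("utf-8")
```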
I don't expect most users to do any of that, but *I* certainly want to
without having to copy and paste a bunch of code.
You can totally do all of these things once we can specify an encoding.
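As one example, the "buffer and hope the code point was split over two lines" strategy is roughly what the standard library's incremental decoders already do. The sketch below assumes the caller feeds raw chunks in arrival order; decode_chunks is a hypothetical name.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode a sequence of byte chunks, holding back incomplete code points.

    Hypothetical helper: an incomplete trailing sequence is buffered inside
    the decoder and completed by the next chunk.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if bytes are still dangling.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


# "café ok" in UTF-8, split in the middle of the two-byte "é".
pieces = decode_chunks([b"caf\xc3", b"\xa9 ok"])
```

If the second half never arrives, the final call raises UnicodeDecodeError, at which point the application can pick whatever error handling it prefers.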
<arabic USB drive>
My real point was that dealing with bad encodings is not theoretical.
Nobody knew the encoding, by the way; they just knew the USB drive
worked for some of them and not others, and were resorting to printing
things out or taking screenshots.
Sure, sorry for my sarcastic retort. The example is totally germane; I didn't mean to say it wasn't.
That's the situation opinionated software with monolithic abstractions
creates. People *will* find workarounds that are terrible for a bunch
of reasons. I can vouch for the utility of tools that decide on
encodings as late as possible.
Wouldn't it have been great if you couldn't create this mess in the first place, though? The ability to recover is good (and being able to specify the encoding, and write your own custom codec, for IRC is certainly important).
I don't remember either. But, now the driver *allows* me to do that
without requiring it, and also allows me to mount the file system so
that the paths are exposed as bytes. Since nobody knew the encoding,
that was essential to letting me use mlterm to determine it. Nowadays
I'd probably use chardet but would still need the raw bytes.
Using latin-1 in this scenario would have worked as well, though.
And as far as I know code point sequence truncation can also occur on
FAT16/32 partitions. In the event of such truncation the automatic
decoding would only prevent me from mounting the partition. I'm
thankful that the implementation allows me to choose a recovery
strategy in a very real edge case. If it didn't, I'd have to look up
the file system's on disk format and reimplement 99% of a FAT16 driver
to get at the data.
OK, now we're getting into some real filesystem esoterica which I'm not sure applies any more :-).
Sorry, my statement you were responding to here was way too strong. What I meant to say here is that long term there is no way to get a "right" answer in this ecosystem, so "UTF-8 is the only correct answer" is the only direction we can push in to actually make things work reasonably by default an increasing proportion of the time. For the foreseeable future, adding the ability to cope with other encodings (including a fallback to latin-1 so that you can at least do demojibakefication manually after copy/pasting) is something a general-purpose IRC library absolutely needs. This is why every client has an "encoding" selection menu, too.
For what it's worth, I want to make it easy to use UTF-8. I just
don't want to make it hard to use an encoding that's *not* UTF-8.
I want to make it a little hard. Having a version floating around for a few releases that only supports UTF-8 creates gentle social pressure for everyone to fix their encodings. Later releasing the version that supports arbitrary stuff including chardet addresses the long tail of brokenness that can't be fixed by a nudge.
It makes more sense to have an implementation that parses protocol
elements as bytes and provides a bytes API. It's fine to also provide
a decoded text API, but not to the exclusion of bytes.
This is the point where I think we diverge. I don't think adding a bytes API actually adds any value. Trying to process the contents of IRC as bytes in any way leads to inevitable failures ("line" truncation midway through a UTF-8 multi-byte sequence, for example).
This is precisely where we disagree. As I described above, I can
think of a couple ways to handle mid-codepoint truncation. A
Twisted-based IRC client should have the option to implement its own.
The end result would still be text (or at least an informative log
OK, this is definitely the part where we diverge.
If you care so much about the hairsplitting specifics of IRC byte handling that you want to change the line-splitting algorithm to do something specific, you should be maintaining Twisted, not writing applications with it.
I suppose I should reveal my bias here: IRC is a garbage protocol, and its implementations' main utility should be upward compatibility with something more modern, maybe a line-delimited JSON thing, since XMPP doesn't seem to have taken off. That thing hasn't arrived yet, whatever it will be, but when we present an application-level interface to it, we should strip away as much IRC-specific junk as we can, while still maintaining enough specificity that consumers of the API can provoke specific desired user-facing behaviors in user interfaces (for example, preserving the distinction between "notice" and "message").
Twisted's IRC support's job, in my mind, is to support applications that want to interact with users and servers, and possibly process messages in between. You can't process messages as bytes (see mid-codepoint truncation above), so presenting a bytes-oriented interface is useless for this whole class of application, not just for the final step where the message is presented to a human. Presenting this low-level interface to give users the ability to customize line-splitting is just bonkers.
I think the best way to handle this is to have a bytes-only IRC client
that can then be wrapped with something that decodes prefixes and
parameters. We can provide a UTF-8 recoder that people are encouraged
to use, and an interface that allows implementers to choose their own
At the risk of repeating myself, the way you select an encoding strategy in Python is selecting an encoding :).
I will say I'm happy to take a stab at a recoder.
You've used this word a few times - what is a "recoder"?
Shipping what we have now will mean we're putting bugs out there (see
the line length issues called out in the ticket) and an interface I
think we haven't thought through, but that certainly limits what IRC
protocol messages you can receive.
I'm OK with there being edge-case bugs like this: we should fix them one at a time. Smaller PRs are better, even if it means not everything works perfectly in every release.
I guess charmap could be used to implement the recovery scheme I keep
talking about, but then we'd be telling people to work out the
recoding interaction between IRCClient and their own implementation.
I'd like to provide a defined way of doing so eventually.
As Kay put it, simple things should be easy, and complex things should be possible. I am happy with this tradeoff - writing this weird transcoding nexus IRC proxy application _should_ be kind of hard ;). Writing a bot that spits out emoji in response to jokes should be easy. (And you can't even encode emoji in KOI-8, so.)
So... have I sold you?
On default UTF-8? Absolutely! But I don't know exactly the way to do
it, so I'd rather provide a Python 3 port that actually implements the
protocol, and then work out a nice recoding API.
Thanks for taking the time to talk through this. I appreciate it!
Sorry to say my final call (as backed up by Amber, apropos of our earlier IRC conversation (WHICH I SHOULD NOTE WAS CONDUCTED USING UTF-8 TEXT!!!)) is not to act on the revert. But you raise many valid issues and I hope that we can get those nailed by the next release as just regular old bugfixes :).
This has been a great conversation though, I hope we can have more like it on the mailing list :).