On Fri, Nov 18, 2016 at 05:36:16PM -0800, Glyph Lefkowitz wrote:
"doesn't work" is a pretty black-and-white assessment. Are you anticipating a problem with the way the interface is specified that it can't be easily changed?
Yes. Here's the lede: IRCClient should deal in bytes and we should introduce a ProtocolWrapper-like thing that encodes and decodes command prefixes and parameters. It should implement an interface, and we can start with an implementation that only knows about UTF-8. The obvious advantage of this is that you can more easily write IRCClients that work on both Python 2 and 3. I'll attempt to explain others in the rest of this email.
I should say up front here that I think I was being too emphatic in my support for UTF-8.
Test regressions are listed because they're unambiguously cause for a revert; "undesirable" is intentionally vague because we might decide to revert a thing for no reason. I guess opening a PR for a discussion like this is reasonable.
Good to know!
This could be considered an incompatible interface change; I'm honestly not sure about the exact type signatures of various methods to say whether it is or not.
I'm also not entirely sure of the consequences of this interface change. I think it deserves more thought before it becomes an API that we have to support. This is the primary reason I opened the revert PR.
I'm more precisely worried about the fact that the implementation raises a decoding exception that cannot be handled in user code when it receives non-UTF-8 messages, and the fact that the line length checks occur prior to encoding, ensuring mid-codepoint truncation. These issues also contributed to my revert.
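The truncation problem can be reproduced in a few lines. This is only a sketch: `LIMIT` and `truncate_on_code_point` are hypothetical names, and the 5-byte limit stands in for the much larger per-line limit a real server enforces. It shows a length check on text passing while the encoded line is too long, and one possible way to cut on a code point boundary instead.

```python
LIMIT = 5  # hypothetical byte limit, kept tiny so the effect is visible

text = "ab" + "é" * 3           # 5 code points
encoded = text.encode("utf-8")  # 8 bytes: each "é" takes two bytes

# A length check on the text passes even though the encoded line is too long.
passes_text_check = len(text) <= LIMIT

# Whoever enforces the byte limit afterwards cuts inside the second "é".
cut = encoded[:LIMIT]


def truncate_on_code_point(data, limit):
    """Cut valid UTF-8 at `limit` bytes, then drop a dangling partial sequence."""
    # errors="ignore" discards the incomplete trailing bytes left by the slice.
    return data[:limit].decode("utf-8", errors="ignore").encode("utf-8")
```

Checking the length after encoding, or truncating as in `truncate_on_code_point`, avoids sending a line that ends mid-code-point. It assumes the input was valid UTF-8 to begin with, since `errors="ignore"` would also silently drop invalid bytes elsewhere.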
My points are, separately:
1. IRC is text. It's nonsensical to process it as bytes, because you can't process it as bytes. This is separate from the question of "what encoding is IRC".
It's nonsensical that it be finally presented to a human as raw bytes. I'm advocating for the decision to be made as late as possible. That doesn't mean we can't provide an easy-to-use recoding client that we encourage people to turn to first.
2. UTF-8 is good. There should be gradual social pressure to use UTF-8 everywhere (I'm a fan of http://utf8everywhere.org/). This is especially true in protocols like IRC and filenames where there's no mechanism to specify an encoding so that it can be correctly decoded. Therefore:
   2.1. An initial release which features UTF-8 only is fine; therefore there's no need to do a revert.
   2.2. Defaulting to UTF-8 is reasonable for the foreseeable future; users should only change this if they know that they want something unusual.
3. IRC is an incompatible and broken wasteland; thanks to your quantitative research we know exactly how broken. Therefore:
   3.1. "Support alternate encodings" is a valuable feature. Supporting point 2.1, this feature can be added on at any later point, making a revert of the present implementation unnecessary.
   3.2. We can, and should, just go ahead and add support for alternate (per-server, per-channel, per-user) default and fallback encodings.
   3.3. We should always have a fallback encoding, since blowing up on "invalid" data on a protocol where there's no standard to say what is or isn't valid doesn't seem very helpful.
I appreciate the consistency of this, and agree the documented preference should be a client implementation that assumes UTF-8. But we can't have *a* fallback encoding. My encoding detector program indicates that latin-1 is the second most popular encoding for European IRC servers, but Russian servers I sampled (not in netsplit.de's top 10) used a variety of Cyrillic encodings.
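To illustrate why a single fallback is not enough, here is a minimal sketch of decoding against an ordered list of candidate encodings. The function name `decode_with_candidates` and the idea of a per-server candidate list are assumptions made for illustration, not part of any existing API.

```python
def decode_with_candidates(data, candidates):
    """Return (text, encoding) for the first candidate that decodes `data`."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return data.decode("latin-1"), "latin-1"


# The same bytes, as a Russian server using KOI8-R might send them.
koi8_bytes = "привет".encode("koi8-r")

russian_guess = decode_with_candidates(koi8_bytes, ("utf-8", "koi8-r"))
european_guess = decode_with_candidates(koi8_bytes, ("utf-8", "latin-1"))
```

Both calls succeed, but only the first yields readable text. Single-byte encodings accept almost any input, so the order of the candidates is a policy decision that depends on the server, which is why it is worth leaving to the user.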
I also want to enable arbitrary recovery strategies for bad encodings. For instance, in the case that an IRC client or server truncates a code point at a line boundary, it might be the right idea to binary search until the invalid byte sequence is found, and then exclude it. It might be the right idea to buffer the message for a time in the hopes that the codepoint got split over two lines.
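Both recovery strategies can be sketched with the standard library alone. The names `decode_dropping_invalid` and `SplitCodePointBuffer` are hypothetical. In Python the binary search is unnecessary, because `UnicodeDecodeError` reports the offending span through its `start` and `end` attributes. The buffering class ignores the "for a time" part; a real client would need a timeout to flush bytes that never get completed.

```python
import codecs


def decode_dropping_invalid(data, encoding="utf-8"):
    """Decode `data`; return (text, dropped), where dropped lists excluded bytes."""
    pieces, dropped = [], []
    while data:
        try:
            pieces.append(data.decode(encoding))
            break
        except UnicodeDecodeError as error:
            # Everything before error.start decoded cleanly; skip the bad span.
            pieces.append(data[:error.start].decode(encoding))
            dropped.append(data[error.start:error.end])
            data = data[error.end:]
    return "".join(pieces), dropped


class SplitCodePointBuffer:
    """Hold back an incomplete trailing sequence until the next payload arrives."""

    def __init__(self, encoding="utf-8"):
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def feed(self, payload):
        # final=False tells the decoder that more bytes may follow.
        return self._decoder.decode(payload, final=False)
```

Returning the dropped bytes alongside the text leaves room for the informative log message, and neither strategy is possible if the client has already raised while decoding.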
And what if somebody wants to run another encoding survey?
I don't expect most users to do any of that, but *I* certainly want to without having to copy and paste a bunch of code.
When I received Arabic PDFs on a FAT16 USB drive with filenames in CP1256, I had to switch mlterm to that particular code page to read the directory listings so I could use convmv to convert them to UTF-8.
There is no question that your life has been hard, and that a wide array of people have made bad decisions that contribute to your difficulties. :-)
My real point was that dealing with bad encodings is not theoretical. Nobody knew the encoding, by the way; they just knew the USB drive worked for some of them and not others, and were resorting to printing things out or taking screenshots.
That's the situation opinionated software with monolithic abstractions creates. People *will* find workarounds that are terrible for a bunch of reasons. I can vouch for the utility of tools that decide on encodings as late as possible.
Note that I'm not asking that we be everything to all people, but rather that we allow people the option of dealing with the IRC encoding disaster the way they see fit.
But, Linux's FAT16 driver has decided that.
The correct way to solve your problem with current Linux (I don't know if this was possible at the time) would be to address it with mount, not special user-space software. Specifically, I think it would be something like:
mount -t vfat -o fat=16,iocharset=utf8,codepage=1256 /dev/disk/by-label/arabic.msdos /media/arabic.msdos
Now all your GTK+ software works, too, because you're not trying to reconcile your legacy format support at the application level.
I don't remember either. But, now the driver *allows* me to do that without requiring it, and also allows me to mount the file system so that the paths are exposed as bytes. Since nobody knew the encoding, that was essential to letting me use mlterm to determine it. Nowadays I'd probably use chardet but would still need the raw bytes.
And as far as I know code point sequence truncation can also occur on FAT16/32 partitions. In the event of such truncation the automatic decoding would only prevent me from mounting the partition. I'm thankful that the implementation allows me to choose a recovery strategy in a very real edge case. If it didn't, I'd have to look up the file system's on disk format and reimplement 99% of a FAT16 driver to get at the data.
So it's the case that raw bytes weren't useful to me when I tried to actually read the paths, but they were super useful to me when a perfectly reasonable assumption was wrong. And when no encoding is mandated, perfectly reasonable assumptions do fail and fail often.
What if I want to write a bot that bridges two IRC networks?
In the current release, yes. But in a future release: no, you can't just bridge arbitrary bytes between two networks and expect them to work. Those networks (or channels, or users) might have different implicit encoding rules; which, by default and only by default, should be utf-8. In a multi-encoding world, you may need to transcode between them to properly bridge; this is a consequence of the fact that eventually you're presenting this data as text to human eyeballs.
It's true that if one channel is latin-1 and the other is MacCyrillic, a text-only IRCClient implementation could handle this just by allowing the user to choose an encoding. The recoding API I'm talking about wouldn't give you anything there. But it would help with truncation issues and with channels' topics using different encodings.
But none of this is actually true. What seems to be true is that non-utf-8 encodings are rarely if ever seen on Freenode, and sometimes to regularly seen on many other IRC servers. These encodings are certainly used.
I can't really parse you here - are you saying that each network more or less sticks to one encoding?
Not quite - I meant that in my survey, I saw no latin-1 on Freenode, but that may be because they decided I was abusing the network early on in my attempt to list and join channels.
But on other networks I saw a lot of different encodings, used across different channels, so that the channel list contained topics encoded in many different 8-bit encodings.
Sorry, my statement you were responding to here was way too strong. What I meant to say here is that long term there is no way to get a "right" answer in this ecosystem, so "UTF-8 is the only correct answer" is the only direction we can push in to actually make things work reasonably by default an increasing proportion of the time. For the foreseeable future, adding the ability to cope with other encodings (encoding a fallback to latin-1 so that you can at least do demojibakefication manually after copy/pasting) is something a general-purpose IRC library absolutely needs. This is why every client has an "encoding" selection menu, too.
For what it's worth, I want to make it easy to use UTF-8. I just don't want to make it hard to use an encoding that's *not* UTF-8.
It makes more sense to have an implementation that parses protocol elements as bytes and provides a bytes API. It's fine to also provide a decoded text API, but not to the exclusion of bytes.
This is the point where I think we diverge. I don't think adding a bytes API actually adds any value. Trying to process the contents of IRC as bytes in any way leads to inevitable failures ("line" truncation midway through a UTF-8 escape sequence for example).
This is precisely where we disagree. As I described above, I can think of a couple ways to handle mid-codepoint truncation. A Twisted-based IRC client should have the option to implement its own. The end result would still be text (or at least an informative log message.)
I think the best way to handle this is to have a bytes-only IRC client that can then be wrapped with something that decodes prefixes and parameters. We can provide a UTF-8 recoder that people are encouraged to use, and an interface that allows implementers to choose their own encoding strategy.
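One possible shape for that layering, as a rough sketch only. None of these names exist in Twisted: `Recoder`, `UTF8Recoder`, `BytesClient`, `RecodingClient`, `textPrivmsg` and `say` are hypothetical, and `BytesClient` is a stand-in for a bytes-only client rather than the real IRCClient.

```python
from abc import ABC, abstractmethod


class Recoder(ABC):
    """Hypothetical interface: how prefixes and parameters cross the bytes/text line."""

    @abstractmethod
    def decode(self, data: bytes) -> str: ...

    @abstractmethod
    def encode(self, text: str) -> bytes: ...


class UTF8Recoder(Recoder):
    """The encouraged default; other strategies implement the same two methods."""

    def decode(self, data):
        return data.decode("utf-8")

    def encode(self, text):
        return text.encode("utf-8")


class BytesClient:
    """Stand-in for a bytes-only client: bytes in, bytes out."""

    def __init__(self):
        self.sent = []

    def sendLine(self, line: bytes):
        self.sent.append(line)

    def privmsg(self, user: bytes, channel: bytes, message: bytes):
        """Called with raw bytes for each incoming PRIVMSG."""


class RecodingClient(BytesClient):
    """Layers a text API over the bytes API using a pluggable recoder."""

    def __init__(self, recoder):
        super().__init__()
        self.recoder = recoder
        self.received = []

    def privmsg(self, user, channel, message):
        decode = self.recoder.decode
        self.textPrivmsg(decode(user), decode(channel), decode(message))

    def textPrivmsg(self, user, channel, message):
        self.received.append((user, channel, message))

    def say(self, channel: str, text: str):
        encode = self.recoder.encode
        self.sendLine(b"PRIVMSG " + encode(channel) + b" :" + encode(text))
```

Because decoding happens in an overridable layer, a decoding failure surfaces in code the user controls, and a different recovery strategy is a matter of passing a different recoder.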
I don't think it can be a ProtocolWrapper, because it'll need to know about the particulars of IRCClient. That means I don't have a clear idea of the interface yet. Until I do, I'd prefer we ship something that implements the RFC and allows people to handle encoding the way they see fit. I will say I'm happy to take a stab at a recoder. But it can't be written with IRCClient as it stands now and would certainly be done in a separate PR.
Shipping what we have now will mean we're putting bugs out there (see the line length issues called out in the ticket) and an interface I think we haven't thought through, but that certainly limits what IRC protocol messages you can receive.
(Also - I don't think any multibyte UTF-8 sequence can contain a byte <= 127, so that it can't be truncated by ASCII-only code. This of course isn't true for fixed-width encodings. '\n\n' is a totally valid UTF-16 sequence.)
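The parenthetical can be checked directly. This small sketch uses a hypothetical helper, `bytes_of_multibyte_sequences`, and an arbitrary sample string; it is a spot check on a few scripts, not a proof over all of Unicode.

```python
def bytes_of_multibyte_sequences(text):
    """Yield every byte that belongs to a multi-byte UTF-8 sequence in `text`."""
    for character in text:
        encoded = character.encode("utf-8")
        if len(encoded) > 1:
            yield from encoded


sample = "é привет 日本語 😀"

# Every lead and continuation byte has the high bit set, so none can be
# mistaken for CR, LF, space or any other ASCII delimiter.
all_high = all(byte >= 0x80 for byte in bytes_of_multibyte_sequences(sample))

# UTF-16 gives no such guarantee: U+0A0A encodes to two newline bytes.
utf16_newlines = "\u0a0a".encode("utf-16-be")
```

So splitting a UTF-8 stream on ASCII delimiters is safe, while cutting it at a fixed byte offset is not.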
So, the thing IRC is transmitting is text. The way it's transmitting it is poorly specified and will need manual configurability hooks to specify encoding information, probably forever, and perhaps even to guess it (although "encoding=chardet" would be nice). I agree that just saying "UTF-8 or GTFO" is not a sustainable approach at all. "UTF-8 or have a bad time with this fiddly customization API and config file" is fine, because anyone wanting something else is probably already having a bad time.
If you are engaging in a real abuse of the IRC protocol and you're treating it as an 8-bit clean stream to send some escaped binary data through (like a video stream, something like that), well, that's what the 'charmap' alias of 'latin-1' is for :-).
I guess charmap could be used to implement the recovery scheme I keep talking about, but then we'd be telling people to work out the recoding interaction between IRCClient and their own implementation. I'd like to provide a defined way of doing so eventually.
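For reference, the 8-bit-clean property being relied on looks like this in Python. The sketch uses the `latin-1` codec name directly, and `redecode` is a hypothetical helper showing how user code could recover the original bytes and try another encoding.

```python
def redecode(text, encoding):
    """Reinterpret text that was decoded as latin-1 using another encoding."""
    # latin-1 maps code points 0-255 back to the same byte values.
    return text.encode("latin-1").decode(encoding)


# Every possible byte value survives a latin-1 round trip unchanged.
raw = bytes(range(256))
round_tripped = raw.decode("latin-1").encode("latin-1")

# UTF-8 data read as latin-1 is mojibake, but no information is lost.
mojibake = "héllo".encode("utf-8").decode("latin-1")
recovered = redecode(mojibake, "utf-8")
```

This works, but it pushes the recoding dance onto every caller, which is the interaction a defined recoding API would take over.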
So... have I sold you?
On default UTF-8? Absolutely! But I don't know exactly the way to do it, so I'd rather provide a Python 3 port that actually implements the protocol, and then work out a nice recoding API.
Thanks for taking the time to talk through this. I appreciate it!