[Python-ideas] Processing surrogates in

Andrew Barnert abarnert at yahoo.com
Fri May 15 21:37:57 CEST 2015


On May 15, 2015, at 11:19, random832 at fastmail.us wrote:
> 
>> On Thu, May 14, 2015, at 16:17, Andrew Barnert wrote:
>> The point is that you can miscount lengths by counting the BOM, and you
>> can split a BOM stream into a BOM steam and an "I hope it's in native
>> order or we're screwed" stream.
> 
> Python provides no operations for splitting streams. You mention
> re.split further on, but that only works on in-memory strings, which
> should have already had the BOM stripped and been put in native order.

If you're decoding to text, you don't have UTF-16 anymore (or, if you do under the covers, you neither know nor care that you do); you have Unicode text.

Conversely, if you have UTF-16--even in native order and with the BOM stripped--you don't have text; you still have bytes (or WCHARs, if you prefer, but not in Python).

Why would you want to transcode from one encoding to another in memory just to end up working on encoded bytes anyway? There's no more reason to pass byteswapped, BOM-stripped UTF-16 to re.split than there is to pass any other encoded bytes to re.split.
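
To be concrete, the sane version looks something like this (a sketch; the filename and the split pattern are invented for illustration):

    import re

    # Decode at the boundary: bytes in, text out. The utf-16 codec
    # detects byte order from the BOM and strips it for you.
    with open('data.txt', encoding='utf-16') as f:
        text = f.read()

    # re.split now sees str, with no BOM or endianness left to care about.
    fields = re.split(r'\s*;\s*', text)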

> In-memory wide strings should _never_ be in an endianness other than the
> machine's native one and should _never_ have a BOM. That should be taken
> care of when reading it off the disk/wire. If you haven't done that, you
> still have a byte array, which it's not so easy to accidentally assume
> you'll be able to split up and pass to your fromstring function.

I explicitly mentioned opening the file in binary mode, reading it in, and passing it to some fromstring function that takes bytes, so yes, of course you have a byte array.

And again, if you have UTF-16, even in native endianness and without a BOM, that's still a byte array, so how is that any different?

And of course you can have in-memory byte arrays with a BOM, or in non-native endianness; that's what the UTF-16 and UTF-16-BE (or -LE) codecs produce and consume.
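
For instance (output shown for a little-endian machine; the bytes flip on a big-endian one):

    >>> 'hi'.encode('utf-16')     # BOM prepended, native byte order
    b'\xff\xfeh\x00i\x00'
    >>> 'hi'.encode('utf-16-be')  # explicit big-endian, no BOM
    b'\x00h\x00i'
    >>> b'\xff\xfeh\x00i\x00'.decode('utf-16')  # BOM read and consumed
    'hi'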

And it _is_ easy to use those byte arrays, exactly as easy as using UTF-8 byte arrays or native-endian BOM-less UTF-16 byte arrays or anything else. All you need is a library that's willing to do the decoding for you in its loads/fromstring/etc. function, which includes most libraries on PyPI (because otherwise they wouldn't work with str in 2.x). See simplejson for an example.

>> Which is wrong, but it works when I test it,
>> because most UTF-16 files are little-endian, and so is my machine. And
>> then someone runs my app on a big-endian machine and they get a
>> hard-to-debug exception (or, if we're really unlucky, silent mojibake,
>> but that's pretty rare).
> 
> The proper equivalent of a UTF-16 file with a byte-order-mark would be a
> _binary_ StringIO on a _byte_ array containing a BOM and UTF-16.

I mentioned BytesIO; that's what a binary StringIO is called.
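
For instance (a sketch, with BytesIO standing in for a real file or socket):

    import io

    raw = io.BytesIO('héllo'.encode('utf-16'))       # BOM + native-order UTF-16
    text = io.TextIOWrapper(raw, encoding='utf-16')  # decoding happens here
    text.read()  # 'héllo' -- the BOM and byte order never reach your code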

> You can
> layer a TextIOWrapper on top of either of them. And it never makes sense
> to expect to be able to arbitrarily split up encoded byte arrays,
> whether those are in UTF-16 or not.

There are countless protocols and file formats that _require_ being able to split byte arrays before decoding them. That's how you split the header and body of an RFC822 message like an email or an HTTP response, and how you parse OLE substreams out of a binary-format Office file.
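
For example (a sketch with a canned response; real code would read the bytes off a socket):

    raw = (b'HTTP/1.1 200 OK\r\n'
           b'Content-Type: text/plain; charset=utf-16\r\n'
           b'\r\n'
           b'\xff\xfeh\x00i\x00')

    head, _, body = raw.partition(b'\r\n\r\n')  # split *before* decoding
    headers = head.decode('ascii')              # the header section is ASCII
    text = body.decode('utf-16')                # body decoded per its charset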

>> Usually, when you have WCHARs, it's because you opened a file and wread
>> from it, or received UTF-16 over the network or from a Windows FooW API,
>> in which case you have the same endianness issues as any other binary I/O
>> on non-char-sized types. And yes, of course the right answer is to decode
>> at input, but if you're doing that, why wouldn't you just decode to
>> Unicode instead of byte-swapping the WCHARs?
> 
> You shouldn't have WCHARs (of any kind) in the first place until you've
> decoded.

And yet Microsoft's APIs, both Win32 and MSVCRT, are full of wread and similar functions.

But anyway, I'll grant that you usually shouldn't have WCHARs before you've decoded.

But you definitely should not have WCHARs _after_ you've decoded. In fact, you _can't_ have them after you've decoded, because a WCHAR isn't big enough to hold a Unicode code point. If you have WCHARs, either you're still encoded (or just transcoded to UTF-16), or your code will break as soon as you get a Chinese user with a moderately uncommon last name.
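
The arithmetic is easy to check (Python 3, where str holds full code points):

    >>> ch = '\U00020000'            # a CJK Extension B ideograph, outside the BMP
    >>> len(ch)                      # one code point
    1
    >>> len(ch.encode('utf-16-le'))  # four bytes: a two-code-unit surrogate pair
    4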

So, you should never have WCHARs. Which was my point in the first place.

If you need to deal with UTF-16 streams, treat them as streams of bytes and decode them the same way you would UTF-8 or Big5 or anything else; don't treat them as streams of WCHARs that are often but not always complete Unicode characters.
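
And if you're consuming a UTF-16 stream off a socket, the stdlib will even keep track of the boundaries for you (a sketch; `chunks` stands in for whatever your transport hands you):

    import codecs

    decoder = codecs.getincrementaldecoder('utf-16')()
    pieces = []
    for chunk in chunks:  # bytes, split at arbitrary boundaries
        # BOM, byte order, and surrogate pairs split across chunks
        # are all the codec's problem, not yours.
        pieces.append(decoder.decode(chunk))
    pieces.append(decoder.decode(b'', final=True))
    text = ''.join(pieces)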

> If you're receiving UTF-16 of unknown endianness over the
> network you should be receiving it as bytes. If you're directly calling
> a FooW API, you are obviously on a win32 system and you've already got
> native WCHARs in native endianness.

Only if you got those characters from another win32 FooW API, as opposed to, say, user input via a cross-platform GUI framework that may have different rules from Windows.

> But, once again, that wasn't really
> my point.
> 
> My point is that there are no native libc functions for working with utf-8
> strings - even if you're willing to presume that the native multibyte
> character set is UTF-8, there are very few standard functions for
> working with multibyte characters. "ascii compatibility" means you're
> going to write something using strchr or strtok that works for ascii
> characters and does something terrible when given non-ascii multibyte
> characters to search for.

But many specific static patterns _do_ work with ASCII-compatible encodings. Again, think of HTTP responses: even though the headers and body are both text, they're defined as being separated by b"\r\n\r\n".
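
The reason that's safe is exactly ASCII compatibility: in UTF-8, an ASCII byte can never occur inside a multibyte sequence, while in UTF-16 a one-byte delimiter lands in the middle of a code unit. Roughly:

    >>> 'naïve:text'.encode('utf-8').split(b':')
    [b'na\xc3\xafve', b'text']
    >>> 'naïve:text'.encode('utf-16-le').split(b':')
    [b'n\x00a\x00\xef\x00v\x00e\x00', b'\x00t\x00e\x00x\x00t\x00']

The second split eats half of the colon's code unit and leaves everything after it misaligned by a byte.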

If this were never useful--or if it often seemed useful but was really just an attractive nuisance--Python 3 wouldn't have bytes.split and bytes.find and be adding bytes.__mod__. Or do you think that proposal is a mistake?

> The benefits of using libc only work if you play by libc's rules, which
> we've established are inadequate. If you're _not_ going to use libc
> string functions, then there's no reason not to prefer UTF-32 (when
> you're not using the FSR, which is essentially a fancy immutable
> container for UTF-32 code points) over UTF-8.

Preferring UTF-32 over UTF-8 makes perfect sense. But that's not what you started out arguing. Nick mentioned off-hand that UTF-16 has the worst of both worlds of UTF-8 and UTF-32, Stephen explained that further to someone else, and you challenged his explanation, arguing that UTF-16 doesn't introduce any problems over UTF-8. But it does: it introduces all the same problems as UTF-32 (byte order, BOMs, ASCII incompatibility), but without UTF-32's one compensating benefit, fixed-width code points.


