[Python-ideas] Processing surrogates in

random832 at fastmail.us
Fri May 15 23:52:18 CEST 2015


On Fri, May 15, 2015, at 15:37, Andrew Barnert wrote:
> Conversely, if you have UTF-16--even in native order and with the BOM
> stripped--you don't have text, you still have bytes (or WCHARs, if you
> prefer, but not in Python).

This line of discussion began with someone asserting the [dubious]
merits of using the native libc functions, which on Windows does mean
UTF-16 WCHARs as well as (ASCII, but certainly not properly-handled
UTF-8) bytes.

> I explicitly mentioned opening the file in binary mode, reading it in,
> and passing it to some fromstring function that takes bytes, so yes, of
> course you have a byte array.

Why would a fromstring function take bytes? How would you use re.split
on it?
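(For the record, Python 3's re module does operate on bytes, provided
the pattern is bytes as well; mixing str patterns with bytes input
raises TypeError. A quick sketch:)

```python
import re

data = b"alpha\r\nbravo\r\ncharlie"

# A bytes pattern splits bytes input just fine:
parts = re.split(rb"\r\n", data)
assert parts == [b"alpha", b"bravo", b"charlie"]

# A str pattern on bytes input is rejected:
try:
    re.split(r"\r\n", data)
except TypeError:
    pass
```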

> > You shouldn't have WCHARS (of any kind) in the first place until you've
> > decoded.
> 
> And yet Microsoft's APIs, both Win32 and MSVCRT, are full of wread and
> similar functions.

No such thing as "wread". And given the appropriate flags to _open,
_read can perform decoding.

> But anyway, I'll grant that you usually shouldn't have WCHARs before
> you've decoded.
> 
> But you definitely should not have WCHARs _after_ you've decoded. In
> fact, you _can't_ have them after you've decoded, because a WCHAR isn't
> big enough to hold a Unicode code point.

You're nitpicking on word choice. Going from bytes to UTF-16 words
[whether as WCHAR or unsigned short] is a form of decoding. Or don't you
think Python narrow builds' decode function was properly named?
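(The narrow-build view can be reproduced on a current Python: decode to
str, then re-encode as UTF-16-LE and read the result as unsigned shorts.
An astral character comes out as a surrogate pair, exactly as it did in
a narrow build. A sketch:)

```python
import array

utf8_bytes = "a\U0001F600".encode("utf-8")          # 'a' + an astral emoji
text = utf8_bytes.decode("utf-8")                   # real code points
units = array.array("H", text.encode("utf-16-le"))  # 16-bit code units

# U+1F600 becomes the surrogate pair D83D DE00:
assert list(units) == [0x0061, 0xD83D, 0xDE00]
```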

> But many specific static patterns _do_ work with ASCII compatible
> encodings. Again, think of HTTP responses. Even though the headers and
> body are both text, they're defined as being separated by b"\r\n\r\n".

Right, but those aren't UTF-8. Working with ASCII is fine, but don't
pretend you've actually found a way to work with UTF-8.
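(The ASCII case being conceded here is easy to illustrate: splitting on
b"\r\n\r\n" is safe even when the body is UTF-8, because UTF-8 never
reuses byte values below 0x80 inside a multi-byte sequence. The response
below is hypothetical, purely for illustration:)

```python
# Split a raw HTTP response into header and body on the ASCII delimiter.
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    + "na\u00efve caf\u00e9".encode("utf-8")  # UTF-8 body
)
header, _, body = response.partition(b"\r\n\r\n")
assert header.startswith(b"HTTP/1.1 200 OK")
assert body.decode("utf-8") == "na\u00efve caf\u00e9"
```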

> Preferring UTF-32 over UTF-8 makes perfect sense. But that's not what you
> started out arguing. Nick mentioned off-hand that UTF-16 has the worst of
> both worlds of UTF-8 and UTF-32, Stephen explained that further to
> someone else, and you challenged his explanation, arguing that UTF-16
> doesn't introduce any problems over UTF-8.
> But it does. It introduces all
> the same problems as UTF-32, but without any of the benefits.

No, because UTF-32 has the additional problem, shared with UTF-8, that
(Windows) libc doesn't support it.

My point was that if you want the benefits of using libc you have to pay
the costs of using libc, and that means using libc's native encodings.
Which, on Windows, are UTF-16 and (e.g.) Codepage 1252. If you don't
want the benefits of using libc, then there's no benefit to using UTF-8.
