On 30 January 2018 at 06:54, Chris Barker wrote:
On Fri, Jan 26, 2018 at 5:27 PM, Steven D'Aprano wrote:
Tcl/Tk and JavaScript only support UCS-2 (16 bit) Unicode strings. Dealing with the Supplementary Unicode Planes has the same problems that older "narrow" builds of Python suffered from: single code points were counted as length 2 instead of 1, slicing could be wrong, etc.
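To make the "length 2 instead of 1" problem concrete, here is a minimal sketch that counts the 16-bit code units a UTF-16 based consumer would see. The helper name utf16_code_units is hypothetical and only for illustration; it is not part of any library.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16.

    Hypothetical helper. "utf-16-le" is used so that no byte order
    mark is included in the encoded bytes.
    """
    return len(text.encode("utf-16-le")) // 2


bmp_text = "caf\u00e9"      # every code point is in the BMP
astral_text = "\U0001F600"  # a single code point outside the BMP

# For BMP-only text the two counts agree; the astral code point
# needs a surrogate pair, so it occupies two code units.
print(len(bmp_text), utf16_code_units(bmp_text))
print(len(astral_text), utf16_code_units(astral_text))
```

A consumer that indexes by code unit would therefore see the second string as two elements, which is where the length and slicing mismatches come from.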
There are still many applications which assume Latin-1 data. For instance, I use a media player which displays mojibake when passed anything outside of Latin-1.
Sometimes it is useful to know in advance when text you pass to another application is going to run into problems because of the other application's limitations.
I'm confused -- isn't the way to do this to encode your text into the encoding the other application accepts?
If you really want to know in advance, is it so hard to run it through an encode/decode sandwich?
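A minimal sketch of such an encode/decode sandwich, assuming the target application's encoding is known and has a Python codec (Latin-1 is used here as an example). The function name survives_round_trip is hypothetical.

```python
def survives_round_trip(text: str, encoding: str) -> bool:
    """True if text can be encoded with the given codec and decoded back unchanged.

    Hypothetical helper. With the default strict error handler, a
    character the codec cannot represent raises UnicodeEncodeError.
    """
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False


print(survives_round_trip("caf\u00e9", "latin-1"))           # fits in Latin-1
print(survives_round_trip("\u65e5\u672c\u8a9e", "latin-1"))  # does not fit
```

This allocates a temporary bytes object (and a second str) purely to answer a yes/no question.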
Wait -- I can't find UCS-2 in the built-in encodings -- am I dense or is it not there? Shouldn't it be? If only for this reason?
If you're wanting to check whether or not something lies entirely within the BMP, check for:

    2*len(text) == len(text.encode("utf-16-le")) # True iff text is UCS-2

If there's an astral code point in there, then the encoded version will need more than 2 bytes for at least one element, so the result will end up being longer than it would for UCS-2 data. (Using "utf-16-le" rather than plain "utf-16" avoids counting the 2-byte BOM that the latter prepends.)

You can also check for pure ASCII in much the same way:

    len(text) == len(text.encode("utf-8")) # True iff text is 7-bit ASCII

So this is partly an optimisation question:

- folks want to avoid allocating a bytes object just to throw it away
- folks want to avoid running the equivalent of "max(map(ord, text))"
- folks know that CPython (at least) tracks this kind of info internally to manage its own storage allocations

But it's also a readability question: "is_ascii()" and "is_UCS2()/is_BMP()" just require knowing what 7-bit ASCII and UCS-2 (or the basic multilingual plane) *are*, whereas the current ways of checking for them require knowing how they *behave*.

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
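For reference, a minimal sketch of the checks discussed above written as named helpers. The names is_ascii and is_bmp are hypothetical, borrowed from the spellings floated in the message. These pure-Python versions scan the code points rather than using anything CPython tracks internally, so they illustrate the readability argument, not the optimisation one. On Python 3.7 and later, the built-in str.isascii() covers the ASCII case.

```python
def is_ascii(text: str) -> bool:
    """True if every code point is 7-bit ASCII (hypothetical helper)."""
    return all(ord(ch) < 0x80 for ch in text)


def is_bmp(text: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane (hypothetical helper)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


samples = ["plain", "caf\u00e9", "\U0001F600"]
for s in samples:
    # Cross-check against the length-based tests; an explicit byte
    # order is used for UTF-16 so that no BOM is counted.
    assert is_ascii(s) == (len(s) == len(s.encode("utf-8")))
    assert is_bmp(s) == (2 * len(s) == len(s.encode("utf-16-le")))
    print(ascii(s), is_ascii(s), is_bmp(s))
```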