How do I display unicode value stored in a string variable using ord()
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Aug 19 02:33:28 EDT 2012
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
> The change does not just benefit ASCII users. It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.
Just to be clear:
If you have many strings which are *mostly* BMP, but have one or two non-
BMP characters in *each* string, you will see no benefit.
But if you have many strings which are all BMP, and only a few strings
containing non-BMP characters, then you will see a big benefit.
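A rough way to see this for yourself, assuming CPython 3.3 or later (where PEP 393 is implemented). The helper name bytes_per_code_point is made up for illustration, and sys.getsizeof reports an implementation detail, so treat this as a sketch rather than a guarantee:

```python
import sys

def bytes_per_code_point(s):
    # Hypothetical helper: concatenating s with itself adds len(s) code
    # points while the fixed object header stays the same, so the size
    # difference isolates the per-character storage cost.
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)

ascii_only = "a" * 1000
all_bmp = "\u20ac" * 1000                    # 1000 euro signs, all in the BMP
mostly_bmp = "\u20ac" * 999 + "\U0001F600"   # a single non-BMP character

for label, s in [("ASCII", ascii_only),
                 ("all BMP", all_bmp),
                 ("mostly BMP", mostly_bmp)]:
    print(label, bytes_per_code_point(s))
```

The single non-BMP character forces the widest representation on the whole string, which is why the "mostly BMP" case gains nothing.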
> Even for narrow build users, there is the benefit that
> with approximately the same amount of memory usage in most cases, they
> no longer have to worry about non-BMP characters sneaking in and
> breaking their code.
Yes! +1000 on that.
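To make the "sneaking in" problem concrete, here is a small sketch, again assuming Python 3.3 or later. An old narrow build exposed UTF-16 code units, so it would have reported the second number below as the length of the string:

```python
ch = "\U0001F600"   # one code point outside the BMP

code_points = len(ch)                            # what Python 3.3+ reports
utf16_units = len(ch.encode("utf-16-le")) // 2   # what a narrow build would have seen

print(code_points, utf16_units)
print(hex(ord(ch)))   # ord() works because ch is a single character
```

On a narrow build, ord() on such a string could fail because the string was really two surrogates long.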
> There is some additional benefit for Latin-1 users, but this has nothing
> to do with Python. If Python is going to have the option of a 1-byte
> representation (and as long as we have the flexible representation, I
> can see no reason not to),
The PEP explicitly states that it only uses a 1-byte format for ASCII
strings, not Latin-1:
"ASCII-only Unicode strings will again use only one byte per character"
and later:
"If the maximum character is less than 128, they use the PyASCIIObject
structure"
and:
"The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient)."
> then it is going to be Latin-1 by definition,
Certainly not, either in fact or in principle. There are a large number
of 1-byte encodings; Latin-1 is hardly the only one.
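For instance, the same byte means different things under different 1-byte encodings. A minimal sketch using three codecs that ship with the standard library (the choice of byte 0xE9 is arbitrary):

```python
raw = bytes([0xE9])

# Decode the same single byte under three different 1-byte encodings.
decoded = {name: raw.decode(name)
           for name in ("latin-1", "koi8_r", "iso8859_7")}

for name, ch in decoded.items():
    print(name, "U+%04X" % ord(ch))
```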
> because that's what 1-byte Unicode (UCS-1, if you will) is. If you have
> an issue with that, take it up with the designers of Unicode.
The designers of Unicode have never created a standard "1-byte Unicode"
or UCS-1, as far as I can determine.
The Unicode standard defines over a million code points (U+0000 through
U+10FFFF), far too many to fit in a single byte. There is some historical
justification for using "Unicode" to mean UCS-2, but with the standard
being extended beyond the BMP, that is no longer valid.
See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.
I think what you are trying to say is that the Unicode designers
deliberately matched the Latin-1 standard for Unicode's first 256 code
points. That's not the same thing though: there is no Unicode standard
mapping to a single byte format.
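The match with Latin-1 is easy to check, as is the fact that it stops at code point 255. A quick sketch:

```python
# Every Latin-1 byte decodes to the code point with the same number...
first_256_match = all(ord(bytes([b]).decode("latin-1")) == b
                      for b in range(256))

# ...but code point 256 and everything above it has no Latin-1 byte.
try:
    "\u0100".encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(first_256_match, fits_in_latin1)
```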
--
Steven