How do I display unicode value stored in a string variable using ord()
steve+comp.lang.python at pearwood.info
Sun Aug 19 08:33:28 CEST 2012
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
> The change does not just benefit ASCII users. It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.
Just to be clear:
If you have many strings which are *mostly* BMP, but have one or two non-
BMP characters in *each* string, you will see no benefit.
But if you have many strings which are all BMP, and only a few strings
containing non-BMP characters, then you will see a big benefit.
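A rough way to see this for yourself, assuming CPython 3.3 or later (where PEP 393 is implemented); the helper name bytes_per_char is made up for illustration, and sys.getsizeof figures are implementation details rather than guarantees:

```python
import sys

def bytes_per_char(s):
    """Estimate per-character storage by comparing s with s doubled.

    Hypothetical helper: the fixed object header cancels out in the
    subtraction, leaving only the cost of the extra len(s) characters.
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) / len(s)

all_bmp = "\u03a9" * 1000                    # 1000 BMP characters
mostly_bmp = "\u03a9" * 999 + "\U0001F600"   # 999 BMP + one non-BMP character

print("all BMP:   ", bytes_per_char(all_bmp), "bytes per character")
print("mostly BMP:", bytes_per_char(mostly_bmp), "bytes per character")
```

A single non-BMP character is enough to widen the storage for every character in that string, which is why the benefit depends on whole strings being BMP-only.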
> Even for narrow build users, there is the benefit that
> with approximately the same amount of memory usage in most cases, they
> no longer have to worry about non-BMP characters sneaking in and
> breaking their code.
Yes! +1000 on that.
> There is some additional benefit for Latin-1 users, but this has nothing
> to do with Python. If Python is going to have the option of a 1-byte
> representation (and as long as we have the flexible representation, I
> can see no reason not to),
The PEP explicitly states that it only uses a 1-byte format for ASCII
strings, not Latin-1:
"ASCII-only Unicode strings will again use only one byte per character"
"If the maximum character is less than 128, they use the PyASCIIObject structure"
"The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient)."
> then it is going to be Latin-1 by definition,
Certainly not, either in fact or in principle. There are a large number
of 1-byte encodings; Latin-1 is hardly the only one.
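To illustrate, here is a small sketch that decodes the same single byte with three of the 1-byte codecs that ship with Python (the choice of byte and of codecs is arbitrary):

```python
raw = b"\xe9"  # one byte, value 233

# Three single-byte encodings from the standard library.
codecs_to_try = ["latin-1", "cp437", "koi8_r"]

decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, ch in decoded.items():
    # Show the resulting character and its Unicode code point.
    print(f"{name:8} -> {ch!r} (U+{ord(ch):04X})")
```

Each codec maps the byte to a different code point, so "one byte per character" does not by itself pin down which encoding is meant.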
> because that's what 1-byte Unicode (UCS-1, if you will) is. If you have
> an issue with that, take it up with the designers of Unicode.
The designers of Unicode have never created a standard "1-byte Unicode"
or UCS-1, as far as I can determine.
The Unicode standard refers to over a million code points, far too
many to fit in a single byte. There is some historical justification for
using "Unicode" to mean UCS-2, but with the standard being extended
beyond the BMP, that is no longer valid.
See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.
I think what you are trying to say is that the Unicode designers
deliberately matched the Latin-1 standard for Unicode's first 256 code
points. That's not the same thing though: there is no Unicode standard
mapping to a single byte format.
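The match between Latin-1 and the first 256 code points is easy to check. A minimal sketch (variable names are arbitrary):

```python
# Build a string holding code points U+0000 through U+00FF in order.
first_256 = "".join(chr(i) for i in range(256))

# Encoding to Latin-1 yields the byte values 0..255 unchanged.
encoded = first_256.encode("latin-1")
print(encoded == bytes(range(256)))

# Anything past U+00FF has no Latin-1 representation.
try:
    chr(256).encode("latin-1")
    beyond_ok = True
except UnicodeEncodeError:
    beyond_ok = False
print(beyond_ok)
```

So Latin-1 covers exactly code points 0 through 255 and nothing more, which is a property of the Latin-1 codec rather than a Unicode-defined one-byte form.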