How do I display unicode value stored in a string variable using ord()
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sat Aug 18 00:10:30 EDT 2012
On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:
> On 08/17/2012 08:21 PM, Ian Kelly wrote:
>> On Aug 17, 2012 2:58 PM, "Dave Angel" <d at davea.name> wrote:
>>> The internal coding described in PEP 393 has nothing to do with
>>> latin-1 encoding.
>> It certainly does. PEP 393 provides for Unicode strings to be
>> represented internally as any of Latin-1, UCS-2, or UCS-4, whichever is
>> smallest and sufficient to contain the data.
Unicode strings are not represented as Latin-1 internally. Latin-1 is a
byte encoding, not a Unicode internal format. Perhaps you mean to say
that they are represented in a single-byte format?
>> I understand the complaint
>> to be that while the change is great for strings that happen to fit in
>> Latin-1, it is less efficient than previous versions for strings that
>> do not.
>
> That's not the way I interpreted PEP 393. It takes a pure unicode
> string, finds the largest code point in that string, and chooses 1, 2 or
> 4 bytes for every character, based on how many bits it'd take for that
> largest code point.
That's how I interpret it too.
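As a rough sketch of that reading (the helper name char_width is made up
for illustration and is not a CPython API), the width depends only on the
largest code point in the string:

```python
def char_width(s):
    """Bytes per character that the scheme described above would pick."""
    # Largest code point in the string; an empty string counts as 0.
    largest = max(map(ord, s), default=0)
    if largest < 0x100:      # fits in one byte
        return 1
    if largest < 0x10000:    # fits in two bytes
        return 2
    return 4                 # needs the full four bytes

print(char_width("French"))       # all code points below 256
print(char_width("\u20ac"))       # EURO SIGN, U+20AC
print(char_width("\U0001F600"))   # a code point outside the BMP
```

Note that a single high code point anywhere in the string widens every
character in it, which is the "largest code point" rule quoted above.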
> Further, I read it to mean that only 00 bytes would
> be dropped in the process, no other bytes would be changed.
Just to clarify, you aren't talking about the \0 character, but only about
extraneous "padding" 00 bytes.
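One way to see the distinction, using UTF-32 little-endian as a stand-in
for the fixed 4-byte form. This goes through the codecs and does not
inspect CPython's actual memory layout, and drop_padding is a hypothetical
helper written only for this illustration:

```python
def drop_padding(s):
    """Return one byte per character by discarding the three high bytes."""
    wide = s.encode("utf-32-le")           # 4 bytes per code point, no BOM
    # Every byte that is not the low byte of its group must be zero.
    if any(b for i, b in enumerate(wide) if i % 4):
        raise ValueError("string needs more than one byte per character")
    return wide[::4]                       # keep only the low bytes

# The padding goes away, but a real NUL character in the text survives.
assert drop_padding("caf\u00e9") == "caf\u00e9".encode("latin-1")
assert drop_padding("a\0b") == b"a\x00b"
```

For strings whose code points are all below 256, the bytes that remain
have the same values as the Latin-1 encoding of the text.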
> I also figure this is going to be more space efficient than Python 3.2
> for any string which had a max code point of 65535 or less (in Windows),
> or 4 billion or less (in real systems). So unless French has code points
> over 64k, I can't figure that anything is lost.
I think that on narrow builds, it won't make terribly much difference.
The big savings are for wide builds.
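For comparison, the per-character cost under PEP 393 can be measured,
assuming CPython 3.3 or later, where sys.getsizeof reports the size of the
string object. The helper bytes_per_char is hypothetical. A 3.2 narrow
build uses 2 bytes per code unit throughout, and a wide build 4 bytes per
character:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of ch."""
    # Subtracting the sizes of two strings that differ by n characters
    # cancels out the fixed object header.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

for ch in ("a", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

Against a wide build, text that fits in one byte per character shrinks to
roughly a quarter; against a narrow build, the same text only halves.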
--
Steven