Short questions wrt Python & Unicode

KvS keesvanschaik at gmail.com
Fri Jun 9 09:26:51 EDT 2006


John Machin wrote:
> On 9/06/2006 10:04 PM, KvS wrote:
>
> > 2) How do I get a representation of a unicode object in terms of Unicode
> > code points? repr() doesn't do that; it sometimes shows the character
> > itself or a different escape instead:
> >
> >|>>> s=u"\u0040\u0166\u00e6"
> >|>>> s
> > u'@\u0166\xe6'
>
> |>>> ' '.join('U+%04X' % ord(c) for c in s)
> 'U+0040 U+0166 U+00E6'
>
> If you'd prefer it more Pythonic than unicode.orgic, adjust the format
> string and separator to suit your taste.
>
> > (does this latter \xe6 have to do with the internal representation of
> > unicode objects, maybe with this UCS-2 encoding?)
>
> |>>> u'\xe6' == u'\u00e6' == unichr(0xe6)
> True
> |>>> hex(ord(u'\u00e6'))
> '0xe6'
>
> U+nnnnnn is represented internally as the integer 0xnnnnnn -- except if
> it won't fit, but you can pretend that surrogate pairs don't exist, for
> the moment :-)
>
> Cheers,
> John
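
For the archives, here is the one-liner above wrapped in a small helper,
plus a peek at what a surrogate pair looks like. This is only a sketch:
the names code_points and utf16_units are made up for illustration, and
the code is written so that it should also run on Python 3 (where every
str is Unicode and chr() replaces unichr()).

```python
import struct

def code_points(s, sep=' '):
    """Return the code points of s in U+XXXX notation."""
    return sep.join('U+%04X' % ord(c) for c in s)

def utf16_units(s):
    """Return the 16-bit code units of s when encoded as UTF-16."""
    data = s.encode('utf-16-be')
    units = struct.unpack('>%dH' % (len(data) // 2), data)
    return ['0x%04X' % u for u in units]

s = u"\u0040\u0166\u00e6"
print(code_points(s))              # U+0040 U+0166 U+00E6

# \xe6 and \u00e6 are two spellings of the same code point.
print(u'\xe6' == u'\u00e6')        # True

# A code point above U+FFFF does not fit in 16 bits, so UTF-16
# stores it as two code units (a surrogate pair).
print(utf16_units(u'\U0001D11E'))  # ['0xD834', '0xDD1E']
```

On a narrow (UCS-2) build, iterating over a string containing such a
character would yield the two surrogates separately, so code_points
would print them as two entries.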

Thanks to you and Fredrik! What about q1? I know it's silly, since for
integers, e.g., one doesn't give such an issue any thought at all; it's
just that this business of en-/decodings etc. makes things a bit
blurrier to me. It should be the case that a package may do whatever it
wants internally (en-/decoding etc.) to represent/manipulate Unicode
strings, but should always communicate with the outside world via the
interchangeable & uniform Python unicode object, right?
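
To make the question concrete, here is a rough sketch of the arrangement
I have in mind. The function shout() stands in for some hypothetical
package and is not a real API; the choice of UTF-8 for the internal and
external encodings is just an assumption for the example.

```python
def shout(text):
    """Hypothetical package function: Unicode in, Unicode out."""
    # Internally the package may use whatever encoding it likes ...
    raw = text.encode('utf-8')
    # ... (byte-level work would go here) ...
    # ... but it decodes again before handing the result back.
    return raw.decode('utf-8').upper()

incoming = b'\xc5\xa6\xc3\xa6'      # UTF-8 bytes from the outside world
text = incoming.decode('utf-8')     # decode once, at the boundary
result = shout(text)                # only Unicode objects cross the API
outgoing = result.encode('utf-8')   # encode once, on the way out

print(repr(text))
print(repr(result))
```

The caller never needs to know which encoding shout() used inside; it
only ever sees Unicode objects going in and coming out.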



