a simple unicode question

Chris Jones cjns1989 at gmail.com
Thu Oct 22 05:43:58 EDT 2009


On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote:

[..]

> Characters outside the 16-bit range aren't supported on all builds.
> They won't be supported on most Windows builds, as Windows uses 16-bit
> Unicode extensively:

I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)

Moot point on xterm anyway, since you'd be hard put to find a
decent terminal font that covers anything outside the BMP.

> 	Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
> 
> 	>>> unichr(0x10000)
> 	Traceback (most recent call last):
> 	  File "<stdin>", line 1, in <module>
> 	ValueError: unichr() arg not in range(0x10000) (narrow Python build)
> 
> Note that narrow builds do understand names outside of the BMP, and
> generate surrogate pairs for them:
> 
> 	>>> u'\N{LINEAR B SYLLABLE B008 A}'
> 	u'\U00010000'
> 	>>> len(_)
> 	2
> 
> Whether or not using surrogates in this context is a good idea is open to
> debate. What's the advantage of a multi-wchar string over a multi-byte
> string?

I don't understand this last remark, but since I'm only a GNU/Linux
hobbyist, I guess it doesn't make much difference.
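
For what it's worth, here is my rough attempt at working out what the
surrogate pair business amounts to. This is only a sketch, assuming
Python 3 (where str always holds full code points, so the narrow/wide
distinction above no longer applies). The helper names
to_surrogate_pair and unit_counts are my own inventions, not anything
from the standard library. It splits a code point above U+FFFF into its
two 16-bit halves and then counts how many UTF-16 code units and UTF-8
bytes the same character needs, which seems to be what the "multi-wchar
versus multi-byte" remark is getting at: either way, one character no
longer fits in one storage unit.

```python
# Python 3 sketch; the helper names are hypothetical, not from the thread.

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


def unit_counts(text):
    """Return (UTF-16 code units, UTF-8 bytes) needed to store text."""
    utf16_units = len(text.encode("utf-16-le")) // 2
    utf8_bytes = len(text.encode("utf-8"))
    return utf16_units, utf8_bytes


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))           # 0xd800 0xdc00
print(unit_counts("\U00010000"))     # (2, 4)
print(unit_counts("A"))              # (1, 1)
```

So U+10000 takes two 16-bit units in UTF-16 and four bytes in UTF-8,
which matches the len(_) of 2 reported by the narrow build above.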

Thanks for the code snippet and comments.

CJ


