Unicode and exception strings

Francis Avila francisgavila at yahoo.com
Tue Jan 13 20:55:35 EST 2004


Terry Carroll wrote in message ...
>On 12 Jan 2004 08:41:43 +0100, Rune Froysa <rune.froysa at usit.uio.no>
>wrote:
>The only thing is, what to do with it once you get there.  I don't think
>0xF8 is a valid unicode encoding on its own.  IIRC, it's part of a
>multibyte character.

Yes, about that.

What are the semantics of hexadecimal literals (\xXX escapes) inside
unicode literals?  It seems to me meaningless, if not dangerous, to allow
them there.  What code point would such an escape correspond to?

Python 2.3.2 (#49, Oct  2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\xf8\u00f8'.encode('unicode-internal')
'\xf8\x00\xf8\x00'

I get the same on Linux with Python 2.2.1, x86.
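
For anyone who wants to reproduce this on a current interpreter, here is a
rough equivalent.  It assumes Python 3 (where a bare literal is already a
unicode string) and substitutes the 'utf-16-le' codec for 'unicode-internal',
which may not be available there.  The substitution is my assumption: on a
little-endian x86 build storing two bytes per character, the two should give
the same bytes for this string.

```python
# Assumption: Python 3, so a plain '...' literal is a unicode string.
text = '\xf8\u00f8'

# Inspect the code point behind each of the two escapes.
code_points = [hex(ord(ch)) for ch in text]

# 'utf-16-le' stands in for 'unicode-internal' (an assumption, see above).
utf16_bytes = text.encode('utf-16-le')

print(code_points)   # ['0xf8', '0xf8']
print(utf16_bytes)   # b'\xf8\x00\xf8\x00'
```

Both escapes come out as the same code point, and each is stored as the two
bytes F8 00.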

So, is a hexadecimal literal a shorthand for \u00XX, i.e., unicode code
point XX?  Or does it bypass the code point abstraction entirely, preserving
the raw bits unchanged for any encoding of the unicode string (thus
rendering unicode useless)?
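
A small experiment suggests the first reading, at least on the interpreters
I can check.  The sketch below assumes Python 3 syntax; the variable names
are only for illustration.  If \xf8 preserved raw bits, every encoding would
have to emit the byte F8, and that is not what happens.

```python
# Assumption: Python 3, where '...' is unicode and b'...' is a byte string.
hex_escape = '\xf8'
u_escape = '\u00f8'

# The two spellings compare equal and name the same code point.
same = hex_escape == u_escape        # True
code_point = ord(hex_escape)         # 248, i.e. 0xF8

# The bytes depend on the encoding chosen, not on how the literal was spelled.
as_latin1 = hex_escape.encode('latin-1')   # b'\xf8'
as_utf8 = hex_escape.encode('utf-8')       # b'\xc3\xb8'

# A lone F8 byte, by contrast, is not decodable as UTF-8 at all.
try:
    b'\xf8'.decode('utf-8')
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(same, code_point, as_latin1, as_utf8, lone_byte_is_valid_utf8)
```

So the escape goes through the code point abstraction: it is U+00F8, and only
Latin-1 happens to encode that as the single byte F8.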

Once again, I don't see why hexadecimal literals should be allowed at all,
except maybe for compatibility when moving to Python -U behavior.  But I
submit that all such code is broken, and should be fixed.  If you're using
hexadecimal literals, what you have is not a unicode string but a byte
sequence.

This whole unicode/bytestring mess is going to have to be sorted out
eventually.  It seems to me that it would be best to have all bare string
literals be unicode objects (henceforth called 'str' or 'string' objects?),
drop the unicode literal, and make a new type and literal prefix for byte
sequences, possibly dropping the traditional str methods or absorbing more
appropriate ones.  Perhaps some struct functionality could be folded in?

Of course, this breaks absolutely everything.
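
To make the proposal concrete, here is roughly what such a split looks like
in Python 3, which has unicode 'str' literals and a separate b'...' byte
literal.  This is a sketch and not a design: the names 'label' and 'payload'
are made up for illustration, and int.from_bytes is shown only as one example
of struct-like behaviour living on the byte type.

```python
import struct

# Text: a sequence of code points.  \xf8 here means U+00F8.
label = 'r\xf8d'

# Bytes: a sequence of raw octets.  \xf8 here means the byte 0xF8.
payload = b'\xf8\x00'

# Going from text to bytes requires naming an encoding explicitly.
encoded = label.encode('utf-8')            # b'r\xc3\xb8d'

# Interpreting bytes as a number, via struct and via the byte type itself.
(value,) = struct.unpack('<H', payload)    # 248
same_value = int.from_bytes(payload, 'little')

# Mixing the two types is an error rather than a silent coercion.
try:
    label + payload
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(encoded, value, same_value, mixing_allowed)
```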

--
Francis Avila



