[Python-3000] BOM handling
Antoine Pitrou
solipsis at pitrou.net
Wed Sep 13 22:33:22 CEST 2006
On Wednesday, 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
> And is generally ignored, as per unicode spec; it's a "zero width
> non-breaking space" - an invisible character with no effect on wrapping
> or otherwise.
Well, it would be better if Py3K (with all strings being unicode) made
things easy for the programmer and abstracted away those "invisible
characters with no textual meaning". Currently that is not the case:
>>> import codecs
>>> a = "hello".decode("utf-8")
>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False
>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False
>>> a
u'hello'
>>> b
u'\ufeffhello'
>>> print a
hello
>>> print b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
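For comparison, here is a rough sketch of how the mark can be dropped at decode time. It is written for Python 3 syntax rather than the 2.4 session above, and it assumes the "utf-8-sig" codec is available (it is not in 2.4). The helper strip_bom and the variable names are only illustrative, not an existing API.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the character a decoded byte order mark becomes


def strip_bom(text):
    # Hypothetical helper: drop a single leading U+FEFF, leave the rest alone.
    return text[1:] if text.startswith(BOM) else text


utf8_marked = codecs.BOM_UTF8 + "hello".encode("utf-8")
utf16_marked = codecs.BOM_UTF16_LE + "hello".encode("utf-16le")

# The plain codecs keep the mark as a real character, as in the session above.
kept_utf8 = utf8_marked.decode("utf-8")
kept_utf16 = utf16_marked.decode("utf-16le")

# "utf-8-sig" skips a leading UTF-8 signature; "utf-16" reads the mark to
# choose the byte order and does not return it as part of the text.
clean_utf8 = utf8_marked.decode("utf-8-sig")
clean_utf16 = utf16_marked.decode("utf-16")

print(len(kept_utf8), len(clean_utf8))
print(strip_bom(kept_utf16) == clean_utf16)
```

The point of the sketch is that the caller still has to pick the right codec or strip the character by hand; nothing does it automatically when the endianness-specific codec is named.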
Regards
Antoine.