[XML-SIG] Re: Issues with Unicode type
Martin v. Loewis
martin@v.loewis.de
24 Sep 2002 01:06:10 +0200
Uche Ogbuji <uche.ogbuji@fourthought.com> writes:
> I think the real problem is rather that nothing says that len()
> operating on Unicode objects is *not* a count of characters. There is
> nothing that says that len is strictly a count of storage values. I
> think it's perfectly natural to assume len() is a count of characters,
> and Python's docs should be clarified in this regard.
I somewhat disagree. I think this is the first time in over a year
that anybody has noticed. By the time somebody notices again, we
might all be using UCS-4 builds, and the problem will be gone.
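
To make the storage-value versus character distinction concrete, here
is a minimal sketch. It assumes a current Python 3 interpreter (3.3 or
later), where len() counts code points, and it emulates the UCS-2 view
by counting 16-bit units of the UTF-16 encoding. The helper name
utf16_code_units is hypothetical, chosen only for this illustration.

```python
def utf16_code_units(text):
    """Count the 16-bit storage units needed to hold text as UTF-16.

    This approximates what a count of storage values looks like when
    each storage value is 16 bits wide.
    """
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one unit.
    return len(text.encode("utf-16-le")) // 2


# One ASCII letter followed by U+10000, the first character outside the
# Basic Multilingual Plane. UTF-16 needs a surrogate pair for it.
sample = "a\U00010000"

code_points = len(sample)                 # counts characters (code points)
storage_units = utf16_code_units(sample)  # counts 16-bit storage values

print(code_points, storage_units)
```

The two counts differ only when the string contains characters above
U+FFFF, which is why the discrepancy is so rarely noticed.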
> Consider that other built-ins such as repr and the literal parsing
> code do deal in characters and not storage values. So why should
> anyone expect len() to be different?
Actually, up to Python 2.3, literal parsing operates on bytes, not
characters. If you have a non-ASCII encoding in your sources, a
backslash escapes only the next byte, which may or may not be the
whole of the next character.
Again, few people ever notice.
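
A rough sketch of the byte-versus-character point: it does not run the
old parser, it only inspects the bytes such a parser would see. UTF-8
as the source encoding and the sample character U+00E9 are assumptions
made for the illustration.

```python
# A backslash followed by one non-ASCII character (U+00E9), as it would
# appear on disk in a source file saved as UTF-8 (assumed encoding).
source_bytes = ("\\" + "\u00e9").encode("utf-8")

# Byte-wise view: the backslash is followed by a single byte.
byte_after_backslash = source_bytes[1:2]

# Character-wise view: the character after the backslash spans 2 bytes.
char_after_backslash = "\u00e9".encode("utf-8")

print(source_bytes)          # the backslash byte plus two more bytes
print(byte_after_backslash)  # only the first byte of the character
print(char_after_backslash)  # the complete encoded character
```

A parser that pairs the backslash with byte_after_backslash has
consumed only half of the character that follows it.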
> As I said the main problem I see with all this in Python is
> inconsistency and lack of docs.
You are just not reading all the docs. There is a PEP that spells out
all these details, deeper than you ever wanted to know.
Regards,
Martin