tuples, index method, Python's design

Sun Apr 15 13:49:38 EDT 2007

On Apr 15, 1:55 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> "Rhamphoryncus" <rha... at gmail.com> writes:
> > Indexing cost, memory efficiency, and canonical representation: pick
> > two.  You can't use a canonical representation (scalar values) without
> > some sort of costly search when indexing (O(log n) probably) or by
> > expanding to the worst-case size (UTF-32).  Python has taken the
> > approach of always providing efficient indexing (O(1)), but you can
> > compile it with either UTF-16 (better memory efficiency) or UTF-32
> > (canonical representation).
>
> I still don't get it.  UTF-16 is just a data compression scheme, right?
> I mean, s[17] isn't the 17th character of the (unicode) string regardless
> of which memory byte it happens to live at?  It could be that accessing
> it takes more than constant time, but that's hidden by the implementation.
>
> So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

On Linux (UTF-32):
>>> c = u'\U0010FFFF'
>>> c
u'\U0010ffff'
>>> list(c)
[u'\U0010ffff']

On Windows (UTF-16):
>>> c = u'\U0010FFFF'
>>> c
u'\U0010ffff'
>>> list(c)
[u'\udbff', u'\udfff']

The unicode type's repr hides the distinction, but you can see it with
list.  In the second session your "single character" is actually two
surrogate code points.  s.index(c) finds where the pair starts, but
s[s.index(c)] gives you only the first surrogate, so
c == s[s.index(c)] is False.
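
For anyone without a UTF-16 build at hand, here is a rough sketch that
imitates one by working on the UTF-16 code units directly.  It assumes
Python 3, where str itself always indexes by whole code points, so the
narrow-build behaviour has to be simulated.  The helper names
utf16_units and find_units are made up for this illustration and are
not part of any library.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def find_units(haystack, needle):
    """Index of the first occurrence of the list needle in haystack."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    raise ValueError("substring not found")

c = "\U0010FFFF"
s = "ab" + c

# What a UTF-16 build would store internally.
c_units = utf16_units(c)
s_units = utf16_units(s)

# Stand-in for s.index(c): the search matches the whole surrogate pair...
i = find_units(s_units, c_units)
# ...but indexing at that position yields a single code unit.
first = s_units[i]

print([hex(u) for u in c_units])   # ['0xdbff', '0xdfff']
print([first] == c_units)          # False: only the high surrogate came back
```

The search succeeds because it compares two units at a time, while the
index operation returns one unit, which is the mismatch behind the
broken invariant.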

--
Adam Olsen, aka Rhamphoryncus