tuples, index method, Python's design
Rhamphoryncus
rhamph at gmail.com
Sun Apr 15 13:49:38 EDT 2007
On Apr 15, 1:55 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> "Rhamphoryncus" <rha... at gmail.com> writes:
> > Indexing cost, memory efficiency, and canonical representation: pick
> > two. You can't use a canonical representation (scalar values) without
> > some sort of costly search when indexing (O(log n) probably) or by
> > expanding to the worst-case size (UTF-32). Python has taken the
> > approach of always providing efficient indexing (O(1)), but you can
> > compile it with either UTF-16 (better memory efficiency) or UTF-32
> > (canonical representation).
>
> I still don't get it. UTF-16 is just a data compression scheme, right?
> I mean, s[17] isn't the 17th character of the (unicode) string regardless
> of which memory byte it happens to live at? It could be that accessing
> it takes more than constant time, but that's hidden by the implementation.
>
> So where does the invariant c==s[s.index(c)] fail, assuming s contains c?
On Linux (UTF-32):
>>> c = u'\U0010FFFF'
>>> c
u'\U0010ffff'
>>> list(c)
[u'\U0010ffff']
On Windows (UTF-16):
>>> c = u'\U0010FFFF'
>>> c
u'\U0010ffff'
>>> list(c)
[u'\udbff', u'\udfff']
The unicode type's repr hides the distinction, but you can see it with
list. On the Windows build your "single character" is actually two
surrogate code points. s.index(c) still finds the pair, but
s[s.index(c)] would only give you the first surrogate, which does not
compare equal to c.
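
If you don't have a UTF-16 build handy, here is a rough sketch that imitates one on a current Python 3 interpreter, where strings are always indexed by code point. The helper utf16_code_units is my own invention for illustration, not anything in the standard library; it splits a string into the 16-bit units a UTF-16 build would store, so the failure of c == s[s.index(c)] can be reproduced.

```python
import struct

def utf16_code_units(s):
    """Hypothetical helper: split s into the 16-bit code units that a
    UTF-16 build would store, one single-unit string per list element."""
    data = s.encode('utf-16-le')
    count = len(data) // 2
    units = struct.unpack('<%dH' % count, data)
    return [chr(u) for u in units]

c = '\U0010FFFF'

# On a UTF-16 build, c occupies two elements: a high and a low surrogate.
c_units = utf16_code_units(c)

# Imitate a UTF-16 build's view of a string that contains c.
s_units = utf16_code_units('ab' + c)

# Find where c starts, the way index() would locate the first unit.
i = s_units.index(c_units[0])

# Indexing returns a single unit (the high surrogate), not the whole character.
invariant_holds = (s_units[i] == c)

# ascii() avoids encoding errors when printing lone surrogates.
print(ascii(c_units))
print(len(s_units), i, invariant_holds)
```

Under these assumptions the invariant comes out False, because a single index can only return half of the surrogate pair.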
--
Adam Olsen, aka Rhamphoryncus