How do I display unicode value stored in a string variable using ord()
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Aug 19 04:01:46 EDT 2012
On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code.
It happens all the time.
Let's say you've got a bunch of text, and you use a regex to scan through
it looking for a match. Let's ignore the regular expression engine, since
it has to look at every character anyway. But you've done your search and
found your matching text and now want everything *after* it. That's not
exactly an unusual use-case.
import re

mo = re.search(pattern, text)    # pattern and text defined elsewhere
if mo:
    start, end = mo.span()       # character offsets of the match
    result = text[end:]          # everything after the match
Easy-peasy, right? But behind the scenes, you have a problem: how does
Python know where text[end:] starts? With fixed-size characters, that's
O(1): Python just moves forward end*width bytes into the string. Nice and
fast.
With variable-sized characters, Python has to start from the beginning
again and inspect each byte or pair of bytes. That turns the slice
operation into O(N), and the combined operation -- repeated as you search
and slice your way through the text -- into O(N**2), and that starts
getting *horrible*.
As always, "everything is fast for small enough N", but you *really*
don't want O(N**2) operations when dealing with large amounts of data.
Insisting that the regex functions only ever return offsets to valid
character boundaries doesn't help you, because the string slice method
cannot know where the indexes came from.
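And if an index *doesn't* land on a character boundary, slicing the raw
UTF-8 hands you half a character. A quick demonstration with bytes, since
that's where the boundary problem lives:

data = "caf\xe9".encode("utf-8")     # b'caf\xc3\xa9' -- the e-acute takes two bytes
print(data[:4])                      # b'caf\xc3' -- cuts a character in half
try:
    data[:4].decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)                       # "unexpected end of data"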
I suppose you could have a "fast slice" and a "slow slice" method, but
really, that sucks. Besides, all that does is pass responsibility for
tracking character boundaries from the language to the developer, and you
know damn well that they will get it wrong: their code will silently do
the wrong thing, and they'll say that Python sucks and we never used to
have this problem back in the good old days with ASCII. Boo sucks to that.
UCS-4 is an option, since that's fixed-width. But it's also bulky. For
typical users, you end up wasting memory. That is the complaint driving
PEP 393 -- memory is cheap, but it's not so cheap that you can afford to
multiply your string memory by four just in case somebody someday gives
you a character in one of the supplementary planes.
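For the record, that flexible storage is what PEP 393 delivers: CPython
3.3+ uses 1, 2 or 4 bytes per character depending on the widest character
actually present. A rough demonstration (exact figures vary by platform
and version):

import sys

ascii_text = "x" * 1000
astral_text = "x" * 999 + "\U0001d11e"   # one supplementary-plane character
print(sys.getsizeof(ascii_text))         # about 1 KB of character data
print(sys.getsizeof(astral_text))        # about 4 KB -- the whole string is widened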
If you have oodles of memory and small data sets, then UCS-4 is probably
all you'll ever need. I hear that the club for people who have all the
memory they'll ever need is holding their annual general meeting in a
phone-booth this year.
You could say "Screw the full Unicode standard, who needs more than 64K
different characters anyway?" Well, apart from Asians, and historians, and
a bunch of other people. If you can control your data and make sure no
non-BMP characters are used, UCS-2 is fine -- except Python doesn't
actually use that.
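For what it's worth, code points above U+FFFF are real, assigned
characters, not a hypothetical. A couple of examples:

for ch in ("\U00020000",     # a CJK Extension B ideograph
           "\U0001d11e"):    # MUSICAL SYMBOL G CLEF
    print(hex(ord(ch)))      # 0x20000 and 0x1d11e -- neither fits in 16 bits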
You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up
to the individual programmer to track character boundaries, and we know
how well that works. Luckily the supplementary planes are only rarely
used, and people who need them tend to buy more memory and use wide
builds. People who only need a few non-BMP characters in a narrow build
generally just cross their fingers and hope for the best.
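If you want to see the trap for yourself, encode a supplementary-plane
character to UTF-16 and count the code units (this is on a wide or 3.3+
build, where len() counts characters):

clef = "\U0001d11e"                         # MUSICAL SYMBOL G CLEF, one character
print(len(clef))                            # 1
print(len(clef.encode("utf-16-be")) // 2)   # 2 code units -- a surrogate pair
# On a 3.2 narrow build, len(clef) itself is 2, and clef[0] is half the pair.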
You could add a whole lot more heavyweight infrastructure to strings,
turn them into souped-up ropes-on-steroids. All those extra indexes mean
that you don't save any memory. Because the objects are so much bigger
and more complex, your CPU cache goes to the dogs and your code still
runs slow.
Which leaves us right back where we started, PEP 393.
> Obviously one can concoct hypothetical examples that would suffer.
If you think "slicing at arbitrary indexes" is a hypothetical example, I
don't know what to say.
--
Steven