How do I display unicode value stored in a string variable using ord()

Paul Rubin no.email at nospam.invalid
Sun Aug 19 10:04:25 CEST 2012


Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
> This is a long post. If you don't feel like reading an essay, skip to the 
> very bottom and read my last few paragraphs, starting with "To recap".

I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response to my post.  I can
only hope some readers will benefit from it.  I regret that I wasn't
more clear about the perspective I posted from, i.e. that I'm already
familiar with how those encodings work.

After reading all of it, I still have the same skepticism on the main
point as before, but I think I see what the issue in contention is, and
some differences in perspective.  First of all, you wrote:

> This standard data structure is called UCS-2 ... There's an extension
> to UCS-2 called UTF-16

My own understanding is UCS-2 simply shouldn't be used any more.
Unicode was historically supposed to be a 16-bit character set, but that
turned out to not be enough, so the supplementary planes were added.
UCS-2 thus became obsolete and UTF-16 superseded it in 1996.  UTF-16 in
turn is rather clumsy and the later UTF-8 is better in a lot of ways,
but both of these are at least capable of encoding all the character
codes.
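For concreteness, the surrogate-pair arithmetic that lets UTF-16 reach
the supplementary planes can be sketched in a few lines (a toy function
of my own for illustration, not from any library):

```python
def utf16_units(cp):
    """Return the UTF-16 code units for a codepoint: one unit for the
    BMP, a high/low surrogate pair for U+10000 and above."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000                       # 20 bits remain
    return [0xD800 + (cp >> 10),        # high surrogate: top 10 bits
            0xDC00 + (cp & 0x3FF)]      # low surrogate: bottom 10 bits
```

So e.g. U+10400 comes out as the pair D801 DC00, which is exactly what a
16-bit-unit representation has to store for it.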

On to the main issue:

> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string 
> operations are not O(1) but are O(N). That means they are slow, or buggy, 
> pick one.

This I don't see.  What are the basic string operations?

* Examine the first character, or first few characters ("few" = "usually
  bounded by a small constant") such as to parse a token from an input
  stream.  This is O(1) with either encoding.

* Slice off the first N characters.  This is O(N) with either encoding
  if it involves copying the chars.  I guess you could share references
  into the same string, but if the slice reference persists while the
  big reference is released, you end up not freeing the memory until
  later than you really should.

* Concatenate two strings.  O(N) either way.

* Find length of string.  O(1) either way since you'd store it in
  the string header when you build the string in the first place.
  Building the string has to have been an O(N) operation in either
  representation.

And finally:

* Access the nth char in the string for some large random n, or maybe
  get a small slice from some random place in a big string.  This is
  where fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of is that this last operation happens all that
often.
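To make the O(N) cost of that last operation concrete, here is the naive
random access a plain UTF-8 byte array forces on you (my own sketch): you
have to scan lead bytes from the start, counting codepoints as you go.

```python
def utf8_char_at(data, n):
    """Naive O(N) random access into UTF-8 bytes: walk the lead bytes,
    counting codepoints until the n-th is reached."""
    pos = count = 0
    while pos < len(data):
        b = data[pos]
        # lead byte tells us the width of this codepoint's encoding
        width = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        if count == n:
            return data[pos:pos + width].decode('utf-8')
        count += 1
        pos += width
    raise IndexError(n)
```

With a fixed-width representation the same lookup is a single array
index.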

Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii, but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision.  That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.
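That bloat is easy to measure on a post-PEP-393 interpreter (CPython
3.3+); exact byte counts vary by version, but one supplementary-plane
character is enough to widen the whole string to four bytes per
codepoint:

```python
import sys

ascii_text = 'a' * 1000                # stored at 1 byte per codepoint
mixed_text = 'a' * 999 + '\U0001F600'  # one astral char -> 4 bytes each

print(sys.getsizeof(ascii_text))       # roughly 1 kB plus header
print(sys.getsizeof(mixed_text))       # roughly 4 kB plus header
```

In UTF-8 the second string would only be a few bytes longer than the
first.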

    py> s = chr(0xFFFF + 1)
    py> a, b = s

That looks like a bug in Python 3.2: presumably a narrow build is
storing the supplementary-plane character as a UTF-16 surrogate pair, so
the string has length 2 and unpacks.  s is a one-character string and
should not be unpackable; that sample should just throw an error.
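For reference, on a post-PEP-393 interpreter (Python 3.3+) the sample
does behave the way argued for here:

```python
s = chr(0xFFFF + 1)       # U+10000, first supplementary-plane codepoint
assert len(s) == 1        # one codepoint, one character

try:
    a, b = s              # unpacking a one-character string now fails
except ValueError as e:
    print('cannot unpack:', e)
```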

I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered.  By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector of n//k pointers into the byte array, where
n is the number of codepoints in the string.  Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it.  Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.
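A minimal sketch of that idea (class and method names are mine, and this
skips ropes/concatenation entirely, keeping just the block index): store
the UTF-8 bytes plus the byte offset of every k-th codepoint, so random
access seeks to the nearest indexed block and decodes at most k-1
characters.

```python
class Utf8Index:
    """UTF-8 byte array plus a byte-offset index for every k-th
    codepoint, giving O(k) random access instead of O(N)."""

    def __init__(self, text, k=128):
        self.k = k
        self.data = text.encode('utf-8')
        self.length = len(text)
        self.offsets = []              # byte offset of codepoints 0, k, 2k, ...
        pos = 0
        for i, ch in enumerate(text):
            if i % k == 0:
                self.offsets.append(pos)
            pos += len(ch.encode('utf-8'))

    def __len__(self):
        return self.length

    def _char_width(self, pos):
        # width of the codepoint whose lead byte is at pos
        b = self.data[pos]
        return 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

    def __getitem__(self, n):
        if not 0 <= n < self.length:
            raise IndexError(n)
        # seek to the nearest indexed block, then scan at most k-1 chars
        pos = self.offsets[n // self.k]
        for _ in range(n % self.k):
            pos += self._char_width(pos)
        return self.data[pos:pos + self._char_width(pos)].decode('utf-8')
```

The index costs one pointer per k codepoints, so the space overhead is
tunable against the access constant.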
