How do I display unicode value stored in a string variable using ord()
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Aug 19 09:25:14 EDT 2012
On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
>> This standard data structure is called UCS-2 ... There's an extension
>> to UCS-2 called UTF-16
>
> My own understanding is UCS-2 simply shouldn't be used any more.
Pretty much. But UTF-16 with lax support for surrogates (that is,
surrogates are included but treated as two characters) is essentially
UCS-2 with the restriction against surrogates lifted. That's what Python
currently does, and so does Javascript.
http://mathiasbynens.be/notes/javascript-encoding
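To make that concrete, here's a tiny illustration (mine, not from the
standard): one supplementary plane code point, but under lax UTF-16
handling it is stored as a surrogate pair and counted as two "characters".

    # Illustration only: behaviour depends on the build. On a narrow
    # (UTF-16) build of Python 2.x the character below is stored as a
    # surrogate pair and len() reports 2; on a wide build or a PEP 393
    # build (3.3+) it reports 1. Javascript behaves like the narrow build.
    c = u"\U0001D400"        # MATHEMATICAL BOLD CAPITAL A (non-BMP)
    print(len(c))            # 2 on a narrow build, 1 otherwise
    print(repr(c[0]))        # on a narrow build this is the lead surrogate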
The reality is that support for the Unicode supplementary planes is
pretty poor. Even when applications support it, most fonts don't have
glyphs for the characters. Anything which makes handling of Unicode
supplementary characters better is a step forward.
>> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
>> operations are not O(1) but are O(N). That means they are slow, or
>> buggy, pick one.
>
> This I don't see. What are the basic string operations?
The ones I'm specifically referring to are indexing and copying
substrings. There may be others.
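To illustrate (this is a rough sketch of mine, nothing to do with CPython
internals): with a variable-width encoding like UTF-8 you can't jump
straight to character i, you have to walk the bytes from the start, which
is what makes indexing O(N) rather than O(1).

    # Rough sketch only: find the i-th *character* in a UTF-8 byte string
    # by walking the variable-width sequences from the start -- O(N).
    def utf8_char_at(data, index):
        pos = 0
        for _ in range(index + 1):
            start = pos
            lead = ord(data[pos:pos + 1])
            if lead < 0x80:       # 1-byte sequence (ASCII)
                pos += 1
            elif lead < 0xE0:     # 2-byte sequence
                pos += 2
            elif lead < 0xF0:     # 3-byte sequence
                pos += 3
            else:                 # 4-byte sequence (supplementary plane)
                pos += 4
        return data[start:pos].decode('utf-8')

    s = u"a\u00e9\u20ac\U00010348"   # 1-, 2-, 3- and 4-byte characters
    assert utf8_char_at(s.encode('utf-8'), 3) == u"\U00010348"

With a fixed-width representation, s[3] is a single array lookup instead.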
> * Examine the first character, or first few characters ("few" = "usually
> bounded by a small constant") such as to parse a token from an input
> stream. This is O(1) with either encoding.
That's actually O(K), for K = "a few", whatever "a few" means. But we
know that anything is fast for small enough N (or K in this case).
> * Slice off the first N characters. This is O(N) with either encoding
> if it involves copying the chars. I guess you could share references
> into the same string, but if the slice reference persists while the
> big reference is released, you end up not freeing the memory until
> later than you really should.
As a first approximation, memory copying is assumed to be free, or at
least constant time. That's not strictly true, but Big Oh analysis is
looking at algorithmic complexity. It's not a substitute for actual
benchmarks.
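As an aside, here's a small sketch (illustrative only, CPython 3 assumed)
of the trade-off you describe: a str slice copies its characters, so the
big string can be freed as soon as you drop it, whereas a buffer-sharing
slice keeps the whole buffer alive for as long as the small view exists.

    # Illustrative only (CPython 3 behaviour assumed).
    import sys

    big = "x" * 10000000
    head = big[:10]        # copies ten characters; independent of big
    del big                # the ~10 MB string can be reclaimed right away

    data = b"\x00" * 10000000
    view = memoryview(data)[:10]   # shares the underlying buffer
    del data                       # buffer is NOT freed while view exists
    print(sys.getsizeof(head), view.nbytes)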
> Meanwhile, an example of the 393 approach failing: I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.
I assume that this wasn't one giant multi-terabyte string.
> So
> the chars were mostly ascii, but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision. That's a
> natural for UTF-8 but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.
Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)
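If you want to check this for yourself, a quick sanity test on a PEP 393
build (3.3+) is something like the following; the numbers are approximate,
but the point is that only the strings which actually contain a
supplementary plane character pay the four-bytes-per-character price.

    # Rough check on a PEP 393 build (Python 3.3+): per-character storage
    # cost of an all-ASCII "page" versus one containing a single
    # supplementary plane character.
    import sys

    ascii_page = "plain OCR text " * 1000            # ASCII only
    mixed_page = ascii_page + "\U000110F3"           # SORA SOMPENG DIGIT THREE

    print(sys.getsizeof(ascii_page) / float(len(ascii_page)))  # roughly 1
    print(sys.getsizeof(mixed_page) / float(len(mixed_page)))  # roughly 4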
Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.
[...]
> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?
There has to be a first time for everything.
> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered.
Ropes have been considered and rejected because, while they are
asymptotically fast, in common cases the added complexity actually makes
them slower, especially for immutable strings where you aren't inserting
into the middle of a string.
http://mail.python.org/pipermail/python-dev/2000-February/002321.html
PyPy has revisited ropes and uses, or at least used, ropes as their
native string data structure. But that's ropes of *bytes*, not UTF-8.
http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html
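For anyone who hasn't met ropes before, here's a bare-bones sketch (mine,
not PyPy's or anyone else's implementation): concatenation is cheap
because you just build a new node, but indexing has to walk the tree, and
every operation pays the pointer-chasing overhead mentioned above.

    # Bare-bones rope sketch, illustration only.
    class Leaf(object):
        def __init__(self, text):
            self.text = text
            self.length = len(text)
        def char_at(self, i):
            return self.text[i]

    class Concat(object):
        def __init__(self, left, right):
            self.left = left
            self.right = right
            self.length = left.length + right.length   # cached for indexing
        def char_at(self, i):
            if i < self.left.length:
                return self.left.char_at(i)
            return self.right.char_at(i - self.left.length)

    rope = Concat(Leaf("Hello, "), Concat(Leaf("wor"), Leaf("ld")))
    assert rope.length == 12
    assert rope.char_at(7) == "w"      # walks the tree, not a flat array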
--
Steven