How do I display unicode value stored in a string variable using ord()
DJC
djc at news.invalid
Sun Aug 19 11:32:06 EDT 2012
On 19/08/12 15:25, Steven D'Aprano wrote:
> Not necessarily. Presumably you're scanning each page into a single
> string. Then only the pages containing a supplementary plane char will be
> bloated, which is likely to be rare. Especially since I don't expect your
> OCR application would recognise many non-BMP characters -- what does
> U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
> doesn't recognise it, you can't get it in your output. (If you do, the
> OCR software has a nasty bug.)
>
> Anyway, in my ignorant opinion the proper fix here is to tell the OCR
> software not to bother trying to recognise Imperial Aramaic, Domino
> Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
> expecting them in your source material. Not only will the scanning go
> faster, but you'll get fewer wrong characters.
Consider the automated recognition of a CAPTCHA. Because the user has to
type the characters on a keyboard, only the most basic character set can
be used, so the range of possible characters is quite limited.
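
As a rough illustration of both points (spotting supplementary-plane
characters with ord(), and restricting output to an expected character
set), here is a minimal sketch. It assumes Python 3.3 or later, where
each element of a str is a full code point. The ALLOWED set and both
function names are made up for this example and are not part of any OCR
package.

```python
import string

# Hypothetical whitelist: characters a keyboard-entered CAPTCHA might use.
ALLOWED = set(string.ascii_letters + string.digits)

def non_bmp_chars(text):
    """Return (char, 'U+XXXX') pairs for characters outside the BMP."""
    return [(ch, "U+%04X" % ord(ch)) for ch in text if ord(ch) > 0xFFFF]

def unexpected_chars(text, allowed=ALLOWED):
    """Return the characters of text that are not in the allowed set, in order."""
    return [ch for ch in text if ch not in allowed]

# U+110F3 is the SORA SOMPENG DIGIT THREE mentioned in the quoted text.
sample = "aB3\U000110F3"
print(non_bmp_chars(sample))     # supplementary-plane characters found
print(unexpected_chars(sample))  # characters outside the whitelist
```

Anything returned by unexpected_chars() could be treated as a
recognition error and rejected or re-scanned, which is one way to act on
the "don't bother recognising what you don't expect" advice.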