[Tutor] unichr not working as expected

Steven D'Aprano steve at pearwood.info
Tue Jul 23 06:00:34 CEST 2013


On 23/07/13 04:14, Jim Mooney wrote:
>   I tried translating the odd chars I found in my dos tree /f listing to
> symbols, but I'm getting this error. The chars certainly aren't over
> 10000,  The ord is only 13 - so what's wrong here?
>
> def main():
>      zark = ''
>      for x in "ÀÄÄÄ":
>          zark += unichr(ord(x)-45)


This is broken in three ways that I can see.

Firstly, assuming you are using Python 2.7 (as you have said in the past), "ÀÄÄÄ" does not mean what you think it means.

In Python 3, this is a Unicode string containing four individual characters:

LATIN CAPITAL LETTER A WITH GRAVE
LATIN CAPITAL LETTER A WITH DIAERESIS
LATIN CAPITAL LETTER A WITH DIAERESIS
LATIN CAPITAL LETTER A WITH DIAERESIS

Why you have duplicates, I do not know :-)

But in Python 2, that's not what you will get. What you get depends on your environment, and is unpredictable. For example, on my system, using a Linux terminal interactively with the terminal set to UTF-8, I get:

py> for c in "ÀÄ":  # removing duplicates
...     print c, ord(c)
...
� 195
� 128
� 195
� 132


Yes, that's right, I get FOUR (not two) "characters" (actually bytes). But if I change the terminal settings to, say, ISO-8859-7:

py> for c in "ΓΓ":
...     print c, ord(c)
...
Γ 195
  128
Γ 195
  132

the bytes stay the same (195, 128, 195, 132) but the *meaning* of those bytes changes completely.

So, the point is, if you are running Python 2.7, what you get from a byte string like "ÀÄ" is unpredictable. What you need is a Unicode string u"ÀÄ", which will be exactly what it looks like.
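
To make the distinction concrete, here is a minimal sketch that spells out the same four bytes explicitly and decodes them two different ways. The variable names are mine, purely for illustration, and it assumes Python 2.6 or later (it also runs on Python 3):

```python
# The four bytes from the sessions above: 195, 128, 195, 132.
raw = b"\xc3\x80\xc3\x84"

# bytearray yields integers on both Python 2 and Python 3.
byte_values = list(bytearray(raw))

# Same bytes, two different interpretations.
as_utf8 = raw.decode("utf-8")        # two characters: A-grave, A-diaeresis
as_greek = raw.decode("iso-8859-7")  # four characters, the first is capital gamma

print(byte_values)
print(len(as_utf8))
print(len(as_greek))
```

Once the bytes have been decoded with the right encoding, you have a proper Unicode string and the terminal settings no longer change what it contains.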

That's the first issue.

Second issue, you build up a string using this idiom:

zark = ''
for c in something:
     zark += c


Even though this works, this is a bad habit to get into and you should avoid it: it risks being unpredictably slower than continental drift, and in a way that is *really* hard to diagnose. I've seen a case of this fool the finest Python core developers for *weeks*, regarding a reported bug where Python was painfully slow, but only for SOME Windows users and not others.

Accumulating strings using + can be slow when there are a lot of strings because each + may build a brand-new string, copying everything accumulated so far. That makes it a Shlemiel the painter's algorithm:

http://www.joelonsoftware.com/articles/fog0000000319.html
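
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the worst case, where every + copies the whole accumulated string into a new one, and the function name is made up for illustration:

```python
def chars_copied_by_repeated_concat(pieces):
    """Count characters copied if every += builds a brand new string."""
    copied = 0
    length = 0
    for piece in pieces:
        length += len(piece)
        copied += length  # the whole result so far is copied again
    return copied

pieces = ["x"] * 1000
worst_case = chars_copied_by_repeated_concat(pieces)
single_pass = sum(len(p) for p in pieces)

print(worst_case)   # 500500
print(single_pass)  # 1000
```

A thousand one-character pieces cost about half a million character copies in the worst case, against a thousand if each piece is copied only once. The gap grows with the square of the number of pieces.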


The reason why sometimes it is *not* slow is that CPython 2.4 and beyond includes a clever optimization trick which can *sometimes* fix this issue, but it depends on details of the operating system's memory handling, and of course it doesn't apply to other implementations like Jython, IronPython, PyPy and Nuitka.

So do yourself a favour and get out of the habit of accumulating strings in a for loop using + since it will bite you one day. (Adding one or two strings is fine.)
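
The usual alternative is to collect the pieces in a list and join them once at the end. A minimal sketch, with `something` and the upper() call standing in for whatever you are actually looping over and doing:

```python
something = "abc"  # placeholder input, just for illustration

pieces = []
for c in something:
    pieces.append(c.upper())  # any per-character work goes here
zark = "".join(pieces)

# The same thing as a single expression:
zark_again = "".join(c.upper() for c in something)

print(zark)
```

join() sees all the pieces at once, so it can build the result in one go without relying on any implementation-specific optimization.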


Problem number three: you generate characters using this:

unichr(ord(x)-45)

but that gives you a negative number if ord(x) is less than 45, which gives you exactly the result you see:

py> unichr(-1)
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)


(By the way, you're very naughty. The code you show *cannot possibly generate the error you claim it generates*. Bad Jim, no biscuit!)


I don't understand what the ord(x)-45 is intended to do. The effect is to give the 45th previous character, e.g. the 45th character before 'n' is 'A'. But characters below chr(45) don't have anything 45 characters previous, so you need to rethink what you are trying to do.
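
If shifting characters really is what you want, one option is to check the result before handing it to unichr(). This is only a sketch: shift_back is a hypothetical helper of my own, and returning the character unchanged when there is nothing 45 places back is an arbitrary choice, not necessarily what you need:

```python
try:
    unichr          # Python 2
except NameError:
    unichr = chr    # Python 3: chr() already returns Unicode characters

def shift_back(ch, offset=45):
    """Return the character `offset` code points before ch,
    or ch itself if there is no such character."""
    n = ord(ch) - offset
    if n < 0:
        return ch
    return unichr(n)

print(shift_back(u"n"))   # A
print(ord(u"\r") - 45)    # -32, which unichr() would reject
```

A character with ord 13, like the one you mention, lands on -32 after the subtraction, so it has to be handled specially one way or another.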


-- 
Steven

