[Tutor] encoding question

Steven D'Aprano steve at pearwood.info
Sun Jan 5 00:52:44 CET 2014


On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in 
> the following context would be appreciated.

Python gives you lots of useful information when errors occur, but 
unfortunately your code throws that information away and replaces it 
with a totally useless message:

>             try:
>                 country = item[9:].decode('utf-8')
>             except:
>                 print("Exception raised.")

Oh great. An exception was raised. What sort of exception? What error 
message did it have? Why did it happen? Nobody knows, because you throw 
it away.

Never, never, never do this. If you don't understand an exception, you 
have no business covering it up and hiding that it took place. Never use 
a bare try...except. Always catch the *smallest* number of specific 
exception types that makes sense. Better still is to avoid catching 
exceptions at all: an exception (usually) means something has gone 
wrong. You should aim to fix the problem *before* it blows up, not after.
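
As a rough sketch of what catching only the specific exception might 
look like (the helper name decode_field and the sample bytes are made up 
for illustration, they are not from your script):

```python
def decode_field(raw, encoding='utf-8'):
    """Decode raw bytes, reporting exactly what went wrong on failure.

    Hypothetical helper, for illustration only.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # Catch only the error we expect, and keep its details
        # instead of replacing them with a generic message.
        print("Could not decode %r as %s: %s" % (raw, encoding, err))
        # Re-raise so the failure is not silently hidden.
        raise


print(decode_field(b'COLOMBIA'))
```

Any other kind of exception passes straight through with its full 
traceback, which is usually what you want while debugging.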

I'm reminded of a quote:

"I find it amusing when novice programmers believe their main job is
preventing programs from crashing. ... More experienced programmers
realize that correct code is great, code that crashes could use
improvement, but incorrect code that doesn't crash is a horrible
nightmare." -- Chris Smith

Your code is incorrect: it does the wrong thing, but it doesn't crash. 
It just covers up the fact that an exception occurred.


> The output I get on an Ubuntu 12.4LTS system is as follows:
> alex at x301:~/Python/Parse$ ./IP_info.py3
> Exception raised.
>     IP address is 201.234.178.62:
>         Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
>         Lat/Long: 10.4/-75.2833
> 
> 
> I would have thought that utf-8 could handle the 'a-acute'.

Of course it can:

py> 'Bogotá'.encode('utf-8')
b'Bogot\xc3\xa1'

py> b'Bogot\xc3\xa1'.decode('utf-8')
'Bogotá'


But you don't have UTF-8. You have something else, and trying to decode 
it using UTF-8 fails.

py> b'Bogot\xe1'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: unexpected end of data
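
If you do catch the exception, the exception object itself tells you 
which byte was rejected and why. The sketch below shows that, and then 
tries one guess at the real encoding. The guess is only an assumption on 
my part: Latin-1 is one of several legacy encodings in which the byte 
0xE1 means 'á', and nothing in your output proves which one the data 
actually uses.

```python
data = b'Bogot\xe1'

try:
    data.decode('utf-8')
except UnicodeDecodeError as err:
    # UnicodeDecodeError records the codec, the offending slice
    # of the input, and the reason for the failure.
    bad_byte = data[err.start:err.end]
    print(err.encoding, err.start, bad_byte, err.reason)

# A guess, not a diagnosis: in Latin-1, byte 0xE1 is 'á'.
guess = data.decode('latin-1')
print(guess)
```

Latin-1 maps every possible byte to some character, so decoding with it 
never fails. That means a successful decode here is not evidence that 
the guess is right; you still have to look at the result and see 
whether it makes sense.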


More to follow...




-- 
Steven

