[Tutor] encoding question

Sun Jan 5 01:44:45 CET 2014

Following my previous email...

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in 
> the following context would be appreciated.  The problem arose because 
> 'Bogota' is spelt with an acute accent on the 'a'.

Eryksun has given the right answer for how to extract the encoding from 
the webpage's headers. That will help 9 times out of 10. But 
unfortunately sometimes webpages will lack an encoding header, or they 
will lie, or the text will be invalid for that encoding. What to do 
then?

Let's start by factoring out the repeated code in your giant for-loop 
into something more manageable and maintainable:

>     sp = response.splitlines()
>     country = city = lat = lon = ip = ''
>     for item in sp:
>         if item.startswith(b"Country:"):
>             try:
>                 country = item[9:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 country = item[9:]
>         elif item.startswith(b"City:"):
>             try:
>                 city = item[6:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 city = item[6:]

and so on, becomes:

    encoding = ...  # as per Eryksun's email
    sp = response.splitlines()
    country = city = lat = lon = ip = ''
    for item in sp:
        key, value = item.split(b':', 1)  # item is bytes, so split on a bytes colon
        key = key.decode(encoding).strip()
        value = value.decode(encoding).strip()
        if key == 'Country':
            country = value
        elif key == 'City':
            city = value
        elif key == 'Latitude':
            lat = value
        elif key == "Longitude":
            lon = value
        elif key == 'IP':
            ip = value
        else:
            raise ValueError('unknown key "%s" found' % key)
    return {"Country" : country,
            "City" : city,
            "Lat" : lat,
            "Long" : lon,
            "IP" : ip
            }

But we can do better than that!

    encoding = ...  # as per Eryksun's email
    sp = response.splitlines()
    record = {"Country": None, "City": None, "Latitude": None, 
              "Longitude": None, "IP": None}
    for item in sp:
        key, value = item.split(b':', 1)  # item is bytes, so split on a bytes colon
        key = key.decode(encoding).strip()
        value = value.decode(encoding).strip()
        if key in record:
            record[key] = value
        else:
            raise ValueError('unknown key "%s" found' % key)
    # Report every key that never received a value, not just the first.
    missing = [key for key, value in record.items() if value is None]
    if missing:
        raise ValueError('missing key in record: %s' % ', '.join(missing))
    return record

This simplifies the code a lot, and adds some error-handling. It may be 
appropriate for your application to handle missing keys by using some 
default value, such as an empty string, or some other value that cannot 
be mistaken for an actual value, say "*missing*". But since I don't know 
your application's needs, I'm going to leave that up to you. Better to 
start strict and loosen up later, than start too loose and never realise 
that errors are occurring.
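
If you do decide to loosen up, here is a rough sketch of the lenient 
variant wrapped in a function. The names parse_record and FIELDS, and 
the "*missing*" default, are my own placeholders rather than anything 
from your code, and I'm assuming response is the raw bytes that came 
back from the server:

```python
# Hypothetical helper: names and the default value are illustrative only.
FIELDS = ("Country", "City", "Latitude", "Longitude", "IP")

def parse_record(response, encoding="utf-8", missing="*missing*"):
    """Parse b'Key: value' lines into a dict of str -> str."""
    # Start every expected field at the placeholder value.
    record = dict.fromkeys(FIELDS, missing)
    for item in response.splitlines():
        if not item.strip():
            continue  # skip blank lines
        # A line with no colon raises ValueError here (failed unpacking).
        key, value = item.split(b":", 1)
        key = key.decode(encoding).strip()
        if key not in record:
            raise ValueError('unknown key "%s" found' % key)
        record[key] = value.decode(encoding).strip()
    return record
```

Unknown keys are still an error in this version; only missing keys get 
the placeholder.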

I've also changed the keys "Lat" and "Long" to "Latitude" and 
"Longitude". If that's a problem, it's easy to fix. Just before 
returning the record, change the key:

    record['Lat'] = record.pop('Latitude')

and similar for Longitude.

Now that the code is simpler to read and maintain, we can start dealing 
with the risk that the encoding will be missing or wrong.

A missing encoding is easy to handle: just pick a default encoding, and 
hope it is the right one. UTF-8 is a good choice. (It's the only 
*correct* choice, everybody should be using UTF-8, but alas they often 
don't.) So modify Eryksun's code snippet to return 'UTF-8' if the header 
is missing, and you should be good.
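
Eryksun's snippet isn't reproduced in this email, so the following is 
only a sketch of the idea, not his code. It assumes you have the raw 
Content-Type header as a string (or None if the server sent none); the 
function name charset_or_default is made up for illustration:

```python
from email.message import Message

def charset_or_default(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header, else default."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset returns failobj when the header or its
    # charset parameter is absent; otherwise the lower-cased charset.
    return msg.get_content_charset(failobj=default)
```

With urllib in Python 3 the response headers object is itself a 
Message, so I'd expect response.headers.get_content_charset(failobj='utf-8') 
to do the same job directly, but check that against what your code 
actually receives.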

How to deal with incorrect encodings? That can happen when the website 
creator *thinks* they are using a certain encoding, but somehow invalid 
bytes for that encoding creep into the data. That gives us a few 
different strategies:

(1) The third-party "chardet" module can analyse text and try to guess 
what encoding it *actually* is, rather than what encoding it claims to 
be. This is what Firefox and other web browsers do, because there are an 
awful lot of shitty websites out there. But it's not foolproof, so even 
if it guesses correctly, you still have to deal with invalid data.

(2) By default, the decode method will raise an exception. You can catch 
the exception and try again with a different encoding:

    for codec in (encoding, 'utf-8', 'latin-1'):
        try:
            key = key.decode(codec)
        except UnicodeDecodeError:
            pass
        else:
            break

Latin-1 should be last, because it has the nice property that it will 
*always* succeed. That doesn't mean it will give you the right 
characters, as intended by the person who wrote the website, just that 
it will always give you *some* characters. They may be completely wrong, 
in other words "mojibake", but they'll be something.

An example of mojibake:

py> b = 'Bogotá'.encode('utf-8')
py> b.decode('latin-1')
'BogotÃ¡'
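
The try-each-codec loop from strategy (2) can be wrapped in a small 
helper so that it applies to keys and values alike. This is just a 
sketch; decode_with_fallback is a name I've invented, and it returns 
the codec that worked so you can log when the declared encoding was 
wrong. A guess from chardet could be passed in as the declared 
encoding if you go that way.

```python
def decode_with_fallback(data, declared=None):
    """Decode bytes, trying the declared codec, then UTF-8, then Latin-1.

    Returns a (text, codec_used) tuple.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    # Latin-1 goes last because it never fails; it may yield mojibake.
    candidates.extend(["utf-8", "latin-1"])
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except (UnicodeDecodeError, LookupError):
            # LookupError covers a declared codec name Python doesn't know.
            continue
```

Because Latin-1 maps every byte to a character, the loop always 
returns something; whether it is the *right* something is another 
matter.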

Perhaps a better way is to use the decode/encode error handler. Instead 
of just calling the decode method, you can specify what to do when an 
error occurs: raise an exception, ignore the bad bytes, or replace them 
with some sort of placeholder. We can see the difference here:

py> b = 'Bogotá'.encode('latin-1')
py> print(b)
b'Bogot\xe1'
py> b.decode('utf-8', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: 
unexpected end of data
py> b.decode('utf-8', 'ignore')
'Bogot'
py> b.decode('utf-8', 'replace')
'Bogot�'

My suggestion is to use the 'replace' error handler.
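
As a minimal sketch of how that might look in practice (the helper name 
decode_lenient is hypothetical), you can decode with 'replace' and 
count the placeholder characters so that bad data doesn't pass 
silently:

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the character 'replace' substitutes

def decode_lenient(data, encoding="utf-8"):
    """Decode bytes with the 'replace' handler.

    Returns (text, n) where n is the number of U+FFFD characters in the
    result. Note this also counts any U+FFFD that was genuinely present
    in the source text, so treat n as a hint rather than an exact count
    of decoding errors.
    """
    text = data.decode(encoding, errors="replace")
    return text, text.count(REPLACEMENT)
```

A non-zero count is a reasonable trigger for logging a warning with 
the offending URL.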

Armed with this, you should be able to write good solid code that can 
handle most encoding-related errors.

-- 
Steven