[Tutor] encoding question
Steven D'Aprano
steve at pearwood.info
Sun Jan 5 01:44:45 CET 2014
Following my previous email...
On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in
> the following context would be appreciated. The problem arose because
> 'Bogota' is spelt with an acute accent on the 'a'.
Eryksun has given the right answer for how to extract the encoding from
the webpage's headers. That will help 9 times out of 10. But
unfortunately sometimes webpages will lack an encoding header, or they
will lie, or the text will be invalid for that encoding. What to do
then?
Let's start by factoring out the repeated code in your giant for-loop
into something more manageable and maintainable:
> sp = response.splitlines()
> country = city = lat = lon = ip = ''
> for item in sp:
>     if item.startswith(b"Country:"):
>         try:
>             country = item[9:].decode('utf-8')
>         except:
>             print("Exception raised.")
>             country = item[9:]
>     elif item.startswith(b"City:"):
>         try:
>             city = item[6:].decode('utf-8')
>         except:
>             print("Exception raised.")
>             city = item[6:]
and so on, becomes:
encoding = ...  # as per Eryksun's email
sp = response.splitlines()
country = city = lat = lon = ip = ''
for item in sp:
    key, value = item.split(b':', 1)
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key == 'Country':
        country = value
    elif key == 'City':
        city = value
    elif key == 'Latitude':
        lat = value
    elif key == 'Longitude':
        lon = value
    elif key == 'IP':
        ip = value
    else:
        raise ValueError('unknown key "%s" found' % key)
return {"Country": country,
        "City": city,
        "Lat": lat,
        "Long": lon,
        "IP": ip,
        }
But we can do better than that!
encoding = ...  # as per Eryksun's email
sp = response.splitlines()
record = {"Country": None, "City": None, "Latitude": None,
          "Longitude": None, "IP": None}
for item in sp:
    key, value = item.split(b':', 1)
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key in record:
        record[key] = value
    else:
        raise ValueError('unknown key "%s" found' % key)
if None in record.values():
    for key, value in record.items():
        if value is None:
            break
    raise ValueError('missing key in record: %s' % key)
return record
This simplifies the code a lot, and adds some error-handling. It may be
appropriate for your application to handle missing keys by using some
default value, such as an empty string, or some other value that cannot
be mistaken for an actual value, say "*missing*". But since I don't know
your application's needs, I'm going to leave that up to you. Better to
start strict and loosen up later, than start too loose and never realise
that errors are occurring.
I've also changed the keys "Lat" and "Lon" to "Latitude" and
"Longitude". If that's a problem, it's easy to fix. Just before
returning the record, change the key:
record['Lat'] = record.pop('Latitude')
and similar for Longitude.
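To see the whole thing in one piece, here is a runnable sketch of the record-dict version as a function. The sample response bytes and the hard-coded 'utf-8' encoding are made up for demonstration; in real code the encoding comes from the headers:

```python
def parse_response(response, encoding='utf-8'):
    # Parse a bytes response of "Key: value" lines into a dict,
    # raising ValueError on unknown or missing keys.
    record = {"Country": None, "City": None, "Latitude": None,
              "Longitude": None, "IP": None}
    for item in response.splitlines():
        key, value = item.split(b':', 1)
        key = key.decode(encoding).strip()
        value = value.decode(encoding).strip()
        if key in record:
            record[key] = value
        else:
            raise ValueError('unknown key "%s" found' % key)
    for key, value in record.items():
        if value is None:
            raise ValueError('missing key in record: %s' % key)
    return record

# A made-up response, just for demonstration:
response = ('Country: Colombia\n'
            'City: Bogotá\n'
            'Latitude: 4.61\n'
            'Longitude: -74.08\n'
            'IP: 203.0.113.7\n').encode('utf-8')
print(parse_response(response))
```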
Now that the code is simpler to read and maintain, we can start dealing
with the risk that the encoding will be missing or wrong.
A missing encoding is easy to handle: just pick a default encoding, and
hope it is the right one. UTF-8 is a good choice. (It's the only
*correct* choice, everybody should be using UTF-8, but alas they often
don't.) So modify Eryksun's code snippet to return 'UTF-8' if the header
is missing, and you should be good.
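For instance, a hand-rolled sketch of pulling the charset out of a Content-Type header value, defaulting to UTF-8 when it's absent, might look like this (the function name is mine, and the exact way you fetch the header depends on which HTTP library you're using):

```python
def charset_from_content_type(content_type, default='utf-8'):
    # Pull the charset out of a Content-Type value such as
    # "text/html; charset=ISO-8859-1", falling back to the
    # default when the header omits it.
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.lower() == 'charset' and value:
            return value.strip('"\'')
    return default

print(charset_from_content_type('text/html; charset=ISO-8859-1'))
print(charset_from_content_type('text/html'))
```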
How to deal with incorrect encodings? That can happen when the website
creator *thinks* they are using a certain encoding, but somehow invalid
bytes for that encoding creep into the data. That gives us a few
different strategies:
(1) The third-party "chardet" module can analyse text and try to guess
what encoding it *actually* is, rather than what encoding it claims to
be. This is what Firefox and other web browsers do, because there are an
awful lot of shitty websites out there. But it's not foolproof, so even
if it guesses correctly, you still have to deal with invalid data.
(2) By default, the decode method will raise an exception. You can catch
the exception and try again with a different encoding:
for codec in (encoding, 'utf-8', 'latin-1'):
    try:
        key = key.decode(codec)
    except UnicodeDecodeError:
        pass
    else:
        break
Latin-1 should be last, because it has the nice property that it will
*always* succeed. That doesn't mean it will give you the right
characters, as intended by the person who wrote the website, just that
it will always give you *some* characters. They may be completely wrong,
in other words "mojibake", but they'll be something.
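Wrapped up as a small helper (the name is mine, not from the original code), that fallback loop might look like:

```python
def decode_with_fallback(data, encoding):
    # Try the declared encoding first, then UTF-8, then Latin-1.
    # Latin-1 maps every byte to a character, so it always
    # succeeds, even if the result is mojibake.
    for codec in (encoding, 'utf-8', 'latin-1'):
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            pass

# The declared encoding is wrong here, so the UTF-8 fallback
# kicks in and recovers the intended text:
print(decode_with_fallback('Bogotá'.encode('utf-8'), 'ascii'))
```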
An example of mojibake:
py> b = 'Bogotá'.encode('utf-8')
py> b.decode('latin-1')
'BogotÃ¡'
Perhaps a better way is to use the decode/encode error handler. Instead
of just calling the decode method, you can specify what to do when an
error occurs: raise an exception, ignore the bad bytes, or replace them
with some sort of placeholder. We can see the difference here:
py> b = 'Bogotá'.encode('latin-1')
py> print(b)
b'Bogot\xe1'
py> b.decode('utf-8', 'strict')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5:
unexpected end of data
py> b.decode('utf-8', 'ignore')
'Bogot'
py> b.decode('utf-8', 'replace')
'Bogot�'
My suggestion is to use the 'replace' error handler.
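Plugged into the decoding step from earlier, that comes out as a one-liner (shown here with a deliberately broken byte string for demonstration):

```python
def decode_field(data, encoding):
    # Decode bytes, substituting U+FFFD (the replacement
    # character) for any invalid byte instead of raising.
    return data.decode(encoding, 'replace')

# \xe1 is not valid UTF-8 on its own, so it becomes the
# replacement character rather than crashing:
print(decode_field(b'Bogot\xe1', 'utf-8'))
```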
Armed with this, you should be able to write good solid code that can
handle most encoding-related errors.
--
Steven