Unicode chr(150) en dash
J. Clifford Dyer
jcd at sdf.lonestar.org
Fri Apr 18 13:36:00 CEST 2008
On Fri, 2008-04-18 at 07:27 -0400, J. Clifford Dyer wrote:
> On Fri, 2008-04-18 at 10:28 +0100, marexposed at googlemail.com wrote:
> > On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> > hdante <hdante at gmail.com> wrote:
> > > Don't use old 8-bit encodings. Use UTF-8.
> > Yes, I'll try. But it is a problem when I only want to read; I'm not trying to write or create the content.
> > To blame, I suppose, is Microsoft's commercial success. They won't adhere to standards if that doesn't make sense for their business.
> > I'll change the approach trying to filter the contents with htmllib and mapping on my own those troubling characters.
> > Anyway this has been a very instructive dive into unicode for me, I've got things cleared up now.
> > Thanks to everyone for the great help.
> There are a number of code points (150 being one of them) that are used
> in cp1252, which are reserved for control characters in ISO-8859-1.
> Those characters will pretty much never be used in ISO-8859-1 documents.
> If you're expecting documents of both types coming in, test for the
> presence of those characters, and assume cp1252 for those documents.
> Something like:
> for c in control_chars:
>     if c in encoded_text:
>         unicode_text = encoded_text.decode('cp1252')
>         break
> else:
>     unicode_text = encoded_text.decode('latin-1')
> Note that the else matches the for, not the if.
> You can figure out the characters to match on by looking at the
> wikipedia pages for the encodings.
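To make the quoted suggestion a bit more concrete, here is a minimal sketch in present-day Python, assuming encoded_text is a bytes object. The names CONTROL_BYTES and decode_cp1252_or_latin1 are made up for illustration. The range 0x80-0x9F is where ISO-8859-1 has its C1 control characters and cp1252 has printable ones such as the en dash at 150 (0x96).

```python
# Bytes 0x80-0x9F are C1 control characters in ISO-8859-1, but mostly
# printable characters (en dash, curly quotes, ...) in cp1252.
# CONTROL_BYTES is a hypothetical name chosen for this sketch.
CONTROL_BYTES = frozenset(range(0x80, 0xA0))


def decode_cp1252_or_latin1(encoded_text):
    """Guess between cp1252 and latin-1 for a bytes object."""
    if any(b in CONTROL_BYTES for b in encoded_text):
        # Python's cp1252 codec leaves a few bytes in this range
        # undefined, so replace those rather than raising an error.
        return encoded_text.decode("cp1252", errors="replace")
    # No bytes in the C1 range: both encodings agree, so latin-1 is safe.
    return encoded_text.decode("latin-1")
```

With this, decode_cp1252_or_latin1(b"a \x96 b") gives the text with a real en dash (U+2013), while input without any byte in that range goes through latin-1 unchanged.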
One warning: This works if you know all your documents are in one of
those two encodings, but you could break other encodings, like UTF-8
this way. Fortunately UTF-8 is a pretty strict encoding, so text in some
other 8-bit encoding rarely decodes as valid UTF-8. You can usually test
whether a document is valid UTF-8 just by wrapping the decode in a
try/except block:
try:
    unicode_text = encoded_text.decode('utf-8')
except UnicodeDecodeError:  # decoding failures raise UnicodeDecodeError
    # not valid UTF-8: do the cp1252/latin-1 stuff above
    pass
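Putting the two ideas together, a rough sketch might look like the following. It assumes the input is a bytes object and that only UTF-8, cp1252, and latin-1 are in play; guess_decode is a hypothetical name, and returning the encoding name alongside the text is just one possible design.

```python
def guess_decode(encoded_text):
    """Return (text, encoding_name) using a UTF-8-first heuristic.

    A sketch only: it assumes the input is UTF-8, cp1252, or latin-1.
    """
    try:
        # UTF-8 is strict, so most non-UTF-8 8-bit text fails here.
        return encoded_text.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Bytes 0x80-0x9F are control characters in ISO-8859-1 but printable
    # in cp1252, so their presence suggests cp1252.
    if any(0x80 <= b <= 0x9F for b in encoded_text):
        # A few bytes in that range are undefined in Python's cp1252
        # codec, hence errors="replace".
        return encoded_text.decode("cp1252", errors="replace"), "cp1252"

    # latin-1 maps every byte to a code point, so this cannot fail.
    return encoded_text.decode("latin-1"), "latin-1"
```

Returning the guessed encoding makes it easier to show the guess to the user and ask for confirmation when in doubt.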
None of these are perfect methods, but then again, if text encoding
detection were a perfect science, Python could just handle it on its
own.
If in doubt, prompt the user for confirmation.
Maybe others can share better "best practices."