Unicode chr(150) en dash

J. Clifford Dyer jcd at sdf.lonestar.org
Fri Apr 18 13:36:00 CEST 2008


On Fri, 2008-04-18 at 07:27 -0400, J. Clifford Dyer wrote:
> On Fri, 2008-04-18 at 10:28 +0100, marexposed at googlemail.com wrote:
> > On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> > hdante <hdante at gmail.com> wrote:
> > 
> > >  Don't use old 8-bit encodings. Use UTF-8.
> > 
> > Yes, I'll try. But it is a problem when I only want to read, not when I'm trying to write or create the content.
> > I suppose Microsoft's commercial success is to blame. They won't adhere to standards if that doesn't make sense for their business.
> > 
> > I'll change the approach and try filtering the contents with htmllib, mapping those troublesome characters on my own.
> > Anyway this has been a very instructive dive into unicode for me, I've got things cleared up now.
> > 
> > Thanks to everyone for the great help.
> > 
> 
> There are a number of byte values (150 being one of them) that cp1252
> uses for printable characters, but which ISO-8859-1 reserves for
> control characters.  Those bytes will pretty much never appear in
> ISO-8859-1 documents.  If you're expecting documents of both types
> coming in, test for the presence of those bytes, and assume cp1252
> for the documents that contain them.
> 
> Something like:
> 
> # control_chars: the byte values 0x80-0x9F (128-159), as one-character strings
> for c in control_chars:
>     if c in encoded_text:
>         unicode_text = encoded_text.decode('cp1252')
>         break
> else:
>     unicode_text = encoded_text.decode('latin-1')
> 
> Note that the else matches the for, not the if.
> 
> You can figure out the characters to match on by looking at the
> Wikipedia pages for the encodings.
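
For reference, the bytes in question are 0x80 through 0x9F.  Here is a
minimal sketch of the check above in Python 3 syntax, where the encoded
text is a bytes object.  The names C1_BYTES and decode_legacy are made
up for illustration, and the sketch assumes the input really is in one
of the two encodings:

```python
# Bytes 0x80-0x9F are C1 control characters in ISO-8859-1 (latin-1),
# but mostly printable characters in cp1252 (0x96 == 150 is the en dash).
C1_BYTES = frozenset(range(0x80, 0xA0))


def decode_legacy(encoded_text):
    """Decode bytes as cp1252 if any C1-range byte is present, else latin-1.

    Hypothetical helper; assumes the input is in one of those two encodings.
    """
    if any(byte in C1_BYTES for byte in encoded_text):
        # errors='replace' because Python's cp1252 codec leaves a few
        # bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined.
        return encoded_text.decode('cp1252', errors='replace')
    return encoded_text.decode('latin-1')
```

With this, decode_legacy(b'\x96') should come back as the en dash
(U+2013), while bytes without anything in the C1 range go through
latin-1 untouched.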

One warning: This works if you know all your documents are in one of
those two encodings, but you could mangle documents in other encodings,
like UTF-8, this way.  Fortunately UTF-8 is a pretty strict encoding, so
text in some other 8-bit encoding rarely passes as valid UTF-8.  You can
usually test whether a document is decent UTF-8 just by wrapping the
decode in a try/except block:

try:
    unicode_text = encoded_text.decode('utf-8')
except UnicodeDecodeError:
    # not valid UTF-8: do the stuff above
    pass
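
Putting the two tests together might look something like the sketch
below (Python 3 syntax, bytes in, str out).  The function name
guess_decode and the idea of returning the encoding name alongside the
text are my own assumptions, not anything standard:

```python
def guess_decode(encoded_text):
    """Return (text, encoding_name), trying UTF-8, then cp1252, then latin-1.

    Hypothetical helper; a heuristic, not a real encoding detector.
    """
    try:
        return encoded_text.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass  # not valid UTF-8, fall through to the 8-bit encodings

    # Bytes 0x80-0x9F are control characters in latin-1, so their
    # presence suggests cp1252 instead.
    if any(0x80 <= byte <= 0x9F for byte in encoded_text):
        try:
            return encoded_text.decode('cp1252'), 'cp1252'
        except UnicodeDecodeError:
            pass  # hit one of the few bytes cp1252 leaves undefined

    # latin-1 maps every byte to a character, so this never fails.
    return encoded_text.decode('latin-1'), 'latin-1'
```

Returning the guessed encoding name makes it easy to show the guess to
the user when it matters.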

None of these methods is perfect, but then again, if text encoding
detection were a perfect science, Python could just handle it on its
own.

If in doubt, prompt the user for confirmation.

Maybe others can share better "best practices."

Cheers,
Cliff



