[Tutor] Is there a package to "un-mangle" characters?

Albert-Jan Roskam fomcl at yahoo.com
Fri Nov 22 14:58:38 CET 2013


<snip>
 > Today I had a csv file in utf-8 encoding, but part of the accented
 > characters were mangled. The data were scraped from a website and it
 > turned out that at least some of the data were mangled on the website
 > already. Bits of the text were actually cp1252 (or cp850), I think,
 > even though the webpage was in utf-8. Is there any package that helps
 > to correct such issues?
 
 The links in the Wikipedia article may help:
 
     http://en.wikipedia.org/wiki/Charset_detection
 
 International Components for Unicode (ICU) does charset
 detection:
 
     http://userguide.icu-project.org/conversion/detection
 
 Python wrapper:
 
     http://pypi.python.org/pypi/PyICU
     http://packages.debian.org/wheezy/python-pyicu
 
 Example:
 
     import icu
 
     russian_text = u'Здесь некий текст на русском языке.'
     encoded_text = russian_text.encode('windows-1251')
 
     cd = icu.CharsetDetector()
     cd.setText(encoded_text)
     match = cd.detect()
     matches = cd.detectAll()
 
     >>> match.getName()
     'windows-1251'
     >>> match.getConfidence()
     33
     >>> match.getLanguage()
     'ru'
 
     >>> [m.getName() for m in matches]
     ['windows-1251', 'ISO-8859-6', 'ISO-8859-8-I', 'ISO-8859-8']
     >>> [m.getConfidence() for m in matches]
     [33, 13, 8, 8]


====> Hi Mark, Eryksun,

Thank you very much for your suggestions. Mark (sorry if I repeat myself, but I think my earlier reply got lost), charset seems worth looking into. In hindsight I knew about chardet (with a 'd'); I had just forgotten about it. Re: your other remark: I think encoding issues are such a common phenomenon that one can never be too inexperienced to start reading about them.
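
For what it's worth, this is roughly how I picture using chardet on one of the suspect values (the byte string below is just an invented example, and on input this short the reported confidence will presumably be modest):

    import chardet

    # invented example: an accented value stored as cp1252 bytes
    raw = u'El\xe9onorestraat 12'.encode('cp1252')

    guess = chardet.detect(raw)
    # guess is a dict like {'encoding': ..., 'confidence': ...}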

The ICU module seems very cool too. I like the fact that you can even calculate a level of confidence. I wonder how it performs in my language (Dutch), where accented characters are not very common. 

Most of it is ASCII (the printable characters below 128) and those are (I think) useless for trying to figure out the encoding. After all, utf-8, latin-1, cp1252 and iso-8859-1 are all supersets of ASCII. But in practice I treat those last three encodings as the same anyway (or was there some sneaky difference with fancy quotes?). 
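
As far as I can tell the sneaky difference with fancy quotes is real: cp1252 assigns printable characters (the curly quotes and the euro sign, among others) to the 0x80-0x9F range that latin-1/iso-8859-1 reserves for control codes. A quick Python 2 sketch:

    >>> u'\u201cfancy\u201d'.encode('cp1252')
    '\x93fancy\x94'
    >>> u'\u201cfancy\u201d'.encode('latin-1')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)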

I did a quick check and 0.2% of the street names in my data (about 300K records) contain one or more accented characters (ordinals > 127). Since only some of the records are mangled, I may need to run getName() on every record that has accented characters in it.
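
Roughly what I have in mind, as a sketch only (street_names is a made-up stand-in for the column of raw byte strings read from the csv, and on strings this short the detector's confidence will presumably be low):

    import icu

    def guess_charset(byte_string):
        """Return ICU's best charset guess for a single value."""
        cd = icu.CharsetDetector()
        cd.setText(byte_string)
        return cd.detect().getName()

    # only bother with records that contain non-ASCII bytes
    suspect = [s for s in street_names if any(ord(c) > 127 for c in s)]
    guesses = [(s, guess_charset(s)) for s in suspect]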
 
Regards,
Albert-Jan
 

