[Tutor] Is there a package to "un-mangle" characters?
Albert-Jan Roskam
fomcl at yahoo.com
Fri Nov 22 14:58:38 CET 2013
<snip>
> Today I had a csv file in utf-8 encoding, but part of the accented
> characters were mangled. The data were scraped from a website and it
> turned out that at least some of the data were mangled on the website
> already. Bits of the text were actually cp1252 (or cp850), I think,
> even though the webpage was in utf-8. Is there any package that helps
> to correct such issues?
The links in the Wikipedia article may help:
http://en.wikipedia.org/wiki/Charset_detection
International Components for Unicode (ICU) does charset detection:
http://userguide.icu-project.org/conversion/detection
Python wrapper:
http://pypi.python.org/pypi/PyICU
http://packages.debian.org/wheezy/python-pyicu
Example:
import icu
russian_text = u'Здесь некий текст на русском языке.'
encoded_text = russian_text.encode('windows-1251')
cd = icu.CharsetDetector()
cd.setText(encoded_text)
match = cd.detect()
matches = cd.detectAll()
>>> match.getName()
'windows-1251'
>>> match.getConfidence()
33
>>> match.getLanguage()
'ru'
>>> [m.getName() for m in matches]
['windows-1251', 'ISO-8859-6', 'ISO-8859-8-I', 'ISO-8859-8']
>>> [m.getConfidence() for m in matches]
[33, 13, 8, 8]
====> Hi Mark, Eryksun,
Thank you very much for your suggestions. Mark, sorry if I repeat myself, but I think my earlier reply got lost: charset seems worth looking into. In hindsight I knew about chardet (with a 'd'), I just forgot about it. Re: your other remark: I think encoding issues are such a common phenomenon that one can never be too inexperienced to start reading about them.
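For illustration, a minimal chardet sketch (the street name is made up, and on input this short the guess can be unreliable; chardet.detect() returns a dict with its best guess and a confidence):

import chardet

sample = u'Wasstraße 1'.encode('cp1252')  # made-up bytes, not valid utf-8
guess = chardet.detect(sample)
print(guess['encoding'], guess['confidence'])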
The ICU module seems very cool too. I like the fact that you can even calculate a level of confidence. I wonder how it performs in my language (Dutch), where accented characters are not very common.
Most of it is ascii (printable characters below 128) and those are (I think) useless for figuring out the encoding. After all, utf-8, latin-1, cp1252 and iso-8859-1 are all supersets of ascii. But in practice I treat those last three encodings as the same anyway (or was there some sneaky difference with fancy quotes?).
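A quick check suggests there is a difference: cp1252 uses the 0x80-0x9F range for curly quotes, dashes and the euro sign, where latin-1/iso-8859-1 only have control characters. The bytes below are just an example:

data = b'\x93hoi\x94'          # 0x93/0x94 are cp1252's curly double quotes
print(data.decode('cp1252'))   # U+201C hoi U+201D
print(data.decode('latin-1'))  # the same bytes decode to C1 control characters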
I did a quick check and 0.2 % of the street names in my data (about 300K records) contain one or more accented characters (ordinals > 128). Since only part of the records are mangled, I may need to run getName() on every record that has accented characters in it.
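A rough sketch of that per-record check (the file name is hypothetical; the detector calls are the ones from the example above):

import icu

cd = icu.CharsetDetector()
with open('streets.csv', 'rb') as f:  # hypothetical file name
    for raw in f:
        if all(b < 128 for b in bytearray(raw)):
            continue  # pure ascii tells us nothing about the encoding
        cd.setText(raw)
        match = cd.detect()
        print(match.getName(), match.getConfidence())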
Regards,
Albert-Jan