REQ : encoding windows cp1252 => iso latin 1

Brian Quinlan brian at sweetapp.com
Tue Feb 5 20:11:13 CET 2002


Gillou wrote:
> My customers make copy/paste from M$ word docs to forms translated to XML
> (expecting ISO latin 1 charset).
> My XML parser (pyexpat) does not accept cp1252 character, and I'm looking
> for a function that can translate extra cp1252 characters to the closest
> ISO latin 1 encoding.

I am such a nice guy. Here is a completely untested solution:

"""
80 20AC EURO SIGN 
81  UNDEFINED 
82 201A SINGLE LOW-9 QUOTATION MARK 
83 0192 LATIN SMALL LETTER F WITH HOOK 
84 201E DOUBLE LOW-9 QUOTATION MARK 
85 2026 HORIZONTAL ELLIPSIS 
86 2020 DAGGER 
87 2021 DOUBLE DAGGER 
88 02C6 MODIFIER LETTER CIRCUMFLEX ACCENT 
89 2030 PER MILLE SIGN 
8A 0160 LATIN CAPITAL LETTER S WITH CARON 
8B 2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK 
8C 0152 LATIN CAPITAL LIGATURE OE 
8D  UNDEFINED 
8E 017D LATIN CAPITAL LETTER Z WITH CARON 
8F  UNDEFINED 
90  UNDEFINED 
91 2018 LEFT SINGLE QUOTATION MARK 
92 2019 RIGHT SINGLE QUOTATION MARK 
93 201C LEFT DOUBLE QUOTATION MARK 
94 201D RIGHT DOUBLE QUOTATION MARK 
95 2022 BULLET 
96 2013 EN DASH 
97 2014 EM DASH 
98 02DC SMALL TILDE 
99 2122 TRADE MARK SIGN 
9A 0161 LATIN SMALL LETTER S WITH CARON 
9B 203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK 
9C 0153 LATIN SMALL LIGATURE OE 
9D  UNDEFINED 
9E 017E LATIN SMALL LETTER Z WITH CARON 
9F 0178 LATIN CAPITAL LETTER Y WITH DIAERESIS 
"""

replacement = {
    # You pick the "closest characters"
    0x80: "e",
    0x82: ",",
    # ...
}

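def cp1252_to_8859_1(text):
    # 'text' holds the cp1252 data, one character per byte
    # (named 'text' so it does not shadow the built-in str)
    out_str = ''
    for ch in text:
        code = ord(ch)
        # The only different characters are from 0x80-0x9f
        if 0x80 <= code <= 0x9f:
            # Or throw an exception and ask them not to use
            # dumb characters. As written, a code missing from
            # 'replacement' raises KeyError.
            out_str += replacement[code]
        else:
            out_str += ch

    return out_str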
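
For anyone on Python 3, where str is Unicode and the codecs know about
cp1252, the same idea can be sketched with the standard library instead of
a hand-written loop. This is a different technique from the function above,
not a drop-in copy of it: decode as cp1252, swap the troublesome characters
for Latin-1 stand-ins, then encode as latin-1. The name
cp1252_bytes_to_latin1 and every entry in CP1252_FALLBACKS are my own
illustrative choices, not anything from the table above; pick your own
"closest characters".

```python
# Hypothetical stand-ins for a few cp1252-only characters.
# These choices are assumptions for illustration; extend or change them.
CP1252_FALLBACKS = {
    "\u20ac": "EUR",  # EURO SIGN
    "\u201a": ",",    # SINGLE LOW-9 QUOTATION MARK
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
    "\u2013": "-",    # EN DASH
    "\u2014": "-",    # EM DASH
    "\u2022": "*",    # BULLET
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(CP1252_FALLBACKS)


def cp1252_bytes_to_latin1(data: bytes) -> bytes:
    """Convert cp1252-encoded bytes to ISO 8859-1 bytes, lossily."""
    # errors="replace" maps the five undefined bytes to U+FFFD
    # instead of raising UnicodeDecodeError.
    text = data.decode("cp1252", errors="replace")
    text = text.translate(_TABLE)
    # Anything still outside Latin-1 (unmapped characters, U+FFFD)
    # is encoded as "?".
    return text.encode("latin-1", errors="replace")
```

Characters that already exist in Latin-1 (0xA0-0xFF) pass through
unchanged. If silent "?" substitution is not acceptable, dropping
errors="replace" on either call makes the conversion raise instead.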






