[Python-checkins] r56602 - tracker/importer/fixsfmojibake.py

martin.v.loewis python-checkins at python.org
Sat Jul 28 12:22:29 CEST 2007


Author: martin.v.loewis
Date: Sat Jul 28 12:22:29 2007
New Revision: 56602

Added:
   tracker/importer/fixsfmojibake.py   (contents, props changed)
Log:
Add SF xml_export2 encoding fixing script.


Added: tracker/importer/fixsfmojibake.py
==============================================================================
--- (empty file)
+++ tracker/importer/fixsfmojibake.py	Sat Jul 28 12:22:29 2007
@@ -0,0 +1,49 @@
+#!/usr/bin/python
+# The data exported from SF are often incorrectly
+# encoded - two consecutive Unicode characters have to
+# be interpreted as the two bytes of a single UTF-8
+# character; it looks like they have UTF-8 in the database
+# but encode it as if it were Latin-1.
+# Unfortunately, this is not consistently so: for some
+# data, the intended encoding really is Latin-1.
+# This script tries to fix it by recoding everything
+# that looks like UTF-8 into the then-proper character
+# references.
+
+# The script assumes that the file encoding is actually
+# ASCII, and that non-ASCII characters are always encoded
+# as decimal character references.
+
+import sys, re
+
+expr = re.compile('(&#[0-9]+;)+')
+
+def recode(group):
+    assert group[:2] == '&#' and group[-1] == ';'
+    chars = group[2:-1].split(';&#')
+    chars = [unichr(int(c)) for c in chars]
+    chars = u''.join(chars)
+    try:
+        chars = chars.encode('latin-1').decode('utf-8')
+    except UnicodeError:
+        return group
+    chars = ['&#%d;' % ord(c) for c in chars]
+    return ''.join(chars)
+indata = sys.stdin.read()  # assumed source; indata is otherwise undefined
+print >>sys.stderr, len(indata)
+
+# Make sure that there are only &#decimal; references,
+# and that all &# occurrences are markup
+assert indata.find('&#x') == -1
+assert indata.find('<![CDATA[') == -1
+
+pos = 0
+while True:
+    m = expr.search(indata, pos)
+    if m is None:
+        sys.stdout.write(indata[pos:])
+        break
+    sys.stdout.write(indata[pos:m.start()])
+    sys.stdout.write(recode(m.group()))
+    pos = m.end()
+
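
The recoding step described in the script's header comment can be tried in isolation. The following is a minimal Python 3 sketch of the same idea, not the committed code (which targets Python 2). The names REF_RUN, recode_run and fix_text are hypothetical and chosen only for this illustration; the sample strings are made up as well.

```python
import re

# Same idea as the committed pattern: one or more adjacent decimal references.
REF_RUN = re.compile(r'(?:&#[0-9]+;)+')


def recode_run(run):
    """Re-decode a run of decimal references if it spells out valid UTF-8."""
    code_points = [int(n) for n in re.findall(r'&#([0-9]+);', run)]
    text = ''.join(chr(cp) for cp in code_points)
    try:
        # Latin-1 maps code points 0-255 back to the same byte values.
        # A code point above 255 raises UnicodeEncodeError, and bytes that
        # are not valid UTF-8 raise UnicodeDecodeError; both leave the run
        # untouched, which covers data that really was Latin-1.
        fixed = text.encode('latin-1').decode('utf-8')
    except UnicodeError:
        return run
    return ''.join('&#%d;' % ord(c) for c in fixed)


def fix_text(data):
    """Apply recode_run to every run of references found in data."""
    return REF_RUN.sub(lambda m: recode_run(m.group()), data)


if __name__ == '__main__':
    # U+00F6 is the UTF-8 byte pair 0xC3 0xB6, exported as two references.
    print(fix_text('L&#195;&#182;wis'))
    # A lone &#246; is not valid UTF-8 when read as a byte, so it is kept.
    print(fix_text('L&#246;wis'))
```

Using re.sub with a callback replaces the explicit search loop of the original; the output should be the same for input that satisfies the script's stated assumptions (ASCII file, decimal references only).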

