[Python-Dev] Ill-defined encoding for CP875?
Tim Peters
tim.one@home.com
Sat, 12 May 2001 17:22:49 -0400
[/F]
> reverse sorting makes sense to me. but the cp-files appear to be
> machine generated, so patching that python file won't help.
Agreed.
> a truly future-proof solution would be to specify exactly how to
> resolve every many-to-one mapping, for every font having that
> problem. but sorting them is clearly better than relying on
> implementation-dependent behaviour...
The attached program suggests the problem is rare; of those encoding files
that have a Python decoding_map dict, only these triggered a meaningful
ambiguity complaint:
*** cp1006.py maps 0xfe8e back to 0xb1, 0xb2
*** cp875.py maps 0x1a back to 0x3f, 0xdc, 0xe1, 0xec, 0xed, 0xfc, 0xfd
Then since test_unicode only checks for roundtrip across range(0x80), cp875
is the only one that *can* fail (the ambiguities in cp1006 are for points >
0x7f, so aren't tested here).
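To make that failure mode concrete, here is a minimal sketch in modern Python 3 (not the code from the encodings package). It uses only the seven cp875 byte values listed above, and it assumes the encoding map is built by plain inversion of the decoding map, so that the last byte visited wins. The helper names invert and roundtrip_failures are made up for illustration.

```python
# Toy decoding map: only the cp875 entries quoted in the message above,
# i.e. seven byte values that all decode to U+001A.
decoding_map = {b: 0x1A for b in (0x3F, 0xDC, 0xE1, 0xEC, 0xED, 0xFC, 0xFD)}


def invert(decoding_map, reverse):
    """Build an encoding map by visiting byte values in sorted order.

    Later assignments overwrite earlier ones, so the last byte visited
    wins whenever several bytes decode to the same code point.
    """
    encoding_map = {}
    for byte in sorted(decoding_map, reverse=reverse):
        encoding_map[decoding_map[byte]] = byte
    return encoding_map


def roundtrip_failures(decoding_map, encoding_map, limit=0x80):
    """Bytes below `limit` that do not survive decode followed by encode."""
    return [b for b in range(limit)
            if b in decoding_map and encoding_map[decoding_map[b]] != b]


descending = invert(decoding_map, reverse=True)   # lowest byte wins
ascending = invert(decoding_map, reverse=False)   # highest byte wins

print(hex(descending[0x1A]), roundtrip_failures(decoding_map, descending))
print(hex(ascending[0x1A]),
      [hex(b) for b in roundtrip_failures(decoding_map, ascending)])
```

Under these assumptions, visiting the bytes in descending order leaves 0x3f as the winner, so the one ambiguous byte inside range(0x80) round-trips. Ascending order picks 0xfd and byte 0x3f fails. An unsorted dict walk could give either result.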
Hmm! Now I see that in a part of test_unicode that wasn't reached, cp875 and
cp1006 are excluded, with this comment:
### These fail the round-trip:
#'cp1006', 'cp875', 'iso8859_8',
So the practical hack for now is to exclude cp875 from the earlier range(128)
roundtrip test too.
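A rough sketch of what such an exclusion could look like, written for modern Python 3 rather than as the actual test_unicode code. The names roundtrips, EXCLUDED and CANDIDATES are hypothetical, and the candidate list is an arbitrary illustration, not the list the test suite uses.

```python
def roundtrips(encoding, limit=0x80):
    """True if every byte below `limit` survives decode followed by encode."""
    data = bytes(range(limit))
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # An undefined mapping in either direction counts as a failure.
        return False


EXCLUDED = {"cp875"}                       # the exclusion suggested above
CANDIDATES = ["ascii", "latin_1", "cp875"]  # illustrative selection only

results = {}
for name in CANDIDATES:
    if name in EXCLUDED:
        results[name] = None               # skipped, not tested
        print(name, "skipped")
        continue
    results[name] = roundtrips(name)
    print(name, "ok" if results[name] else "*** failed round-trip")
```

Keeping the exclusions in one named set makes it easy to see which codecs are being skipped, and to drop an entry once its mapping is made unambiguous.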
> (is Jython using exactly the same hashing and dictionary algorithms as
> CPython? or does it work by accident also under Jython?)
Sorry, no idea. Attempting to browse the Jython source on SourceForge caused
this cute behavior:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/jython/jython/Lib/
Python Exception Occurred
Traceback (innermost last):
  File "/usr/lib/cgi-bin/viewcvs.cgi", line 2286, in ?
    main()
  File "/usr/lib/cgi-bin/viewcvs.cgi", line 2253, in main
    view_directory(request)
  File "/usr/lib/cgi-bin/viewcvs.cgi", line 1043, in view_directory
    fileinfo, alltags = get_logs(full_name, rcs_files, view_tag)
  File "/usr/lib/cgi-bin/viewcvs.cgi", line 987, in get_logs
    raise 'error during rlog: '+hex(status)
error during rlog: 0x100
let's-rewrite-it-in-php<wink>-ly y'rs - tim
ENCODING_DIR = "../Lib/encodings"

import os
import imp

def d(w):
    if type(w) is type(6):
        return hex(w)
    else:
        return repr(w)

encfiles = [name for name in os.listdir(ENCODING_DIR)
            if name.endswith(".py") and name[0] != "_"]
for fname in encfiles:
    path = os.path.join(ENCODING_DIR, fname)
    f = open(path)
    module = imp.load_source(fname[:-3], path, f)
    f.close()
    decode = getattr(module, "decoding_map", None)
    if decode is None:
        print fname, "doesn't have decoding_map."
        continue
    vtok = {}
    for k, v in decode.items():
        if v in vtok:
            vtok[v].append(k)
        else:
            vtok[v] = [k]
    ambiguous = [(v, ks) for v, ks in vtok.items()
                 if len(ks) > 1]
    if ambiguous:
        for v, ks in ambiguous:
            ks.sort()
            print "***", fname, "maps", d(v), "back to", \
                  ", ".join(map(d, ks))
    else:
        print fname, "is free of ambiguous reverse maps."
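
The program above is Python 2 and relies on the imp module, which newer Python 3 releases no longer ship. The following is a hedged Python 3 sketch of the same check, not a drop-in replacement. The function name ambiguous_reverse_maps is made up, and skipping None values is an assumption about what a "meaningful" complaint means (None marks an undefined byte in these maps). Codec modules in a current installation may not define decoding_map at all, in which case the loop just reports that.

```python
import importlib


def ambiguous_reverse_maps(decoding_map):
    """Return (value, sorted_keys) pairs for values reached from several keys.

    None values are skipped: they mark undefined bytes, so several bytes
    sharing None is assumed not to be a meaningful ambiguity.
    """
    vtok = {}
    for k, v in decoding_map.items():
        if v is None:
            continue
        vtok.setdefault(v, []).append(k)
    return sorted((v, sorted(ks)) for v, ks in vtok.items() if len(ks) > 1)


# Illustrative selection of installed codec modules, not a full directory scan.
for name in ("cp437", "cp875"):
    module = importlib.import_module("encodings." + name)
    decode = getattr(module, "decoding_map", None)
    if decode is None:
        print(name, "doesn't have decoding_map.")
        continue
    ambiguous = ambiguous_reverse_maps(decode)
    if not ambiguous:
        print(name, "is free of ambiguous reverse maps.")
    for v, ks in ambiguous:
        print("***", name, "maps", hex(v), "back to",
              ", ".join(hex(k) for k in ks))
```

Importing through the encodings package checks whatever is installed, whereas the original reads source files from a checkout directory, so the two can report different results.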