[Python-Dev] Ill-defined encoding for CP875?

Tim Peters tim.one@home.com
Sat, 12 May 2001 17:22:49 -0400


[/F]
> reverse sorting makes sense to me.  but the cp-files appear to be
> machine generated, so patching that python file won't help.

Agreed.

> a truly future-proof solution would be to specify exactly how to
> resolve every many-to-one mapping, for every font having that
> problem.  but sorting them is clearly better than relying on
> implementation-dependent behaviour...

The attached program suggests the problem is rare; of those encoding files
that have a Python decoding_map dict, only these triggered a meaningful
ambiguity complaint:

*** cp1006.py maps 0xfe8e back to 0xb1, 0xb2
*** cp875.py maps 0x1a back to 0x3f, 0xdc, 0xe1, 0xec, 0xed, 0xfc, 0xfd

Then since test_unicode only checks for roundtrip across range(0x80), cp875
is the only one that *can* fail (the ambiguities in cp1006 are for points >
0x7f, so aren't tested here).
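
To make the failure mode concrete, here is a minimal sketch in present-day Python 3 (it is not the code in the encodings package). The toy map is only modelled on the cp875 complaint above, the names toy_decoding_map, invert_deterministically and roundtrip_failures are made up for illustration, and "smallest byte wins" is just one reading of the sorting idea:

```python
# Toy decoding map (byte -> code point). A hypothetical excerpt modelled
# on the cp875 report: several bytes all decode to 0x1a.
toy_decoding_map = {
    0x3F: 0x1A,
    0xDC: 0x1A,
    0xE1: 0x1A,
    0x40: 0x20,
}

def invert_deterministically(decoding_map):
    """Invert byte -> code point, letting the smallest byte win on ties."""
    encoding_map = {}
    # Visit keys from largest to smallest, so the last write for any
    # code point comes from the smallest byte that decodes to it.
    for byte in sorted(decoding_map, reverse=True):
        encoding_map[decoding_map[byte]] = byte
    return encoding_map

def roundtrip_failures(decoding_map, encoding_map, limit=0x80):
    """Return the bytes below `limit` that do not survive decode-then-encode."""
    return [b for b in sorted(decoding_map)
            if b < limit and encoding_map[decoding_map[b]] != b]

encoding_map = invert_deterministically(toy_decoding_map)
print(hex(encoding_map[0x1A]))
print([hex(b) for b in roundtrip_failures(toy_decoding_map, encoding_map)])
print([hex(b) for b in roundtrip_failures(toy_decoding_map, encoding_map, 0x100)])
```

With this choice the bytes below 0x80 round-trip, while the larger bytes that share a code point still cannot, whichever ordering is picked.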

Hmm!  Now I see that in a part of test_unicode that wasn't reached, cp875 and
cp1006 are excluded, with this comment:

    ### These fail the round-trip:
    #'cp1006', 'cp875', 'iso8859_8',

So the practical hack for now is to exclude cp875 from the earlier range(128)
roundtrip test too.

> (is Jython using exactly the same hashing and dictionary algorithms as
> CPython?  or does it work by accident also under Jython?)

Sorry, no idea.  Attempting to browse the Jython source on SourceForge caused
this cute behavior:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/jython/jython/Lib/

    Python Exception Occurred

    Traceback (innermost last):
      File "/usr/lib/cgi-bin/viewcvs.cgi", line 2286, in ?
        main()
      File "/usr/lib/cgi-bin/viewcvs.cgi", line 2253, in main
        view_directory(request)
      File "/usr/lib/cgi-bin/viewcvs.cgi", line 1043, in view_directory
        fileinfo, alltags = get_logs(full_name, rcs_files, view_tag)
      File "/usr/lib/cgi-bin/viewcvs.cgi", line 987, in get_logs
        raise 'error during rlog: '+hex(status)
    error during rlog: 0x100

let's-rewrite-it-in-php<wink>-ly y'rs  - tim

ENCODING_DIR = "../Lib/encodings"

import os
import imp

def d(w):
    if type(w) is type(6):
        return hex(w)
    else:
        return repr(w)

encfiles = [name for name in os.listdir(ENCODING_DIR)
                 if name.endswith(".py") and name[0] != "_"]

for fname in encfiles:
    path = os.path.join(ENCODING_DIR, fname)
    f = open(path)
    module = imp.load_source(fname[:-3], path, f)
    f.close()
    decode = getattr(module, "decoding_map", None)
    if decode is None:
        print fname, "doesn't have decoding_map."
        continue
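    # Invert the map: for each decoded value, collect every key mapping to it.
    vtok = {}
    for k, v in decode.items():
        if v in vtok:
            vtok[v].append(k)
        else:
            vtok[v] = [k]
    # A value reachable from more than one key has an ambiguous reverse map.
    ambiguous = [(v, ks) for v, ks in vtok.items()
                         if len(ks) > 1]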
    if ambiguous:
        for v, ks in ambiguous:
            ks.sort()
            print "***", fname, "maps", d(v), "back to", \
                  ", ".join(map(d, ks))
    else:
        print fname, "is free of ambiguous reverse maps."
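
For readers on a current interpreter, where the print statement and the imp module are no longer available, the grouping step of the program above can be sketched in Python 3 roughly as follows. This is an assumption-laden sketch rather than a port: ambiguous_reverse_maps and sample_map are hypothetical names, and sample_map is made-up data shaped like the cp1006 complaint instead of a real codec module.

```python
from collections import defaultdict

def ambiguous_reverse_maps(decoding_map):
    """Map each decoded value to its sorted keys, keeping only values
    that are reachable from more than one key."""
    keys_by_value = defaultdict(list)
    for key, value in decoding_map.items():
        keys_by_value[value].append(key)
    return {value: sorted(keys)
            for value, keys in keys_by_value.items()
            if len(keys) > 1}

# Hypothetical sample data: two keys share one value, a third does not.
sample_map = {0xB1: 0xFE8E, 0xB2: 0xFE8E, 0xB3: 0x0041}

for value, keys in ambiguous_reverse_maps(sample_map).items():
    print("*** sample maps", hex(value), "back to",
          ", ".join(hex(k) for k in keys))
```

Loading the real encoding modules would need importlib in place of imp.load_source; that part is left out here.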