[Python-Dev] unicode() and its error argument

Tim Peters tim.one@comcast.net
Sun, 16 Jun 2002 12:41:23 -0400


[Skip Montanaro]
> ...
> Tim's inability to provoke errors was also suggestive that it was pilot
> error, not a problem with the plane.

Ya, but what do I know about encodings?  "Nothing" is right -- that's why I
wrote a program to generate stuff at random.

Taking that another step, to generate the encoding at random too, turns up
at least one way to crash Python:  the attached program eventually crashes
when doing a utf7 decode.  It appears to be in this line:

            if ((ch == '-') || !B64CHAR(ch)) {

and ch "is big" when it blows up.  I assume this is because B64CHAR(ch)
expands in part to isalnum(ch), and on Windows the latter is done via array
lookup (and ch is out-of-bounds).
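
To make that guess concrete, here is a rough Python model of a table-driven isalnum() and of a B64CHAR-style test built on it. This is a sketch only, not the C source: ALNUM_TABLE, table_isalnum and both is_b64char_* functions are made-up names, the '+' and '/' cases are assumed from the base64 alphabet, and the range check in the guarded version is one possible fix, not necessarily the one that belongs in Python. In C the out-of-range index reads whatever memory happens to be there; Python raises IndexError instead, which makes the hazard visible.

```python
import string

ALNUM = string.ascii_letters + string.digits

# Hypothetical stand-in for a C runtime's ctype table: one flag per byte value.
ALNUM_TABLE = [chr(i) in ALNUM for i in range(256)]

def table_isalnum(ch):
    # Unchecked table lookup, as the post assumes isalnum() does on Windows.
    return ALNUM_TABLE[ch]

def is_b64char_unguarded(ch):
    # Models a macro that hands ch straight to isalnum(); ch is a code point.
    return table_isalnum(ch) or ch in (ord('+'), ord('/'))

def is_b64char_guarded(ch):
    # Range-check first, so wide code points never reach the table.
    return 0 <= ch < 128 and (table_isalnum(ch) or ch in (ord('+'), ord('/')))
```

With a "big" ch such as 0x20AC, is_b64char_unguarded() raises IndexError here, while is_b64char_guarded() simply returns False.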

Other failures I've seen out of this are benign, like

>>> unicode('\xf1R\x7f^C\x1e\xd8', 'hex_codec', 'ignore')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\CODE\PYTHON\lib\encodings\hex_codec.py", line 41, in hex_decode
    assert errors == 'strict'
AssertionError
>>>
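
That one is benign because it surfaces as an ordinary Python exception: per the traceback, hex_codec just asserts errors == 'strict'. A decoder could reject other handlers with an explicit check instead of an assert. The sketch below is hypothetical (hex_decode_checked is not part of the encodings package) and only illustrates the idea:

```python
import binascii

def hex_decode_checked(data, errors='strict'):
    # Hypothetical variant of a hex decoder: refuse unsupported error
    # handlers with ValueError rather than relying on an assert.
    if errors != 'strict':
        raise ValueError(
            "hex decoding supports only errors='strict', got %r" % (errors,))
    return binascii.unhexlify(data)
```

Either way the caller gets a catchable exception, which is the difference between this failure and the utf7 crash.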



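from random import choice, randint
from traceback import print_exc

# All 256 possible one-byte strings, plus a shorthand for joining them.
bytes = [chr(i) for i in range(256)]
paste = ''.join

def generrors(encoding, errors, maxlen, maxtries):
    # Decode up to maxtries random strings of 1..maxlen bytes.
    # Return 1 at the first exception, 0 if nothing failed.
    for dummy in xrange(maxtries):
        n = randint(1, maxlen)
        raw = paste([choice(bytes) for dummy in range(n)])
        try:
            u = unicode(raw, encoding, errors)
        except:
            # Catch anything, not just UnicodeError (e.g. AssertionError).
            print 'failure in unicode(%r, %r, %r)' % (raw, encoding, errors)
            print_exc(0)    # limit=0: just the exception, no stack frames
            return 1
    return 0

# Collect each distinct codec name once (many aliases map to the same codec).
from encodings.aliases import aliases
unique = aliases.values()
unique = dict(zip(unique, unique)).keys()

# Keep picking codecs at random; one that never fails stays in the pool,
# so this runs until every codec has failed once or the interpreter crashes.
while unique:
    e = choice(unique)
    print
    print 'Trying', e
    if generrors(e, 'ignore', 10, 1000):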
        unique.remove(e)
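
For anyone without a 2.x interpreter handy, a rough Python 3 rendering of the same idea follows. It is a sketch under stated assumptions, not a port of record: bytes.decode() stands in for unicode(), a seeded random.Random is passed in so runs are repeatable, and survey() (a made-up name) makes a single pass over the codecs instead of looping until everything has failed.

```python
import random
from encodings.aliases import aliases

def generrors(encoding, errors, maxlen, maxtries, rng):
    """Return the first (raw, exception) pair that fails to decode, else None."""
    for _ in range(maxtries):
        n = rng.randint(1, maxlen)
        raw = bytes(rng.randrange(256) for _ in range(n))
        try:
            raw.decode(encoding, errors)
        except Exception as exc:
            # Any exception counts, including LookupError for codecs that
            # bytes.decode() refuses to use.
            return raw, exc
    return None

def survey(errors='ignore', maxlen=10, maxtries=1000, seed=0):
    """Try every distinct codec name once; map each failing name to its failure."""
    rng = random.Random(seed)
    failures = {}
    for encoding in sorted(set(aliases.values())):
        hit = generrors(encoding, errors, maxlen, maxtries, rng)
        if hit is not None:
            failures[encoding] = hit
    return failures
```

Calling survey() returns a dict of codec name to (input, exception), which can then be printed or filtered by exception type.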