Access violation in Python shell

David Hughes dfh at forestfield.co.uk
Sat Oct 19 02:22:42 EDT 2002


I wrote the following piece of code to clarify for myself what happens
when Python coerces byte strings into unicode when bytes > 127 are present
and the default encoding is ASCII.
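
For reference, the coercion in question can be sketched explicitly. This is
only an illustration, written in Python 3 syntax, where the decoding step has
to be spelled out. It assumes that the implicit coercion amounts to an ASCII
decode of the byte string, and the variable names are made up for the example.

```python
# Illustration only (Python 3 syntax); all names here are hypothetical.
raw = b'g\xe2teau'          # byte string containing a byte > 127

# Decoding as ASCII fails on the 0xE2 byte.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with an explicit 8-bit encoding succeeds.
text = raw.decode('latin-1')

print(ascii_failed)
print(text == 'g\xe2teau')
```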

#---uni01.py---
# Test the combinations of unicode with 8-bit bytes, i.e. non-ASCII

def cencode(s):
    """Conditionally encode 's' if unicode, else do nothing.
       Needed to prevent an error when attempting to encode a
       byte string containing codes > 127.
    """
    import types
    if isinstance(s, types.UnicodeType):
        return s.encode('latin-1')
    else:
        return s

ch = [ 'a', '\xe2', u'a', u'\xe2' ]

word = [ 'gateau', 'g\xe2teau', u'gateau', u'g\xe2teau']

for k in range(3):
    print '\nIteration', k+1
    for w in word:
        for c in ch:
            try:
                if c in w:
                    print 'yes', cencode(c), type(c), cencode(w), type(w)
                else:
                    print 'no',  cencode(c), type(c), cencode(w), type(w)
            except UnicodeError, e:
                print 'Unicode error.', e,          \
                          cencode(c), type(c), cencode(w), type(w)
            except TypeError, e:
                print 'Type error.', e,             \
                          cencode(c), type(c), cencode(w), type(w)
        print

---End---

It ran as expected on the first iteration, and the exceptions make sense
after thinking about them (although I don't know why some come up as
TypeErrors). But during the second iteration the value of ch[2] somehow
changes, even though it is still OK at the end of the first iteration. On
attempting to re-run the code, the Python shell terminates. This, or
something like it, originally happened with Python 2.2, so I upgraded to
2.2.2.
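
One way to sidestep the mixed-type comparisons altogether, sketched here in
Python 3 syntax, is to decode every byte string explicitly before the
membership test. The helper to_text and the choice of Latin-1 are assumptions
for illustration only, and this does nothing to explain why ch[2] changes.

```python
# Illustration only (Python 3 syntax); to_text is a hypothetical helper.
def to_text(value, encoding='latin-1'):
    """Return 'value' as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

chars = [b'a', b'\xe2', 'a', '\xe2']
words = [b'gateau', b'g\xe2teau', 'gateau', 'g\xe2teau']

# One row per word, one column per character; both operands are
# normalised to text, so no implicit coercion is involved.
results = [[to_text(c) in to_text(w) for c in chars] for w in words]

for w, row in zip(words, results):
    print(repr(w), row)
```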

---Output---

E:\Pydevsrc\Test\unicode>python
Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> execfile('uni01.py')

Iteration 1
yes a <type 'str'> gateau <type 'str'>
no G <type 'str'> gateau <type 'str'>
yes a <type 'unicode'> gateau <type 'str'>
no G <type 'unicode'> gateau <type 'str'>

yes a <type 'str'> gGteau <type 'str'>
yes G <type 'str'> gGteau <type 'str'>
Unicode error. ASCII decoding error: ordinal not in range(128) a <type
'unicode'> gGteau <type 'str'>
Unicode error. ASCII decoding error: ordinal not in range(128) G <type
'unicode'> gGteau <type 'str'>

yes a <type 'str'> gateau <type 'unicode'>
Type error. 'in <string>' requires character as left operand G <type 'str'>
gateau <type 'unicode'>
yes a <type 'unicode'> gateau <type 'unicode'>
no G <type 'unicode'> gateau <type 'unicode'>

yes a <type 'str'> gGteau <type 'unicode'>
Type error. 'in <string>' requires character as left operand G <type 'str'>
gGteau <type 'unicode'>
yes a <type 'unicode'> gGteau <type 'unicode'>
yes G <type 'unicode'> gGteau <type 'unicode'>


Iteration 2
yes a <type 'str'> gateau <type 'str'>
no G <type 'str'> gateau <type 'str'>
yes a <type 'unicode'> gateau <type 'str'>
no G <type 'unicode'> gateau <type 'str'>

yes a <type 'str'> gGteau <type 'str'>
yes G <type 'str'> gGteau <type 'str'>
Unicode error. ASCII decoding error: ordinal not in range(128) a <type
'unicode'> gGteau <type 'str'>
Unicode error. ASCII decoding error: ordinal not in range(128) G <type
'unicode'> gGteau <type 'str'>

yes a <type 'str'> gateau <type 'unicode'>
Type error. 'in <string>' requires character as left operand G <type 'str'>
gateau <type 'unicode'>
Type error. 'in <string>' requires character as left operand
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "uni01.py", line 30, in ?
    print 'Type error.', e, cencode(c), type(c), cencode(w), type(w)
  File "uni01.py", line 10, in cencode
    return s.encode('latin-1')
UnicodeError: Latin-1 encoding error: ordinal not in range(256)
>>> ch
['a', '\xe2', u'g\x00\u0178t\x00\x00', u'\xe2']
>>> word
['gateau', 'g\xe2teau', u'gateau', u'g\xe2teau']
>>> execfile('uni01.py')

  [ Python shell crashed here ]

---------------------------------

The 8-bit character \xe2 was originally an a-circumflex. It rendered like a
Greek tau in the Python output but ends up as a 'G' in the above copy.
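
A guess at the rendering, assuming the console was using code page 437 (this
is not confirmed by anything above): in that code page the byte 0xE2 maps to
a Greek capital gamma rather than to a-circumflex, which could account for
both the tau-like glyph and the 'G'. A small Python 3 sketch of the
difference:

```python
# Illustration only (Python 3 syntax); cp437 is an assumed console code page.
import unicodedata

byte = b'\xe2'

as_latin1 = byte.decode('latin-1')   # interpretation intended in the script
as_cp437 = byte.decode('cp437')      # interpretation under the assumed code page

print(unicodedata.name(as_latin1))
print(unicodedata.name(as_cp437))
```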

Can anyone reproduce or shed any light on this problem, please, or am I
making a public demonstration of stupidity here?

Regards,
David Hughes





