[Python-bugs-list] [ python-Bugs-634246 ] for line in file iterator ignores codecs

noreply@sourceforge.net noreply@sourceforge.net
Tue, 05 Nov 2002 20:14:51 -0800


Bugs item #634246, was opened at 2002-11-06 04:14
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=634246&group_id=5470

Category: Unicode
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Greg Aumann (gaumann)
Assigned to: M.-A. Lemburg (lemburg)
Summary: for line in file iterator ignores codecs

Initial Comment:
If you use the old file-reading idiom with a codec, the
lines are converted to Unicode, but if you use the new
iterator idiom they keep their original encoding
and are returned as non-Unicode strings. The new
"for line in file:" idiom should give the same result
as the old "while 1: ..." idiom.

I came across this when using the pythonzh Chinese
codecs, but the code below uses the cp1252 encoding to
illustrate the problem because everyone should have
that codec. The symptoms are the same with both codecs.

I am using Python 2.2.2 on Windows 2000.

The following code illustrates the problem:
------------------------------------------------------------------------
"""Check readline iterator using a codec."""

import codecs

fname = 'tmp.txt'
f = file(fname, 'w')
for i in range(0x82, 0x8c):
    f.write( '%x, %s\n' % (i, chr(i)))
f.close()

def test_iter():
    print '\ntesting codec iterator.'
    f = codecs.open(fname, 'r', 'cp1252')
    for line in f:
        l = line.rstrip()
        print repr(l)
        print repr(l.decode('cp1252'))
    f.close()

def test_readline():
    print '\ntesting codec readline.'
    f = codecs.open(fname, 'r', 'cp1252')
    while 1:
        line = f.readline()
        if not line:
            break
        l = line.rstrip()
        print repr(l)
        try:
            print repr(l.decode('cp1252'))
        except AttributeError, msg:
            print 'AttributeError', msg
    f.close()

test_iter()
test_readline()
------------------------------------------------------------------------
This code gives the following output:
------------------------------------------------------------------------
testing codec iterator.
'82, \x82'
u'82, \u201a'
'83, \x83'
u'83, \u0192'
'84, \x84'
u'84, \u201e'
'85, \x85'
u'85, \u2026'
'86, \x86'
u'86, \u2020'
'87, \x87'
u'87, \u2021'
'88, \x88'
u'88, \u02c6'
'89, \x89'
u'89, \u2030'
'8a, \x8a'
u'8a, \u0160'
'8b, \x8b'
u'8b, \u2039'

testing codec readline.
u'82, \u201a'
AttributeError 'unicode' object has no attribute 'decode'
u'83, \u0192'
AttributeError 'unicode' object has no attribute 'decode'
u'84, \u201e'
AttributeError 'unicode' object has no attribute 'decode'
u'85, \u2026'
AttributeError 'unicode' object has no attribute 'decode'
u'86, \u2020'
AttributeError 'unicode' object has no attribute 'decode'
u'87, \u2021'
AttributeError 'unicode' object has no attribute 'decode'
u'88, \u02c6'
AttributeError 'unicode' object has no attribute 'decode'
u'89, \u2030'
AttributeError 'unicode' object has no attribute 'decode'
u'8a, \u0160'
AttributeError 'unicode' object has no attribute 'decode'
u'8b, \u2039'
AttributeError 'unicode' object has no attribute 'decode'
------------------------------------------------------------------------
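
Until the iterator honours the codec, one possible workaround is to
drive the loop from readline() explicitly, since the output above
shows that readline() does decode. The following is only a sketch:
the helper name read_decoded_lines is hypothetical and not part of
the codecs module. It relies on the two-argument form of iter(),
which calls f.readline() repeatedly until it returns the empty string.

```python
import codecs

def read_decoded_lines(path, encoding):
    """Return all lines of the file, each decoded by the codec's readline().

    Hypothetical helper for illustration; not part of the codecs module.
    """
    f = codecs.open(path, 'r', encoding)
    try:
        # iter(callable, sentinel) keeps calling f.readline() until it
        # returns '' (end of file), so every line goes through the codec.
        return list(iter(f.readline, ''))
    finally:
        f.close()
```

Calling read_decoded_lines(fname, 'cp1252') on the test file should
give the same Unicode strings as the "while 1:" loop in test_readline(),
with the line endings still attached.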

