[Python-bugs-list] [ python-Bugs-634246 ] for line in file iterator ignores codecs

noreply@sourceforge.net
Wed, 06 Nov 2002 08:39:25 -0800


Bugs item #634246, was opened at 2002-11-06 05:14
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=634246&group_id=5470

Category: Unicode
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Greg Aumann (gaumann)
>Assigned to: Walter Dörwald (doerwalter)
Summary: for line in file iterator ignores codecs

Initial Comment:
If you use the old file-reading idiom with a codec, the
lines are converted to unicode, but if you use the new
iterator idiom they retain the original encoding and
are returned as non-unicode strings. The new
"for line in file:" idiom should give the same result
as the old "while 1: ..." readline loop.

I came across this when using the pythonzh Chinese
codecs, but the code below uses the cp1252 encoding to
illustrate the problem because everyone should have
that codec. The symptoms are the same with both codecs.

I am using Python 2.2.2 on Win2k.

The following code illustrates the problem:
------------------------------------------------------------------------
"""Check readline iterator using a codec."""

import codecs

fname = 'tmp.txt'
f = file(fname, 'w')
for i in range(0x82, 0x8c):
    f.write( '%x, %s\n' % (i, chr(i)))
f.close()

def test_iter():
    print '\ntesting codec iterator.'
    f = codecs.open(fname, 'r', 'cp1252')
    for line in f:
        l = line.rstrip()
        print repr(l)
        print repr(l.decode('cp1252'))
    f.close()

def test_readline():
    print '\ntesting codec readline.'
    f = codecs.open(fname, 'r', 'cp1252')
    while 1:
        line = f.readline()
        if not line:
            break
        l = line.rstrip()
        print repr(l)
        try:
            print repr(l.decode('cp1252'))
        except AttributeError, msg:
            print 'AttributeError', msg
    f.close()

test_iter()
test_readline()
------------------------------------------------------------------------
This code gives the following output:
------------------------------------------------------------------------
testing codec iterator.
'82, \x82'
u'82, \u201a'
'83, \x83'
u'83, \u0192'
'84, \x84'
u'84, \u201e'
'85, \x85'
u'85, \u2026'
'86, \x86'
u'86, \u2020'
'87, \x87'
u'87, \u2021'
'88, \x88'
u'88, \u02c6'
'89, \x89'
u'89, \u2030'
'8a, \x8a'
u'8a, \u0160'
'8b, \x8b'
u'8b, \u2039'

testing codec readline.
u'82, \u201a'
AttributeError 'unicode' object has no attribute 'decode'
u'83, \u0192'
AttributeError 'unicode' object has no attribute 'decode'
u'84, \u201e'
AttributeError 'unicode' object has no attribute 'decode'
u'85, \u2026'
AttributeError 'unicode' object has no attribute 'decode'
u'86, \u2020'
AttributeError 'unicode' object has no attribute 'decode'
u'87, \u2021'
AttributeError 'unicode' object has no attribute 'decode'
u'88, \u02c6'
AttributeError 'unicode' object has no attribute 'decode'
u'89, \u2030'
AttributeError 'unicode' object has no attribute 'decode'
u'8a, \u0160'
AttributeError 'unicode' object has no attribute 'decode'
u'8b, \u2039'
AttributeError 'unicode' object has no attribute 'decode'
------------------------------------------------------------------------


----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2002-11-06 17:39

Message:

Walter, the patch looks good. Please check it in.

A patch for StreamRecoder should look the same,
since that's how file iterators should work (i.e.,
return a complete line in each iteration).
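
For reference, a minimal Python 3 sketch of that behaviour. It assumes the present-day codecs module rather than the 2.2.2 code under discussion: there, codecs.EncodedFile returns a StreamRecoder, and iterating over it yields one complete, recoded line per step. The sample bytes are made up for illustration.

```python
import codecs
import io

# Made-up sample data: two cp1252-encoded lines.
raw = b"82, \x82\n83, \x83\n"

# EncodedFile wraps the byte stream in a StreamRecoder that decodes
# cp1252 on read and hands back the text re-encoded as UTF-8 bytes.
recoder = codecs.EncodedFile(
    io.BytesIO(raw), data_encoding="utf-8", file_encoding="cp1252"
)

# Each iteration step returns one whole line, already recoded.
lines = list(recoder)
```

Each element of lines is a full line including its newline, so the recoder iterates the same way a plain file does.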

Thanks.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-11-06 16:26

Message:

The problem is that StreamReader and StreamReaderWriter
don't define the iterator methods next() and __iter__().
Lookups for them fall through __getattr__ to the underlying
file, so your "for line in f:" still runs, but it iterates
over the raw file and bypasses the codec.

The attached patch (diff.txt) adds next() and __iter__() to
StreamReader and StreamReaderWriter. I don't know what must
be done for StreamRecoder.
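
The mechanism described above can be sketched in Python 3 as follows. This is not the attached diff.txt and not the real codecs source: NaiveReader and IterableReader are hypothetical stand-ins, and the iterator method is spelled __next__() as Python 3 requires. Python 3's "for" statement looks up __iter__ on the type, so the sketch calls the delegated method explicitly to mimic the fall-through.

```python
import io


class NaiveReader:
    """Hypothetical stand-in for a codec reader without iterator methods."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding

    def readline(self):
        # Decode one raw line from the wrapped byte stream.
        return self.stream.readline().decode(self.encoding)

    def __getattr__(self, name):
        # Anything not defined on the wrapper is taken from the stream.
        return getattr(self.stream, name)


class IterableReader(NaiveReader):
    """Same reader with the iterator protocol defined on the wrapper."""

    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line


# Made-up sample data: two cp1252-encoded lines.
raw = b"82, \x82\n83, \x83\n"

naive = NaiveReader(io.BytesIO(raw), "cp1252")
# __iter__ is missing on the wrapper, so this resolves through
# __getattr__ to the byte stream's iterator and skips the decoding.
undecoded = list(naive.__iter__())

fixed = IterableReader(io.BytesIO(raw), "cp1252")
# Iteration now goes through readline() and therefore through the codec.
decoded = list(fixed)
```

Here undecoded holds raw byte strings while decoded holds decoded text, which is the difference between the two halves of the reported output.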

BTW, your decode calls are wrong: readline *does*
return unicode objects, and you can only encode unicode
objects to str objects or decode str objects into unicode
objects. The calls work in your iterator case only because
the fallback iterator implementation from the underlying
stream returns str objects.
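
The direction of the two conversions, as a small Python 3 sketch. It rests on one assumption: Python 3's bytes plays the role of the 2.2 str type and Python 3's str plays the role of unicode. The sample value is taken from the first line of the reported output.

```python
# Byte string as it sits in the file (cp1252-encoded).
raw = b"82, \x82"

# Decoding goes from bytes to text.
text = raw.decode("cp1252")

# Encoding goes from text back to bytes.
round_trip = text.encode("cp1252")

# Text objects offer no decode(), which is what the AttributeError
# in the readline case of the report reflects.
text_has_decode = hasattr(text, "decode")
```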

----------------------------------------------------------------------
