[Python-bugs-list] [ python-Bugs-706595 ] codecs.open and iterators

SourceForge.net noreply@sourceforge.net
Thu, 20 Mar 2003 01:35:23 -0800


Bugs item #706595, was opened at 2003-03-20 00:02
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=706595&group_id=5470

Category: Python Library
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Todd Reed (toddreed)
>Assigned to: M.-A. Lemburg (lemburg)
Summary: codecs.open and iterators

Initial Comment:
Greg Aumann originally posted this problem in 
comp.lang.python on Nov 4, 2002, but I could not find a 
bug report.  I've simply copied his news post, which 
explains the problem:
-----------
Recently I figured out how to use iterators and 
generators. Quite easy to
use and a great improvement.

But when I refactored some of my code I came across a 
discrepancy that seems
like it must be a bug. If you use the old file reading idiom 
with a codec
the lines are converted to unicode but if you use the new 
iterators idiom
then they retain the original encoding and the line is 
returned in non
unicode strings. Surely using the new "for line in file:" 
idiom should give
the same result as the old, "while 1: ...."

I came across this when using the pythonzh Chinese 
codecs but the below code
uses the cp1252 encoding to illustrate the problem 
because everyone should
have those codecs. The symptoms are the same with 
both codecs.

I am using python 2.2.2 on win2k.

Is this definitely a bug, or is it an undocumented 'feature' 
of the codecs
module?

Greg Aumann

The following code illustrates the problem:
------------------------------------------------------------------------
"""Check readline iterator using a codec."""

import codecs

fname = 'tmp.txt'
f = file(fname, 'w')
for i in range(0x82, 0x8c):
    f.write( '%x, %s\n' % (i, chr(i)))
f.close()

def test_iter():
    print '\ntesting codec iterator.'
    f = codecs.open(fname, 'r', 'cp1252')
    for line in f:
        l = line.rstrip()
        print repr(l)
        print repr(l.decode('cp1252'))
    f.close()

def test_readline():
    print '\ntesting codec readline.'
    f = codecs.open(fname, 'r', 'cp1252')
    while 1:
        line = f.readline()
        if not line:
            break
        l = line.rstrip()
        print repr(l)
        try:
            print repr(l.decode('cp1252'))
        except AttributeError, msg:
            print 'AttributeError', msg
    f.close()

test_iter()
test_readline()
------------------------------------------------------------------------
This code gives the following output:
------------------------------------------------------------------------
testing codec iterator.
'82, \x82'
u'82, \u201a'
'83, \x83'
u'83, \u0192'
'84, \x84'
u'84, \u201e'
'85, \x85'
u'85, \u2026'
'86, \x86'
u'86, \u2020'
'87, \x87'
u'87, \u2021'
'88, \x88'
u'88, \u02c6'
'89, \x89'
u'89, \u2030'
'8a, \x8a'
u'8a, \u0160'
'8b, \x8b'
u'8b, \u2039'

testing codec readline.
u'82, \u201a'
AttributeError 'unicode' object has no attribute 'decode'
u'83, \u0192'
AttributeError 'unicode' object has no attribute 'decode'
u'84, \u201e'
AttributeError 'unicode' object has no attribute 'decode'
u'85, \u2026'
AttributeError 'unicode' object has no attribute 'decode'
u'86, \u2020'
AttributeError 'unicode' object has no attribute 'decode'
u'87, \u2021'
AttributeError 'unicode' object has no attribute 'decode'
u'88, \u02c6'
AttributeError 'unicode' object has no attribute 'decode'
u'89, \u2030'
AttributeError 'unicode' object has no attribute 'decode'
u'8a, \u0160'
AttributeError 'unicode' object has no attribute 'decode'
u'8b, \u2039'
AttributeError 'unicode' object has no attribute 'decode'
------------------------------------------------------------------------


----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2003-03-20 10:35

Message:
Logged In: YES 
user_id=38388

That's a bug in the iterator support which was added
to the codecs module: the .next() methods should not
call the .next() methods on the reader directly, but instead
redirect to the .readline() method.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=706595&group_id=5470