[ python-Bugs-706595 ] codecs.open and iterators

Sat Jan 15 18:38:18 CET 2005

Bugs item #706595, was opened at 2003-03-19 20:02
Message generated for change (Comment added) made by facundobatista
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=706595&group_id=5470

Category: Python Library
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Todd Reed (toddreed)
Assigned to: M.-A. Lemburg (lemburg)
Summary: codecs.open and iterators

Initial Comment:
Greg Aumann originally posted this problem in comp.lang.python on
Nov 4, 2002, but I could not find a bug report.  I've simply copied
his news post, which explains the problem:
-----------
Recently I figured out how to use iterators and generators. Quite easy
to use and a great improvement.

But when I refactored some of my code I came across a discrepancy that
seems like it must be a bug. If you use the old file reading idiom with
a codec, the lines are converted to unicode, but if you use the new
iterator idiom, they retain the original encoding and each line is
returned as a non-unicode string. Surely using the new "for line in
file:" idiom should give the same result as the old "while 1: ...."
idiom.

I came across this when using the pythonzh Chinese codecs, but the code
below uses the cp1252 encoding to illustrate the problem because
everyone should have that codec. The symptoms are the same with both
codecs.

I am using python 2.2.2 on win2k.

Is this definitely a bug, or is it an undocumented 'feature' of the
codecs module?

Greg Aumann

The following code illustrates the problem:
------------------------------------------------------------------------
"""Check readline iterator using a codec."""

import codecs

fname = 'tmp.txt'
f = file(fname, 'w')
for i in range(0x82, 0x8c):
    f.write( '%x, %s\n' % (i, chr(i)))
f.close()

def test_iter():
    print '\ntesting codec iterator.'
    f = codecs.open(fname, 'r', 'cp1252')
    for line in f:
        l = line.rstrip()
        print repr(l)
        print repr(l.decode('cp1252'))
    f.close()

def test_readline():
    print '\ntesting codec readline.'
    f = codecs.open(fname, 'r', 'cp1252')
    while 1:
        line = f.readline()
        if not line:
            break
        l = line.rstrip()
        print repr(l)
        try:
            print repr(l.decode('cp1252'))
        except AttributeError, msg:
            print 'AttributeError', msg
    f.close()

test_iter()
test_readline()
------------------------------------------------------------------------
This code gives the following output:
------------------------------------------------------------------------
testing codec iterator.
'82, \x82'
u'82, \u201a'
'83, \x83'
u'83, \u0192'
'84, \x84'
u'84, \u201e'
'85, \x85'
u'85, \u2026'
'86, \x86'
u'86, \u2020'
'87, \x87'
u'87, \u2021'
'88, \x88'
u'88, \u02c6'
'89, \x89'
u'89, \u2030'
'8a, \x8a'
u'8a, \u0160'
'8b, \x8b'
u'8b, \u2039'

testing codec readline.
u'82, \u201a'
AttributeError 'unicode' object has no attribute 'decode'
u'83, \u0192'
AttributeError 'unicode' object has no attribute 'decode'
u'84, \u201e'
AttributeError 'unicode' object has no attribute 'decode'
u'85, \u2026'
AttributeError 'unicode' object has no attribute 'decode'
u'86, \u2020'
AttributeError 'unicode' object has no attribute 'decode'
u'87, \u2021'
AttributeError 'unicode' object has no attribute 'decode'
u'88, \u02c6'
AttributeError 'unicode' object has no attribute 'decode'
u'89, \u2030'
AttributeError 'unicode' object has no attribute 'decode'
u'8a, \u0160'
AttributeError 'unicode' object has no attribute 'decode'
u'8b, \u2039'
AttributeError 'unicode' object has no attribute 'decode'
------------------------------------------------------------------------

----------------------------------------------------------------------

>Comment By: Facundo Batista (facundobatista)
Date: 2005-01-15 14:38

Message:
Logged In: YES 
user_id=752496

I cannot test it so far; all I get is:

testing codec iterator.
u'82, \u201a'

Traceback (most recent call last):
  ...
  File "C:\Python24\lib\encodings\cp1252.py", line 22, in decode
    return codecs.charmap_decode(input,errors,decoding_map)
UnicodeEncodeError: 'ascii' codec can't encode character
u'\u201a' in position 4: ordinal not in range(128)

I'm on Win2k, sp2, with Py2.4

----------------------------------------------------------------------

Comment By: Facundo Batista (facundobatista)
Date: 2005-01-15 14:38

Message:
Logged In: YES 
user_id=752496

Please, could you verify if this problem persists in Python 2.3.4
or 2.4?

If yes, in which version? Can you provide a test case?

If the problem is solved, from which version?

Note that if you fail to answer in one month, I'll close this bug
as "Won't fix".

Thank you! 

.    Facundo

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-03-20 06:35

Message:
Logged In: YES 
user_id=38388

That's a bug in the iterator support which was added
to the codecs module: the .next() methods should not
call the .next() methods on the reader directly, but instead
redirect to the .readline() method.
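
The redirect described above can be sketched roughly as follows. This
is only an illustration, not the actual codecs module source: the
DecodingReader class and the read_both_ways() helper are hypothetical
names, the code is written for Python 3, and it assumes a single-byte
encoding such as cp1252 so that each raw line can be decoded on its own.

```python
import io


class DecodingReader:
    """Hypothetical stand-in for a codec stream reader (not codecs.StreamReader)."""

    def __init__(self, stream, encoding):
        self.stream = stream        # underlying byte stream
        self.encoding = encoding

    def readline(self):
        # Decode each raw line before handing it back to the caller.
        # Assumption: a single-byte encoding, so a raw line decodes cleanly.
        return self.stream.readline().decode(self.encoding)

    def __iter__(self):
        return self

    def __next__(self):
        # Redirect to our own readline() instead of the underlying
        # stream's iterator, so iteration yields decoded text too.
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    next = __next__  # Python 2 spelling of the same method


def read_both_ways(data, encoding="cp1252"):
    """Read the same bytes with the iterator idiom and the readline loop."""
    iterated = list(DecodingReader(io.BytesIO(data), encoding))

    reader = DecodingReader(io.BytesIO(data), encoding)
    looped = []
    while True:
        line = reader.readline()
        if not line:
            break
        looped.append(line)
    return iterated, looped
```

With this arrangement both idioms return the same decoded lines, which
is the behaviour the original report expected from "for line in file:".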

----------------------------------------------------------------------
