[ python-Bugs-706595 ] codecs.open and iterators
SourceForge.net
noreply at sourceforge.net
Sat Jan 15 18:38:18 CET 2005
Bugs item #706595, was opened at 2003-03-19 20:02
Message generated for change (Comment added) made by facundobatista
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=706595&group_id=5470
Category: Python Library
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Todd Reed (toddreed)
Assigned to: M.-A. Lemburg (lemburg)
Summary: codecs.open and iterators
Initial Comment:
Greg Aumann originally posted this problem in comp.lang.python on
Nov 4, 2002, but I could not find a bug report. I've simply copied
his news post, which explains the problem:
-----------
Recently I figured out how to use iterators and generators. Quite
easy to use and a great improvement.
But when I refactored some of my code I came across a discrepancy
that seems like it must be a bug. If you use the old file reading
idiom with a codec, the lines are converted to unicode, but if you
use the new iterator idiom, they retain the original encoding and
each line is returned as a non-unicode string. Surely using the new
"for line in file:" idiom should give the same result as the old
"while 1: ...." idiom.
I came across this when using the pythonzh Chinese codecs, but the
code below uses the cp1252 encoding to illustrate the problem because
everyone should have that codec. The symptoms are the same with both
codecs.
I am using python 2.2.2 on win2k.
Is this definitely a bug, or is it an undocumented 'feature' of the
codecs module?
Greg Aumann
The following code illustrates the problem:
------------------------------------------------------------------------
"""Check readline iterator using a codec."""
import codecs

fname = 'tmp.txt'
f = file(fname, 'w')
for i in range(0x82, 0x8c):
    f.write('%x, %s\n' % (i, chr(i)))
f.close()

def test_iter():
    print '\ntesting codec iterator.'
    f = codecs.open(fname, 'r', 'cp1252')
    for line in f:
        l = line.rstrip()
        print repr(l)
        print repr(l.decode('cp1252'))
    f.close()

def test_readline():
    print '\ntesting codec readline.'
    f = codecs.open(fname, 'r', 'cp1252')
    while 1:
        line = f.readline()
        if not line:
            break
        l = line.rstrip()
        print repr(l)
        try:
            print repr(l.decode('cp1252'))
        except AttributeError, msg:
            print 'AttributeError', msg
    f.close()

test_iter()
test_readline()
------------------------------------------------------------------------
This code gives the following output:
------------------------------------------------------------------------
testing codec iterator.
'82, \x82'
u'82, \u201a'
'83, \x83'
u'83, \u0192'
'84, \x84'
u'84, \u201e'
'85, \x85'
u'85, \u2026'
'86, \x86'
u'86, \u2020'
'87, \x87'
u'87, \u2021'
'88, \x88'
u'88, \u02c6'
'89, \x89'
u'89, \u2030'
'8a, \x8a'
u'8a, \u0160'
'8b, \x8b'
u'8b, \u2039'
testing codec readline.
u'82, \u201a'
AttributeError 'unicode' object has no attribute 'decode'
u'83, \u0192'
AttributeError 'unicode' object has no attribute 'decode'
u'84, \u201e'
AttributeError 'unicode' object has no attribute 'decode'
u'85, \u2026'
AttributeError 'unicode' object has no attribute 'decode'
u'86, \u2020'
AttributeError 'unicode' object has no attribute 'decode'
u'87, \u2021'
AttributeError 'unicode' object has no attribute 'decode'
u'88, \u02c6'
AttributeError 'unicode' object has no attribute 'decode'
u'89, \u2030'
AttributeError 'unicode' object has no attribute 'decode'
u'8a, \u0160'
AttributeError 'unicode' object has no attribute 'decode'
u'8b, \u2039'
AttributeError 'unicode' object has no attribute 'decode'
------------------------------------------------------------------------
----------------------------------------------------------------------
>Comment By: Facundo Batista (facundobatista)
Date: 2005-01-15 14:38
Message:
Logged In: YES
user_id=752496
I cannot test it so far; all I get is:
testing codec iterator.
u'82, \u201a'
Traceback (most recent call last):
...
File "C:\Python24\lib\encodings\cp1252.py", line 22, in decode
return codecs.charmap_decode(input,errors,decoding_map)
UnicodeEncodeError: 'ascii' codec can't encode character
u'\u201a' in position 4: ordinal not in range(128)
I'm on Win2k, sp2, with Py2.4
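A note on the traceback above: the first printed line is already a
unicode string, which suggests that iteration decodes correctly under
Py2.4 and that the failure comes from the test script's own
l.decode('cp1252') call. As far as I can tell, calling .decode() on a
unicode object in Python 2 first encodes it with the default ASCII
codec, and that step is what fails. The sketch below only reproduces
the ASCII encoding failure, written in Python 3 syntax (an assumption,
so that it runs on a current interpreter); it does not use
codecs.open.

```python
# The decoded line from the report: '82, ' followed by U+201A.
line = "82, \u201a"

try:
    # Stand-in for the implicit ASCII encode described above.
    line.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

if error is not None:
    # Reports the codec name and the index of the offending character.
    print(error.encoding, error.start)
```

The offending character sits at index 4, which matches "position 4"
in the traceback.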
----------------------------------------------------------------------
Comment By: Facundo Batista (facundobatista)
Date: 2005-01-15 14:38
Message:
Logged In: YES
user_id=752496
Please, could you verify if this problem persists in Python 2.3.4
or 2.4?
If yes, in which version? Can you provide a test case?
If the problem is solved, from which version?
Note that if you fail to answer in one month, I'll close this bug
as "Won't fix".
Thank you!
. Facundo
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2003-03-20 06:35
Message:
Logged In: YES
user_id=38388
That's a bug in the iterator support which was added
to the codecs module: the .next() methods should not
call the .next() methods on the reader directly, but instead
redirect to the .readline() method.
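The redirect described above can be sketched roughly as follows. This
is not the actual codecs module source: LineReader is a hypothetical
stand-in for a codec stream reader, it is written in Python 3 syntax
(where the iterator method is spelled __next__ rather than .next()),
and decoding one raw line at a time is a simplification that only
suits single-byte encodings such as cp1252.

```python
import io


class LineReader:
    """Hypothetical stand-in for a codec stream reader."""

    def __init__(self, stream, encoding):
        self.stream = stream      # underlying byte stream
        self.encoding = encoding

    def readline(self):
        # Decode each raw line before handing it back.
        return self.stream.readline().decode(self.encoding)

    def __iter__(self):
        return self

    def __next__(self):
        # Route iteration through readline() so that both idioms decode.
        # Calling next(self.stream) here instead would return raw bytes,
        # which is the behaviour reported in this bug.
        line = self.readline()
        if not line:
            raise StopIteration
        return line


raw = io.BytesIO(b"82, \x82\n83, \x83\n")
lines = [line.rstrip() for line in LineReader(raw, "cp1252")]
print(lines)
```

With this arrangement, "for line in f:" and the "while 1:" loop around
f.readline() go through the same decoding path.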
----------------------------------------------------------------------