[New-bugs-announce] [issue18291] codecs.open interprets space as line ends

Paul report at bugs.python.org
Mon Jun 24 15:11:12 CEST 2013


New submission from Paul:

I hope I am writing in the right place.

When using codecs.open with UTF-8 encoding, it seems the characters \x1c, \x1d, and \x1e are interpreted as end-of-line.

Example code:

>>> with open('unicodetest.txt', 'w') as f:
...   f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
...
>>> with open('unicodetest.txt', 'r') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1cb\x1dc\x1ed\x1fe

The point here is that the built-in open reads the file as a single line, as I would expect. With codecs.open and UTF-8 encoding, however, the same file is read as several lines:

>>> import codecs
>>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1c
1 b\x1d
2 c\x1e
3 d\x1fe

The characters \x1c through \x1f are described as "Information Separator Four" through "One" (in that order). As far as I can see they never mark line ends. Interestingly, \x1f is not treated as a line end, although the other three are.
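
A possible explanation, offered here as an assumption rather than something established in this report, is that the decoded text is split with the unicode splitlines() method, which treats \x1c, \x1d and \x1e as line boundaries, whereas splitting a byte string only breaks on \n and \r. A minimal sketch to check the splitting behaviour in isolation (the variable names are illustrative):

```python
# Sketch: compare text-level and byte-level splitlines() on the four
# information separators used in the report (chr(28) .. chr(31)).
text = u"a\x1cb\x1dc\x1ed\x1fe"
data = text.encode("utf-8")

# keepends=True, so the separator stays attached to each piece
text_lines = text.splitlines(True)
byte_lines = data.splitlines(True)

print(len(text_lines), text_lines)
print(len(byte_lines), byte_lines)
```

If the text-level split yields several pieces while the byte-level split yields one, that would match the difference seen between codecs.open and the built-in open.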

As a side note, I tested and verified that io.open behaves correctly (although when reading large amounts of data it appears to be about 5 times slower than codecs):

>>> import io
>>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1cb\x1dc\x1ed\x1fe
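
For anyone wanting to reproduce this, the sessions above can be combined into one self-contained script. This is only a sketch: it is written to run on both Python 2.6+ and Python 3, and the temporary file and the variable names are illustrative, not part of the original report.

```python
import codecs
import io
import os
import tempfile

# Same content as in the report: 'a'..'e' separated by chr(28)..chr(31)
text = u"a\x1cb\x1dc\x1ed\x1fe"

# Use a throwaway file instead of 'unicodetest.txt'
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Iterate line by line with codecs.open
    with codecs.open(path, "r", "utf-8") as f:
        codecs_lines = list(f)

    # Iterate line by line with io.open
    with io.open(path, encoding="utf-8") as f:
        io_lines = list(f)
finally:
    os.remove(path)

print("codecs.open:", len(codecs_lines), "line(s)")
print("io.open:    ", len(io_lines), "line(s)")
```

Comparing the two counts shows whether the two readers agree on where lines end for this input.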

----------
components: IO, Unicode
messages: 191758
nosy: ezio.melotti, wpk
priority: normal
severity: normal
status: open
title: codecs.open interprets space as line ends
type: behavior
versions: Python 2.6, Python 2.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18291>
_______________________________________

