Re: [Python-Dev] PEP 385: the eol-type issue

Aug. 6, 2009

M.-A. Lemburg wrote:
...
Nick Coghlan wrote:
...
Antoine Pitrou wrote:
...
M.-A. Lemburg <mal <at> egenix.com> writes:
...
Please file a bug report for this. f.readlines() (or rather
the io layer) should be using Py_UNICODE_ISLINEBREAK(ch)
for detecting line break characters.
Actually, no. It has been designed from the start to only recognize the
"standard" line break representations found in common formats/protocols (CR, LF
and CR+LF).
People wanting to split on arbitrary unicode line breaks should use
str.splitlines().
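To illustrate the difference, here is a minimal sketch. The sample string is made up for the example, and io.StringIO stands in for a real file opened with newline="":

```python
import io

# Hypothetical sample: LF, U+2029 (PARAGRAPH SEPARATOR), CR+LF and LF.
text = "a\nb\u2029c\r\nd\n"

# The io layer only recognizes CR, LF and CR+LF as line ends.
# newline="" keeps the line endings untranslated.
io_lines = io.StringIO(text, newline="").readlines()

# str.splitlines() also breaks on Unicode line breaks such as U+2029.
str_lines = text.splitlines(keepends=True)

print(io_lines)   # ['a\n', 'b\u2029c\r\n', 'd\n']
print(str_lines)  # ['a\n', 'b\u2029', 'c\r\n', 'd\n']
```

Note that U+2029 stays inside the second line in the io case, while str.splitlines() treats it as a break.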
The fairly long-standing RFE relating to an arbitrarily selectable
newline separator seems relevant here:
http://bugs.python.org/issue1152248
As with the discussion there, the problem with using str.splitlines is
that it prevents pipelining approaches that avoid reading a whole file
into memory.
While removing the validity check from readlines() completely is
questionable (the readrecords() approach mentioned in the tracker issue
would still be better there), loosening the validity check to be based
on Py_UNICODE_ISLINEBREAK seems a bit more feasible. (I'd still call it
a feature request rather than a bug though).
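For what it's worth, the pipelining concern can be worked around in pure Python today. The following is only a rough sketch, not anything from the io module: iter_unicode_lines and its chunk_size parameter are hypothetical names invented here, and the stream is assumed to be a text stream opened with newline="" so that CR+LF is not translated before we see it.

```python
import io

def iter_unicode_lines(stream, chunk_size=4096):
    """Lazily yield lines split wherever str.splitlines() would split.

    Hypothetical helper: reads chunk_size characters at a time, so only
    one chunk plus the current partial line is held in memory.
    """
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.splitlines(keepends=True)
        # Hold back the last piece: it may be an incomplete line, or end
        # in a "\r" whose matching "\n" arrives with the next chunk.
        pending = lines.pop()
        yield from lines
    if pending:
        yield from pending.splitlines(keepends=True)

# Tiny chunk size to force a CR+LF pair to straddle a chunk boundary.
sample = "a\nb\u2029c\r\nd"
result = list(iter_unicode_lines(io.StringIO(sample, newline=""), chunk_size=2))
print(result)  # ['a\n', 'b\u2029', 'c\r\n', 'd']
```

The last piece is always held back until more data arrives, which costs a little latency but keeps CR+LF from being split in two at a chunk boundary.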
I've had a look at the io implementation: this appears to be
based on the universal newline support idea which addresses
only a fixed set of "new line" character combinations and is
not as straightforward to extend to support all Unicode
line break characters as I thought.
What I don't understand is why the io layer tries to reinvent
the wheel here instead of just using the codec's .readline()
method - which *does* use .splitlines() and has full support
for all Unicode line break characters (including the CRLF
combination).
... and because of this, the feature is already available if
you use codecs.open() instead of the built-in open():

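import codecs

# Write a file containing LF and U+2029 (PARAGRAPH SEPARATOR) line breaks.
with codecs.open("x.txt", "w", encoding="utf-8") as f:
    f.write("a\nb\u2029c\n")

# The codec's reader splits lines via .splitlines(), so U+2029 counts too.
with codecs.open("x.txt", "r", encoding="utf-8") as f:
    for n, line in enumerate(f.readlines(), 1):
        print(n, repr(line))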

This prints:

1 'a\n'
2 'b\u2029'
3 'c\n'

-- 
Marc-Andre Lemburg
eGenix.com