M.-A. Lemburg wrote:
Nick Coghlan wrote:
Antoine Pitrou wrote:
M.-A. Lemburg <mal <at> egenix.com> writes:
Please file a bug report for this. f.readlines() (or rather the io layer) should be using Py_UNICODE_ISLINEBREAK(ch) for detecting line break characters.
Actually, no. It has been designed from the start to only recognize the "standard" line break representations found in common formats/protocols (CR, LF and CR+LF). People wanting to split on arbitrary unicode line breaks should use str.splitlines().
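To make the difference concrete, here is a minimal sketch comparing the two behaviours on an in-memory stream. It assumes Python 3 and uses io.StringIO as a stand-in for a text file opened with the built-in open(); the variable names are only illustrative.

```python
import io

# Sample text with an LF, a U+2029 PARAGRAPH SEPARATOR, and a final LF.
text = "a\nb\u2029c\n"

# The io layer only treats CR, LF and CR+LF as line terminators,
# so U+2029 stays inside the second line.
via_io = io.StringIO(text).readlines()

# str.splitlines() splits on all Unicode line boundaries, including U+2029.
via_str = text.splitlines(keepends=True)

print(via_io)
print(via_str)
```

The first list should have two entries and the second three, since only str.splitlines() treats U+2029 as a line boundary.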
The fairly long-standing RFE relating to an arbitrarily selectable newline separator seems relevant here: http://bugs.python.org/issue1152248
As with the discussion there, the problem with using str.splitlines is that it prevents pipelining approaches that avoid reading a whole file into memory.
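For what it's worth, a pipelined variant can be approximated in pure Python by reading fixed-size chunks and holding back the last, possibly incomplete, piece. The sketch below is only an illustration under that assumption: iter_unicode_lines and chunk_size are hypothetical names, not part of the io module or of the tracker proposal.

```python
import io

def iter_unicode_lines(stream, chunk_size=8192):
    """Yield lines split on every str.splitlines() boundary, chunk by chunk.

    Hypothetical helper: 'stream' is any text stream with a read(n) method.
    """
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.splitlines(keepends=True)
        # Hold back the last piece: it may lack its terminator, or end in a
        # lone "\r" that is the first half of a "\r\n" pair.
        pending = lines.pop() if lines else ""
        yield from lines
    if pending:
        # Whatever is left at end of input is the final line.
        yield pending

# Usage: a tiny chunk size forces terminators to straddle chunk boundaries.
sample = "a\nb\u2029c\r\nd"
result = list(iter_unicode_lines(io.StringIO(sample), chunk_size=3))
print(result)
```

Memory use is bounded by the chunk size plus the longest line, so a file without any line breaks would still end up in memory in one piece.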
While removing the validity check from readlines() completely is questionable (the readrecords() approach mentioned in the tracker issue would still be better there), loosening the validity check to be based on Py_UNICODE_ISLINEBREAK seems a bit more feasible. (I'd still call it a feature request rather than a bug, though.)
I've had a look at the io implementation: it appears to be based on the universal newline support idea, which addresses only a fixed set of "new line" character combinations and is not as straightforward to extend to all Unicode line break characters as I thought.
What I don't understand is why the io layer tries to reinvent the wheel here instead of just using the codec's .readline() method - which *does* use .splitlines() and has full support for all Unicode line break characters (including the CRLF combination).
... and because of this, the feature is already available if you use codecs.open() instead of the built-in open():
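    import codecs

    # Write a file containing an LF and a U+2029 PARAGRAPH SEPARATOR.
    with codecs.open("x.txt", "w", encoding='utf-8') as f:
        f.write("a\nb\u2029c\n")

    # Read it back; the codec's readlines() splits on U+2029 as well.
    with codecs.open("x.txt", "r", encoding='utf-8') as f:
        n = 1
        for l in f.readlines():
            print(n, repr(l))
            n += 1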
    1 'a\n'
    2 'b\u2029'
    3 'c\n'