[Python-Dev] New lines, carriage returns, and Windows
p.f.moore at gmail.com
Fri Sep 28 20:28:10 CEST 2007
On 26/09/2007, Dino Viehland <dinov at exchange.microsoft.com> wrote:
> My understanding is that users can write code that uses only \n and Python will write the
> end-of-line character(s) that are appropriate for the platform when writing to a file. That's
> what I meant by uses \n for everything internally.
OK, so far so good - although I'm not *quite* sure there's a
self-consistent definition of "code that only uses \n". I'll assume
you mean code that has a concept of lines, that lines never contain
anything other than text (specifically, neither \r nor \n can appear in
a line; I'll punt on whether other weird stuff like form feed is
legal), and that whenever your code needs to write data to a file, it
writes lines with \n alone between them.
> But if you write \r\n to a file Python completely ignores the presence of the \r and
> transforms the \n into a \r\n anyway, hence the \r\r in the resulting stream. My last
> question is simply does anyone find writing \r\r\n when the original string contained \r\n a
> useful behavior - personally I don't see how it is.
In the above model, lines can't contain \r and between lines you only
ever write \n - so where did the \r\n come from?
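To make the \r\r\n effect concrete, here is a minimal sketch. It assumes a later Python whose io module accepts a newline argument (Python 3), and uses newline="\r\n" to stand in for Windows text mode so that it behaves the same on any platform. The helper name written_bytes is made up for illustration.

```python
import io

def written_bytes(text, newline):
    """Return the bytes a text-mode stream produces when `text` is written."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # done with the wrapper; leave the buffer alone
    return data

# Lines joined with \n alone: each \n becomes \r\n, as intended.
clean = written_bytes("one\ntwo\n", "\r\n")
print(clean)    # b'one\r\ntwo\r\n'

# Text that already contains \r\n: the \r passes through untouched and
# the \n is still translated, so each line ends in \r\r\n.
doubled = written_bytes("one\r\ntwo\r\n", "\r\n")
print(doubled)  # b'one\r\r\ntwo\r\r\n'
```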
If you receive what you think are lines from an outside source, and
they contain \r, then you didn't sanity check your data.
If you receive a block of raw (effectively binary!) data which you
want to translate into your model, it's up to you how you cut it up.
If you read data using one of Python's text modes, it's up to you to
understand how it works.
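The sanity check mentioned above could look something like the following sketch. The function name check_lines and the choice to raise ValueError are assumptions for illustration, not anything Python provides.

```python
def check_lines(lines):
    """Return `lines` as a list, or raise ValueError if any item breaks the
    model: a line may contain neither \\r nor \\n."""
    lines = list(lines)  # accept any iterable, including generators
    for number, line in enumerate(lines, start=1):
        if "\r" in line or "\n" in line:
            raise ValueError(
                "item %d contains a line terminator: %r" % (number, line)
            )
    return lines

# Data that passed the check can safely be joined with \n alone.
text = "\n".join(check_lines(["first", "second"])) + "\n"
```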
> But Guido's response makes this sound like it's a problem w/ VC++ stdio implementation
> and not something that Python is explicitly doing.
I'm not sure it's a CRT issue. Certainly the \r\n vs \n confusion
comes from the CRT - the underlying OS (just like Unix!!!!) only deals
in files as streams of bytes. But ultimately, "lines" are an
abstraction in your code. All the CRT (and Python) do is help (or
maybe hinder) you with the "normal" cases.
> Anyway, it might be useful to have a text-mode file that you can write \r\n to and only
> get \r\n in the resulting file.
I can't comment on that, other than to say that if you better defined
the semantic model (lines, how things are encoded/decoded to files,
etc, somewhat like I tried to above) it would be more obvious what use
case this was trying to address.
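For one possible reading of that use case (write exactly the characters you are given, with no translation), the open() function in later Python versions (Python 3, so an assumption relative to this thread) accepts newline="" to switch translation off. A minimal sketch, with a throwaway temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "crlf.txt")

# newline="" disables newline translation on write, so \r\n stays \r\n
# on every platform instead of becoming \r\r\n on Windows.
with open(path, "w", encoding="ascii", newline="") as f:
    f.write("one\r\ntwo\r\n")

# Read the file back in binary mode to see exactly what was stored.
with open(path, "rb") as f:
    raw = f.read()

print(raw)  # b'one\r\ntwo\r\n'
```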
> But if the general sentiment is s.replace('\r', '') is the way to go we can advise our users
> of the behavior when interoperating w/ APIs that return \r\n in strings.
I'd say users of the relevant APIs need to understand how the APIs
represent "lines", so that they can convert the received data to their
program's model of lines. Of course, that probably corresponds to
something like s.replace('\r','') or, likely more correctly,
data_lines = s.split('\r\n'). A "rule of thumb" that doesn't make it clear that
the concept of "line" has 2 different binary representations in 2
different areas (data back from APIs vs data from files) is likely to
ultimately lead to mistakes and confusion.
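The difference between the two conversions shows up as soon as the data contains a \r that is not part of a \r\n pair. A small sketch, with made-up sample data:

```python
api_text = "first\r\nsec\rond\r\nthird"

# Cheap fix: delete every \r, including the stray one inside "sec\rond".
cheap_lines = api_text.replace("\r", "").split("\n")
print(cheap_lines)  # ['first', 'second', 'third']

# Stricter: treat only the \r\n pair as a line separator, so the stray
# \r survives and can be caught by a later sanity check.
data_lines = api_text.split("\r\n")
print(data_lines)   # ['first', 'sec\rond', 'third']
```

Note that if the data ends with a terminator, split() leaves an empty string as the last item ("a\r\n".split("\r\n") gives ['a', '']), so the caller has to decide whether \r\n separates lines or terminates them.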
If you think this is bad, wait until you have to deal with Unicode
issues like what *encoding* the data is being supplied to you in.
Makes guessing newline conventions seem simple (at least to this
parochial English-speaker :-)). Although as this is IronPython, you may
already have that covered...
PS In real life, you often just want a cheap and cheerful answer. For
that, "strip out spurious \r characters" may be fine.