[Tutor] Python 3.2: processing text files in binary mode, because I want to remove carriage returns and line feeds...

Flynn, Stephen (L & P - IT) Steve.Flynn at capita.co.uk
Thu Aug 23 16:42:16 CEST 2012


I'm using Python 3.2, as in the subject, although I also have 2.7 on
this machine.



I have some data which contains text separated by field delimiters
(|~), with each record ended by a record terminator (||):

123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~some text.|~xxxxxxx, xxxxx|~||
123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~some more text|~xxxxx, xxxxxxa|~||
123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:03|~Some Split text over
 serveral
 lines
|~Aldxxxxe, Mxxxx|~||
123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:53|~Yet more text: 
 which has been split up with
 carriage returns, line feeds or possibly both, depending upon your operating system.
|~Imxxx, xxxxxxed|~||


I'm trying to reformat this data so that each record terminated with a
"||" is on a single line, as in

123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~some text.|~xxxxxxx, xxxxx|~||
123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~some more text|~xxxxx, xxxxxxa|~||
123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:03|~Some Split text over serveral lines|~Aldxxxxe, Mxxxx|~||
123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:53|~Yet more text:  which has been split up with carriage returns, line feeds or possibly both, depending upon your operating system.|~Imxxx, xxxxxxed|~||



I've written the following code as a first attempt:

ifile=r"C:\Documents and Settings\flynns\Desktop\sample-DCLTBCNTH.txt"
ofile=r"C:\Documents and Settings\flynns\Desktop\output-DCLTBCNTH.txt"

f=open(ifile, mode="rb")
out=open(ofile, mode="w")
line=f.readline()

while (line) :
    if '||' in str(line):
        print(str(line), file=out)
    else:
        print(str(line), end='', file=out)
    line=f.readline()

if __name__ == '__main__':
    pass


The code attempts to read each line of the input file and, if it
contains a "||", print the line to the output file followed by a
newline. If the line doesn't contain a "||", it is meant to be emitted
without any carriage returns or line feeds before the next line is read
from the input file.

Whilst the "logic" seems to be working, the output file I get looks
like this:

b'123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~some text.|~xxxxxxx, xxxxx|~||\r\n'
b'123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~some more text|~xxxxx, xxxxxxa|~||\r\n'
b'123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:03|~Some Split text over\r\n'b' serveral\r\n'b' lines\r\n'b'|~Aldxxxxe, Mxxxx|~||\r\n'
b'123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:53|~Yet more text: \r\n'b' which has been split up with\r\n'b' carriage returns, line feeds or possibly both, depending upon your operating system.\r\n'b'|~Imxxx, xxxxxxed|~||\r\n'

This makes sense to me as I'm writing the file out in text mode and the
\r and \n in the input stream are being interpreted as simple text.
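
A note on the b'...' wrappers: they most likely come from str() itself
rather than from text mode as such. In Python 3, calling str() on a
bytes object returns its printable representation, including the b
prefix, the quotes and the escaped \r\n; decode() is what turns bytes
into real text. A minimal sketch, assuming the data is ASCII (the real
file's encoding isn't stated here):

```python
# One line as readline() would return it from a file opened with mode="rb"
raw = b"some text.|~xxxxx|~||\r\n"

as_repr = str(raw)              # the repr of the bytes object, not its contents
as_text = raw.decode("ascii")   # the actual characters, CR and LF included

print(as_repr)         # b'some text.|~xxxxx|~||\r\n'  (backslashes are literal)
print(repr(as_text))   # 'some text.|~xxxxx|~||\r\n'   (a real CR and LF)
```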

However, if I try to write the file out in binary mode, I get a
traceback:

Traceback (most recent call last):
  File "C:\Documents and Settings\flynns\workspace\joinlines\joinlines\joinlines.py", line 10, in <module>
    print(str(line), file=out)
TypeError: 'str' does not support the buffer interface
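
The traceback follows from print() handing a str to a stream that only
accepts bytes. A small illustration, using io.BytesIO as a stand-in for
a file opened with mode="wb"; note that later Python 3 releases word the
error differently ("a bytes-like object is required, not 'str'"):

```python
import io

binary_out = io.BytesIO()   # stands in for open(ofile, mode="wb")

try:
    # print() converts its arguments to str and passes that to write()
    print("some text", file=binary_out)
    print_failed = False
except TypeError:
    print_failed = True

# A binary stream accepts bytes objects directly
binary_out.write(b"some text\r\n")
```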


Is there a method of writing out a binary mode file via print() and
making use of the end keyword?
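
On the print() question: as far as I know, print() always passes str to
the file's write() method, so it cannot target a file opened in binary
mode directly. One option is to keep the output in text mode and decode
each input line first, which keeps the end keyword usable. A sketch
under those assumptions; join_with_print and the ASCII encoding are
illustrative choices, not something from the original code:

```python
import io

def join_with_print(raw_lines, out, encoding="ascii"):
    """Write each "||"-terminated record on one line using print()."""
    for raw in raw_lines:
        # bytes -> str, then drop the trailing CR/LF characters
        text = raw.decode(encoding).rstrip("\r\n")
        if "||" in text:
            print(text, file=out)            # record complete: end the line
        else:
            print(text, end="", file=out)    # record continues: no line break

# Demo with an in-memory text stream standing in for open(ofile, "w")
demo_out = io.StringIO()
join_with_print([b"1|~split\r\n", b" text|~||\r\n", b"2|~ok|~||\r\n"], demo_out)
print(repr(demo_out.getvalue()))   # '1|~split text|~||\n2|~ok|~||\n'
```

With real files, raw_lines can be the input file object opened in "rb"
mode, since iterating over it yields one bytes line at a time.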


If there's not, I presume I'll need to remove the \r\n from "line" in my
else: section and push the amended data out via an out.write(line). How
does one amend bytes in a "line" object?
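
One possible approach to the bytes question: bytes objects are
immutable, but they offer str-like methods such as rstrip() and
replace() that return a new bytes object when given bytes arguments. A
minimal sketch staying in binary mode throughout; join_records and its
parameters are hypothetical names, and the b"\r\n" line ending is an
assumption based on the sample output above:

```python
def join_records(src_path, dst_path, terminator=b"||", newline=b"\r\n"):
    """Copy src to dst, keeping a line break only after terminated records."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:                      # yields bytes, split on b"\n"
            stripped = line.rstrip(b"\r\n")   # new bytes object without CR/LF
            dst.write(stripped)
            if terminator in stripped:        # record complete: end the line
                dst.write(newline)
```

It would be called as join_records(ifile, ofile). The check mirrors the
original logic, so a "||" appearing inside a text field would also end
the line.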



Steve Flynn


