[Tutor] Python 3.2: processing text files in binary mode, because I want to remove carriage returns and line feeds...

Peter Otten __peter__ at web.de
Thu Aug 23 18:16:40 CEST 2012


Flynn, Stephen (L & P - IT) wrote:

> Python 3.2, as in the subject, although I also have 2.7 on this machine
> too.
> 
> 
> 
> I have some data which contains text separated with field delimiters
> (|~) and a record terminator (||)
> 
> 123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~some
> text.|~xxxxxxx, xxxxx|~||
> 123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~some
> more text|~xxxxx, xxxxxxa|~||
> 123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:03
> |~Some Split text over
>  serveral
>  lines
> |~Aldxxxxe, Mxxxx|~||
> 123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:53
> |~Yet more text:
>  which has been split up with
>  carriage returns, line feeds or possibly both, depending upon your
> operating system.
> |~Imxxx, xxxxxxed|~||
> 
> 
> I'm trying to reformat this data so that each record terminated with a
> "||" is on a single line, as in
> 
> 123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~some
> text.|~xxxxxxx, xxxxx|~||
> 123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~some
> more text|~xxxxx, xxxxxxa|~||
> 123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:03
> |~Some Split text over serveral lines|~Aldxxxxe, Mxxxx|~||
> 123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:53
> |~Yet more text:  which has been split up with carriage returns, line
> feeds or possibly both, depending upon your operating system.|~Imxxx,
> xxxxxxed|~||
> 
> 
> 
> I've written the following code as a first attempt:
> 
> ifile=r"C:\Documents and Settings\flynns\Desktop\sample-DCLTBCNTH.txt"
> ofile=r"C:\Documents and Settings\flynns\Desktop\output-DCLTBCNTH.txt"
> 
> f=open(ifile, mode="rb")
> out=open(ofile, mode="w")
> line=f.readline()
> 
> while (line) :
>     if '||' in str(line):
>         print(str(line), file=out)
>     else:
>         print(str(line), end='', file=out)
>     line=f.readline()
> 
> if __name__ == '__main__':
>     pass
> 
> 
> The code attempts to read each line of the input file, and if it
> contains a "||", print the line to an output file. If it doesn't contain
> a "||" it emits the record without any carriage returns or line feeds
> and grabs another line from the input file.
> 
> Whilst the "logic" seems to be working the output file I get out looks
> like this:
> 
> b'123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~som
> e text.|~xxxxxxx, xxxxx|~||\r\n'
> b'123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~som
> e more text|~xxxxx, xxxxxxa|~||\r\n'
> b'123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:
> 03|~Some Split test over\r\n'b' serveral\r\n'b' lines\r\n'b'|~Aldxxxxe,
> Mxxxx|~||\r\n'
> b'123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:
> 53|~Yet more text: \r\n'b' which has been split up with\r\n'b' carriage
> returns, line feeds or possibly both, depending upon your operating
> system.\r\n'b'|~Imxxx, xxxxxxed|~||\r\n'
> 
> This makes sense to me as I'm writing the file out in text mode and the
> \r and \n in the input stream are being interpreted as simple text.
> 
> However, if I try to write the file out in binary mode, I get a
> traceback:
> 
> Traceback (most recent call last):
>   File "C:\Documents and
> Settings\flynns\workspace\joinlines\joinlines\joinlines.py", line 10, in
> <module>
>     print(str(line), file=out)
> TypeError: 'str' does not support the buffer interface
> 
> 
> Is there a method of writing out a binary mode file via print() and
> making use of the end keyword?
> 
> 
> If there's not, I presume I'll need to remove the \r\n from "line" in my
> else: section and push the amended data out via an out.write(line). How
> does one amend bytes in a "line" object?

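In binary mode the lines you read are bytes, not str objects. Calling 
str() on a bytes object gives you its repr, which is where the b'...' 
wrappers and the literal \r\n in your output file come from. To convert 
from bytes to str, use the decode method. Compare: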

>>> line = b"whatever"
>>> print(str(line))
b'whatever'
>>> print(line.decode())
whatever
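
If you do want to stay in binary mode, bytes objects offer most of the 
str methods (rstrip, endswith, replace and so on), as long as the 
arguments are bytes too. A minimal sketch, assuming the record layout 
from the question; join_records_binary is a made-up helper name, and 
the in-memory streams only stand in for files opened with "rb" and "wb":

```python
import io


def join_records_binary(instream, outstream):
    """Copy binary lines, dropping line breaks that do not end a record."""
    for line in instream:
        # rstrip(b"\r\n") removes any trailing CR and LF bytes.
        stripped = line.rstrip(b"\r\n")
        if stripped.endswith(b"||"):
            # Record terminator found: end the output line here.
            outstream.write(stripped + b"\n")
        else:
            # Record continues on the next input line.
            outstream.write(stripped)


# Hypothetical sample data with Windows line endings.
source = io.BytesIO(
    b"1|~Some text\r\n over\r\n lines\r\n|~name|~||\r\n2|~x|~||\r\n"
)
target = io.BytesIO()
join_records_binary(source, target)
result = target.getvalue()
```

Note that print() always writes str, so it cannot be pointed at a file 
opened in binary mode; use the write method of the file instead.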

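However, I don't see why you have to open your file in binary mode. In 
text mode Python translates "\r\n" to "\n" while reading, so you only 
have to deal with a single newline character. Something like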

# ifile and ofile are the paths defined in your script
with open(ifile) as instream:
    with open(ofile, "w") as outstream:
        for line in instream:
            # keep the newline only where a record ends
            if not line.endswith("||\n"):
                line = line.rstrip("\n")
            outstream.write(line)


should do the right thing.
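
To try the text-mode approach without touching your real data, here is 
a self-contained sketch. The join_records name, the file names and the 
sample record are made up for illustration; the sample file is written 
in binary mode only to make sure it contains "\r\n" line endings on 
every platform:

```python
import os
import tempfile


def join_records(in_path, out_path):
    """Write each '||'-terminated record of in_path on a single line."""
    with open(in_path) as instream, open(out_path, "w") as outstream:
        for line in instream:
            # Text mode has already turned "\r\n" into "\n" here.
            if not line.endswith("||\n"):
                line = line.rstrip("\n")
            outstream.write(line)


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "sample.txt")
    out_path = os.path.join(tmp, "output.txt")

    # Hypothetical sample: one record split over three lines, one intact.
    with open(in_path, "wb") as f:
        f.write(
            b"1|~Yet more text:\r\n which has been split\r\n|~name|~||\r\n"
            b"2|~x|~||\r\n"
        )

    join_records(in_path, out_path)

    with open(out_path) as f:
        joined = f.read()
```

After this runs, joined holds two lines, one per record. On Windows the 
output file itself gets "\r\n" line endings again, because text mode 
translates "\n" back when writing.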


