[Tutor] Python 3.2: processing text files in binary mode, because I want to remove carriage returns and line feeds...
Peter Otten
__peter__ at web.de
Thu Aug 23 18:16:40 CEST 2012
Flynn, Stephen (L & P - IT) wrote:
> Python 3.2, as in the subject, although I also have 2.7 on this machine
> too.
>
>
>
> I have some data which contains text separated with field delimiters
> (|~) and a record terminator (||)
>
> 123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~some
> text.|~xxxxxxx, xxxxx|~||
> 123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~some
> more text|~xxxxx, xxxxxxa|~||
> 123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:03
> |~Some Split text over
> serveral
> lines
> |~Aldxxxxe, Mxxxx|~||
> 123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:53
> |~Yet more text:
> which has been split up with
> carriage returns, line feeds or possibly both, depending upon your
> operating system.
> |~Imxxx, xxxxxxed|~||
>
>
> I'm trying to reformat this data so that each record terminated with a
> "||" is on a single line, as in
>
> 123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~some
> text.|~xxxxxxx, xxxxx|~||
> 123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~some
> more text|~xxxxx, xxxxxxa|~||
> 123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:03
> |~Some Split text over serveral lines|~Aldxxxxe, Mxxxx|~||
> 123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:53
> |~Yet more text: which has been split up with carriage returns, line
> feeds or possibly both, depending upon your operating system.|~Imxxx,
> xxxxxxed|~||
>
>
>
> I've written the following code as a first attempt:
>
> ifile=r"C:\Documents and Settings\flynns\Desktop\sample-DCLTBCNTH.txt"
> ofile=r"C:\Documents and Settings\flynns\Desktop\output-DCLTBCNTH.txt"
>
> f=open(ifile, mode="rb")
> out=open(ofile, mode="w")
> line=f.readline()
>
> while (line) :
>     if '||' in str(line):
>         print(str(line), file=out)
>     else:
>         print(str(line), end='', file=out)
>     line=f.readline()
>
> if __name__ == '__main__':
>     pass
>
>
> The code attempts to read each line of the input file, and if it
> contains a "||", print the line to an output file. If it doesn't contain
> a "||" it emits the record without any carriage returns or line feeds
> and grabs another line from the input file.
>
> Whilst the "logic" seems to be working the output file I get out looks
> like this:
>
> b'123456009999990|~52299999|~9999990|~0|~4|~1|~2006-09-08|~13:29:39|~som
> e text.|~xxxxxxx, xxxxx|~||\r\n'
> b'123456009999991|~52299999|~1999999|~0|~4|~1|~2009-06-05|~15:25:25|~som
> e more text|~xxxxx, xxxxxxa|~||\r\n'
> b'123456009999992|~51199999|~9999998|~8253265|~5|~11|~2011-07-19|~16:55:
> 03|~Some Split test over\r\n'b' serveral\r\n'b' lines\r\n'b'|~Aldxxxxe,
> Mxxxx|~||\r\n'
> b'123456009999993|~59999999|~2999999|~8253265|~5|~11|~2011-07-11|~15:06:
> 53|~Yet more text: \r\n'b' which has been split up with\r\n'b' carriage
> returns, line feeds or possibly both, depending upon your operating
> system.\r\n'b'|~Imxxx, xxxxxxed|~||\r\n'
>
> This makes sense to me as I'm writing the file out in text mode and the
> \r and \n in the input stream are being interpreted as simple text.
>
> However, if I try to write the file out in binary mode, I get a
> traceback:
>
> Traceback (most recent call last):
>   File "C:\Documents and Settings\flynns\workspace\joinlines\joinlines\joinlines.py", line 10, in <module>
>     print(str(line), file=out)
> TypeError: 'str' does not support the buffer interface
>
>
> Is there a method of writing out a binary mode file via print() and
> making use of the end keyword?
>
>
> If there's not, I presume I'll need to remove the \r\n from "line" in my
> else: section and push the amended data out via an out.write(line). How
> does one amend bytes in a "line" object?
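In binary mode the lines you are reading are bytes, not str objects. str(line)
gives you the repr of the bytes object, b'...' prefix and escaped \r\n included,
which is what ended up in your output file. To convert from bytes to str, use
the decode method. Compare: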
>>> line = b"whatever"
>>> print(str(line))
b'whatever'
>>> print(line.decode())
whatever
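As for amending bytes directly: bytes objects offer most of the str methods, as long as the arguments are bytes too. The following is only a minimal sketch; the helper name strip_eol is made up for illustration and the sample line is taken from the data above.

```python
def strip_eol(line: bytes) -> bytes:
    """Remove trailing carriage returns and line feeds from a bytes line."""
    # rstrip() with a bytes argument strips any combination of these bytes
    # from the right-hand end, so it handles "\n", "\r\n" and "\r" alike.
    return line.rstrip(b"\r\n")


raw = b"|~Some Split text over\r\n"
stripped = strip_eol(raw)
print(stripped)           # still a bytes object
print(stripped.decode())  # a str, decoded as UTF-8 by default
```

A file opened in binary mode accepts such bytes through out.write(stripped). print() always writes str, so it cannot target a binary file.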
However, I don't see why you have to open your file in binary mode.
Something like
with open(infile) as instream:
    with open(outfile, "w") as outstream:
        for line in instream:
            # Keep the newline only on lines that end a record.
            if not line.endswith("||\n"):
                line = line.rstrip("\n")
            outstream.write(line)
should do the right thing.
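To try the idea without touching real files, the same loop can be wrapped in a function and fed in-memory streams. This is a sketch under assumptions: the function name join_records, its terminator parameter and the shortened sample data are invented for the example.

```python
import io


def join_records(instream, outstream, terminator="||"):
    """Copy instream to outstream, keeping newlines only after a terminator."""
    for line in instream:
        # A line that does not end a record loses its newline, so the
        # next line is appended directly to it in the output.
        if not line.endswith(terminator + "\n"):
            line = line.rstrip("\n")
        outstream.write(line)


# Shortened, made-up records in the same shape as the sample data.
sample = io.StringIO(
    "1|~some text.|~||\n"
    "2|~Some Split text over\n"
    " several\n"
    " lines\n"
    "|~name|~||\n"
)
result = io.StringIO()
join_records(sample, result)
print(result.getvalue(), end="")
```

With real files opened in text mode, Python 3 by default translates "\r\n" to "\n" while reading, which is why the loop only has to look for "\n".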