Splitting a file from specific column content
Yigit Turgut
y.turgut at gmail.com
Sun Jan 22 11:17:39 EST 2012
On Jan 22, 4:45 pm, Roy Smith <r... at panix.com> wrote:
> In article
> <e1f0636a-195c-4fbb-931a-4d619d5f0... at g27g2000yqa.googlegroups.com>,
> Yigit Turgut <y.tur... at gmail.com> wrote:
> > Hi all,
>
> > I have a text file approximately 20 MB in size that contains about one
> > million lines. I was doing some processing on the data but then the
> > data rate increased and it now takes a very long time to process. I
> > import using numpy.loadtxt; here is a fragment of the data:
>
> > 0.000006 -0.0004
> > 0.000071 0.0028
> > 0.000079 0.0044
> > 0.000086 0.0104
> > .
> > .
> > .
>
> > The first column is the timestamp in seconds and the second column is
> > the data. The file contains 8 seconds of measurement, and I would like
> > to be able to split the file into 3 parts separated at specific time
> > locations. For example I want to divide the file into 3 parts, the first
> > part containing 3 seconds of data, the second containing 2 seconds of
> > data and the third containing 3 seconds.
>
> I would do this with standard unix tools:
>
> grep '^[012]' input.txt > first-three-seconds.txt
> grep '^[34]' input.txt > next-two-seconds.txt
> grep '^[567]' input.txt > next-three-seconds.txt
>
> Sure, it makes three passes over the data, but for 20 MB of data, you
> could have the whole job done in less time than it took me to type this.
>
> As a sanity check, I would run "wc -l" on each of the files and confirm
> that they add up to the original line count.
This works and is very fast, but unfortunately it missed a few hundred
lines.
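
The thread never establishes why those lines went missing. What can be said is that the three grep patterns only accept lines whose first character is one of 0 to 7, so a line starting with anything else (an 8, a leading space, a blank line) lands in none of the output files. A rough way to check is to tally the lines by first character. The sketch below uses made-up sample lines and a hypothetical helper name, count_first_chars; with the real data you would pass the open file object instead of the list.

```python
from collections import Counter


def count_first_chars(lines):
    """Tally lines by their first character (empty lines are keyed as '')."""
    return Counter(line[:1] for line in lines)


# Made-up sample; the last two lines are the kind the grep patterns skip.
sample = [
    "0.000006 -0.0004\n",
    "2.999000 0.1000\n",
    "3.500000 0.2000\n",
    "7.900000 0.3000\n",
    "8.000001 0.4000\n",
    " 4.200000 0.1000\n",
]

counts = count_first_chars(sample)
matched = sum(counts[c] for c in "01234567")
print(counts)
print(len(sample) - matched, "line(s) match none of the three patterns")
```

Any key outside 0 to 7 in the tally points at the lines that the "wc -l" sanity check would report as missing.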
On Jan 22, 5:19 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> On 22/01/2012 14:32, Yigit Turgut wrote:
> > Hi all,
>
> > I have a text file approximately 20 MB in size that contains about one
> > million lines. I was doing some processing on the data but then the
> > data rate increased and it now takes a very long time to process. I
> > import using numpy.loadtxt; here is a fragment of the data:
>
> > 0.000006 -0.0004
> > 0.000071 0.0028
> > 0.000079 0.0044
> > 0.000086 0.0104
> > .
> > .
> > .
>
> > The first column is the timestamp in seconds and the second column is
> > the data. The file contains 8 seconds of measurement, and I would like
> > to be able to split the file into 3 parts separated at specific time
> > locations. For example I want to divide the file into 3 parts, the first
> > part containing 3 seconds of data, the second containing 2 seconds of
> > data and the third containing 3 seconds. Splitting based on file size
> > doesn't work accurately for this data; some columns end up missing,
> > etc. I need to split depending on the column content:
>
> > 1 - read the file until the first character of column1 is 3 (3 seconds)
> > 2 - save this region to another file
> > 3 - read the file where the first character of column1 is between 3
> > and 5 (2 seconds)
> > 4 - save this region to another file
> > 5 - read the file where the first character of column1 is between 5
> > and 8 (3 seconds)
> > 6 - save this region to another file
>
> > I need to do this exactly because numpy.loadtxt and genfromtxt don't
> > cope well with missing columns / rows. I even tried the invalid_raise
> > parameter of genfromtxt but no luck.
>
> > I am sure it's a few lines of code for experienced users and I would
> > appreciate some guidance.
>
> Here's a solution in Python 3:
>
> input_path = "..."
> section_1_path = "..."
> section_2_path = "..."
> section_3_path = "..."
>
> with open(input_path) as input_file:
>     try:
>         line = next(input_file)
>
>         # Copy section 1.
>         with open(section_1_path, "w") as output_file:
>             while line[0] < "3":
>                 output_file.write(line)
>                 line = next(input_file)
>
>         # Copy section 2.
>         with open(section_2_path, "w") as output_file:
>             while line[5] < "5":
>                 output_file.write(line)
>                 line = next(input_file)
>
>         # Copy section 3.
>         with open(section_3_path, "w") as output_file:
>             while True:
>                 output_file.write(line)
>                 line = next(input_file)
>     except StopIteration:
>         pass
With the following correction:
    while line[5] < "5":
should be
    while line[0] < "5":
This works well.
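
A note on why comparing line[0] is enough here: the recording is 8 seconds long, so every timestamp has exactly one digit before the decimal point. If a recording ran to 10 seconds or more, the first character alone would no longer identify the second. A minimal sketch of a numeric comparison follows; the helper name timestamp is hypothetical, and it assumes every line has a parseable first column.

```python
def timestamp(line):
    """Return the first whitespace-separated field of a line as a float."""
    return float(line.split()[0])


# A line at 12 seconds: its first character "1" sorts before "3",
# so the character test would file it under the first section.
late_line = "12.000001 0.5000\n"
print(late_line[0] < "3")          # character comparison
print(timestamp(late_line) < 3.0)  # numeric comparison
```

Parsing every line as a float is slower than looking at one character, so for the 8-second file in this thread the character test remains the cheaper choice.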
On Jan 22, 5:39 pm, Arnaud Delobelle <arno... at gmail.com> wrote:
> On 22 January 2012 15:19, MRAB <pyt... at mrabarnett.plus.com> wrote:
> > Here's a solution in Python 3:
>
> > input_path = "..."
> > section_1_path = "..."
> > section_2_path = "..."
> > section_3_path = "..."
>
> > with open(input_path) as input_file:
> >     try:
> >         line = next(input_file)
>
> >         # Copy section 1.
> >         with open(section_1_path, "w") as output_file:
> >             while line[0] < "3":
> >                 output_file.write(line)
> >                 line = next(input_file)
>
> >         # Copy section 2.
> >         with open(section_2_path, "w") as output_file:
> >             while line[5] < "5":
> >                 output_file.write(line)
> >                 line = next(input_file)
>
> >         # Copy section 3.
> >         with open(section_3_path, "w") as output_file:
> >             while True:
> >                 output_file.write(line)
> >                 line = next(input_file)
> >     except StopIteration:
> >         pass
>
> Or more succinctly (but not tested):
>
> sections = [
>     ("3", "section_1")
>     ("5", "section_2")
>     ("\xFF", "section_3")
> ]
>
> with open(input_path) as input_file:
>     lines = iter(input_file)
>     for end, path in sections:
>         with open(path, "w") as output_file:
>             for line in lines:
>                 if line >= end:
>                     break
>                 output_file.write(line)
>
> --
> Arnaud
Good idea, especially when dealing with variable numbers of sections.
But somehow I got:
    ("5", "section_2")
TypeError: 'tuple' object is not callable
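
The TypeError comes from the missing commas between the tuples in the sections list: without them Python reads ("3", "section_1")("5", "section_2") as an attempt to call the first tuple. There is also a second issue in the untested loop: the line that triggers the break has already been taken from the iterator, so it would not reach the next section's file. Below is a sketch of one way to handle both, keeping the same string comparison against the end markers. The function name split_sections, the sample lines and the temporary directory are only for illustration.

```python
import os
import tempfile


def split_sections(lines, sections):
    """Copy lines into consecutive files, switching at each end marker.

    sections is a list of (end, path) pairs in ascending order; a line
    goes to a section while line < end (plain string comparison).
    """
    lines = iter(lines)
    pending = next(lines, None)  # line read but not yet written
    for end, path in sections:
        with open(path, "w") as output_file:
            while pending is not None and pending < end:
                output_file.write(pending)
                pending = next(lines, None)


# Made-up sample data written into a temporary directory.
sample = [
    "0.000006 -0.0004\n",
    "2.900000 0.1000\n",
    "3.000000 0.2000\n",
    "4.900000 0.3000\n",
    "5.000000 0.4000\n",
    "7.900000 0.5000\n",
]

with tempfile.TemporaryDirectory() as tmp:
    sections = [
        ("3", os.path.join(tmp, "section_1")),
        ("5", os.path.join(tmp, "section_2")),
        ("\xFF", os.path.join(tmp, "section_3")),
    ]
    split_sections(sample, sections)
    for _, path in sections:
        with open(path) as section_file:
            print(os.path.basename(path), len(section_file.readlines()))
```

Holding the boundary line in pending, rather than breaking out of a for loop, is what carries it over into the next section. With the real data, the open input file would be passed in place of sample.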