Splitting a file from specific column content

Yigit Turgut y.turgut at gmail.com
Sun Jan 22 11:17:39 EST 2012


On Jan 22, 4:45 pm, Roy Smith <r... at panix.com> wrote:
> In article
> <e1f0636a-195c-4fbb-931a-4d619d5f0... at g27g2000yqa.googlegroups.com>,
>  Yigit Turgut <y.tur... at gmail.com> wrote:

> > Hi all,
>
> > I have a text file of approximately 20 MB that contains about one
> > million lines. I was doing some processing on the data, but then the
> > data rate increased and it now takes a very long time to process. I
> > import it using numpy.loadtxt; here is a fragment of the data:
>
> > 0.000006    -0.0004
> > 0.000071    0.0028
> > 0.000079    0.0044
> > 0.000086    0.0104
> > .
> > .
> > .
>
> > The first column is the timestamp in seconds and the second column is
> > the data. The file contains 8 seconds of measurement, and I would like
> > to split it into 3 parts at specific time locations. For example, the
> > first part would contain 3 seconds of data, the second 2 seconds of
> > data, and the third 3 seconds.
>
> I would do this with standard unix tools:
>
> grep '^[012]' input.txt > first-three-seconds.txt
> grep '^[34]' input.txt > next-two-seconds.txt
> grep '^[567]' input.txt > next-three-seconds.txt
>
> Sure, it makes three passes over the data, but for 20 MB of data, you
> could have the whole job done in less time than it took me to type this.
>
> As a sanity check, I would run "wc -l" on each of the files and confirm
> that they add up to the original line count.

This works and is very fast, but unfortunately it missed a few hundred
lines.
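
One way to find out which lines were missed is to collect every line that
none of the three patterns selects. The sketch below does this on a few
made-up sample lines (they are assumptions, not taken from the real file);
a leading space or a timestamp of 8 seconds or more are two possible
causes, but the actual cause is not known from this thread.

```python
import re

# The same three patterns as in the grep commands quoted above.
PATTERNS = [re.compile(r"^[012]"), re.compile(r"^[34]"), re.compile(r"^[567]")]

def unmatched_lines(lines):
    """Return the lines that none of the three patterns would select."""
    return [line for line in lines if not any(p.match(line) for p in PATTERNS)]

# Hypothetical sample data, invented for illustration only.
sample = [
    "0.000006    -0.0004\n",
    " 3.100000    0.0028\n",   # leading space: not matched by '^[34]'
    "8.000002    0.0044\n",    # first character is 8: matched by no pattern
    "7.999990    0.0104\n",
]
missed = unmatched_lines(sample)
```

Running the same function over the lines of the real input file should show
what the skipped lines have in common.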

On Jan 22, 5:19 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> On 22/01/2012 14:32, Yigit Turgut wrote:
> > Hi all,
>
> > I have a text file of approximately 20 MB that contains about one
> > million lines. I was doing some processing on the data, but then the
> > data rate increased and it now takes a very long time to process. I
> > import it using numpy.loadtxt; here is a fragment of the data:
>
> > 0.000006    -0.0004
> > 0.000071    0.0028
> > 0.000079    0.0044
> > 0.000086    0.0104
> > .
> > .
> > .
>
> > The first column is the timestamp in seconds and the second column is
> > the data. The file contains 8 seconds of measurement, and I would like
> > to split it into 3 parts at specific time locations. For example, the
> > first part would contain 3 seconds of data, the second 2 seconds of
> > data, and the third 3 seconds. Splitting based on file size isn't
> > accurate enough for this data; some columns go missing, and so on.
> > I need to split depending on the column content:
>
> > 1 - read the file until the first character of column 1 is 3 (3 seconds)
> > 2 - save this region to another file
> > 3 - read the file where the first character of column 1 is between 3
> > and 5 (2 seconds)
> > 4 - save this region to another file
> > 5 - read the file where the first character of column 1 is between 5
> > and 8 (3 seconds)
> > 6 - save this region to another file
>
> > I need to do this exactly because numpy.loadtxt and genfromtxt don't
> > cope well with missing columns / rows. I even tried the invalid_raise
> > parameter of genfromtxt, but no luck.
>
> > I am sure it's a few lines of code for experienced users and I would
> > appreciate some guidance.
>
> Here's a solution in Python 3:
>
> input_path = "..."
> section_1_path = "..."
> section_2_path = "..."
> section_3_path = "..."
>
> with open(input_path) as input_file:
>      try:
>          line = next(input_file)
>
>          # Copy section 1.
>          with open(section_1_path, "w") as output_file:
>              while line[0] < "3":
>                  output_file.write(line)
>                  line = next(input_file)
>
>          # Copy section 2.
>          with open(section_2_path, "w") as output_file:
>              while line[5] < "5":
>                  output_file.write(line)
>                  line = next(input_file)
>
>          # Copy section 3.
>          with open(section_3_path, "w") as output_file:
>              while True:
>                  output_file.write(line)
>                  line = next(input_file)
>      except StopIteration:
>          pass

With the following correction:

while line[5] < "5":
should be
while line[0] < "5":

This works well.
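
Comparing the first character only works while every timestamp has a
single digit before the decimal point. A variant that parses the first
column as a number is sketched below; the names `section_index` and
`split_lines` and the sample lines are made up for illustration, and the
boundaries 3.0 and 5.0 come from the 3 s / 2 s / 3 s split described above.

```python
import bisect

def section_index(line, boundaries):
    """Return which section a line belongs to, based on its first column."""
    timestamp = float(line.split()[0])
    # bisect_right puts a timestamp equal to a boundary into the later section.
    return bisect.bisect_right(boundaries, timestamp)

def split_lines(lines, boundaries):
    """Group lines into len(boundaries) + 1 sections by timestamp."""
    sections = [[] for _ in range(len(boundaries) + 1)]
    for line in lines:
        if line.strip():  # ignore blank lines
            sections[section_index(line, boundaries)].append(line)
    return sections

# Hypothetical sample lines, invented for illustration only.
sample = [
    "0.000006 -0.0004\n", "2.999999 0.0028\n", "3.000000 0.0044\n",
    "4.500000 0.0104\n", "5.000001 0.0100\n", "7.900000 0.0200\n",
]
parts = split_lines(sample, [3.0, 5.0])
```

This keeps every section in memory before anything is written, which
should be acceptable for a 20 MB file; each list in `parts` can then be
written out with `writelines`.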

On Jan 22, 5:39 pm, Arnaud Delobelle <arno... at gmail.com> wrote:
> On 22 January 2012 15:19, MRAB <pyt... at mrabarnett.plus.com> wrote:
> > Here's a solution in Python 3:
>
> > input_path = "..."
> > section_1_path = "..."
> > section_2_path = "..."
> > section_3_path = "..."
>
> > with open(input_path) as input_file:
> >    try:
> >        line = next(input_file)
>
> >        # Copy section 1.
> >        with open(section_1_path, "w") as output_file:
> >            while line[0] < "3":
> >                output_file.write(line)
> >                line = next(input_file)
>
> >        # Copy section 2.
> >        with open(section_2_path, "w") as output_file:
> >            while line[5] < "5":
> >                output_file.write(line)
> >                line = next(input_file)
>
> >        # Copy section 3.
> >        with open(section_3_path, "w") as output_file:
> >            while True:
> >                output_file.write(line)
> >                line = next(input_file)
> >    except StopIteration:
> >        pass
>
> Or more succinctly (but not tested):
>
> sections = [
>     ("3", "section_1")
>     ("5", "section_2")
>     ("\xFF", "section_3")
> ]
>
> with open(input_path) as input_file:
>     lines = iter(input_file)
>     for end, path in sections:
>         with open(path, "w") as output_file:
>             for line in lines:
>                 if line >= end:
>                     break
>                 output_file.write(line)
>
> --
> Arnaud

Good idea, especially when dealing with a variable number of sections.
But somehow I got:

    ("5", "section_2")
TypeError: 'tuple' object is not callable
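
The TypeError comes from the missing commas in the list: without them,
Python reads `("3", "section_1") ("5", "section_2")` as an attempt to call
the first tuple with the second as its arguments. Adding the commas fixes
that. The quoted loop also appears to drop the line that triggers each
`break`, because that line has already been consumed from the iterator. A
sketch that carries the boundary line over to the next section follows;
`split_file`, its `out_dir` parameter, and the sample data are
hypothetical names and values chosen for this example.

```python
import os
import tempfile

sections = [
    ("3", "section_1"),       # the commas after each tuple were missing
    ("5", "section_2"),
    ("\xFF", "section_3"),    # "\xFF" sorts after any ASCII digit
]

def split_file(input_path, sections, out_dir="."):
    """Write each run of lines that sorts below `end` to its own file."""
    with open(input_path) as input_file:
        lines = iter(input_file)
        line = next(lines, None)
        for end, name in sections:
            with open(os.path.join(out_dir, name), "w") as output_file:
                # The line that ends one section is kept in `line`,
                # so it becomes the first line of the next section.
                while line is not None and line < end:
                    output_file.write(line)
                    line = next(lines, None)

# Small demonstration on made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    input_path = os.path.join(tmp, "input.txt")
    with open(input_path, "w") as f:
        f.write("0.000006 -0.0004\n2.900000 0.0028\n"
                "3.000001 0.0044\n5.200000 0.0104\n")
    split_file(input_path, sections, out_dir=tmp)
    counts = []
    for _, name in sections:
        with open(os.path.join(tmp, name)) as f:
            counts.append(len(f.readlines()))
```

As in the quoted version, the comparison is a plain string comparison, so
it relies on every line starting with the timestamp's first digit.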



