CSV reader ignore brackets

MRAB python at mrabarnett.plus.com
Tue Sep 24 19:50:40 EDT 2019


On 2019-09-25 00:09, Cameron Simpson wrote:
> On 24Sep2019 15:55, Mihir Kothari <mihir.kothari at gmail.com> wrote:
>>I am using python 3.4. I have a CSV file as below:
>>
>>ABC,PQR,(TEST1,TEST2)
>>FQW,RTE,MDE
> 
> Really? No quotes around the (TEST1,TEST2) column value? I would have
> said this is invalid data, but that does not help you.
> 
>>Basically comma-separated rows, where some rows have a column whose value
>>is array-like, i.e. enclosed in brackets.
>>So I need to read the file and treat such a column as one field, i.e. do not
>>split on a comma if it is inside the brackets.
>>
>>In short I need to read a CSV file where separator inside the brackets
>>needs to be ignored.
>>
>>Output:
>>Column:   1       2                3
>>Row1:    ABC  PQR  (TEST1,TEST2)
>>Row2:    FQW  RTE  MDE
>>
>>Can you please help with the snippet?
> 
> I would be reaching for a regular expression. If you partition your
> values into 2 types: those starting and ending in a bracket, and those
> not, you could write a regular expression for the former:
> 
>      \([^)]*\)
> 
> which matches a string like (.....) (with, importantly, no embedded
> brackets, only those at the beginning and end).
> 
> And you can write a regular expression like:
> 
>      [^,]*
> 
> for a value containing no commas i.e. all the other values.
> 
> Test the bracketed one first, because the second one always matches
> something.
> 
> Then you would not use the CSV module (which expects better formed data
> than you have) and instead write a simple parser for a line of text
> which tries to match one of these two expressions repeatedly to consume
> the line. Something like this (UNTESTED):
> 
>      import re
> 
>      bracketed_re = re.compile(r'\([^)]*\)')
>      no_commas_re = re.compile(r'[^,]*')
> 
>      def split_line(line):
>        line = line.rstrip()  # drop trailing whitespace/newline
>        fields = []
>        offset = 0
>        while offset < len(line):
>          m = bracketed_re.match(line, offset)
>          if m:
>            field = m.group()
>          else:
>            m = no_commas_re.match(line, offset)   # this always matches
>            field = m.group()
>          fields.append(field)
>          offset += len(field)
>          if line.startswith(',', offset):
>            # another column
>            offset += 1
>          elif offset < len(line):
>            raise ValueError(
>              "incomplete parse at offset %d, line=%r" % (offset, line))
>        return fields
> 
> Then read the lines of the file and split them into fields:
> 
>      rows = []
>      with open(datafilename) as f:
>        for line in f:
>          fields = split_line(line)
>          rows.append(fields)
> 
> So basically you're writing a little parser. If you have nested brackets
> things get harder.
> 
You can simplify that somewhat to this:

import re
rows = []

with open(datafilename) as f:
     for line in f:
         rows.append(re.findall(r'(\([^)]*\)|(?=.)[^,\n]*),?', line))
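
To see what that pattern returns, here is a small check run against the
sample data from the original post instead of a file. The names FIELD_RE
and split_fields are only for illustration. The (?=.) lookahead is what
stops the pattern from matching an empty field at the newline or at the
end of the line.

```python
import re

# Same pattern as in the one-liner; FIELD_RE and split_fields are
# illustrative names, not part of any library.
FIELD_RE = re.compile(r'(\([^)]*\)|(?=.)[^,\n]*),?')

def split_fields(line):
    """Return the fields of one line, keeping a (...) group whole."""
    # findall() returns the contents of the single capture group,
    # so the separating commas are not part of the result.
    return FIELD_RE.findall(line)

sample = "ABC,PQR,(TEST1,TEST2)\nFQW,RTE,MDE\n"
rows = [split_fields(line) for line in sample.splitlines(keepends=True)]
for row in rows:
    print(row)
# ['ABC', 'PQR', '(TEST1,TEST2)']
# ['FQW', 'RTE', 'MDE']
```

An empty field in the middle of a line ("A,,B") comes back as an empty
string, but an empty field after a trailing comma ("A,B,") is not
reported, so check that against your data.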

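As Cameron says, nested brackets are harder: neither regular expression
above copes with something like (TEST1,(A,B),TEST2). If your data does
have them, one possible approach is to drop the regex and count bracket
depth, splitting only on commas seen at depth 0. This is a minimal sketch
under that assumption; split_nested is a made-up name.

```python
def split_nested(line):
    """Split on commas that are not inside parentheses (nesting allowed)."""
    fields = []
    current = []   # characters of the field being built
    depth = 0      # how many '(' are currently open
    for ch in line.rstrip('\n'):
        if ch == '(':
            depth += 1
        elif ch == ')':
            if depth == 0:
                raise ValueError("unbalanced ')' in %r" % line)
            depth -= 1
        if ch == ',' and depth == 0:
            # a top-level comma ends the current field
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unclosed '(' in %r" % line)
    fields.append(''.join(current))
    return fields

print(split_nested("ABC,PQR,(TEST1,(A,B),TEST2)\n"))
# ['ABC', 'PQR', '(TEST1,(A,B),TEST2)']
```

Unlike the regex version, this one reports a trailing empty field and
returns [''] for an empty line.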

