CSV reader ignore brackets
MRAB
python at mrabarnett.plus.com
Tue Sep 24 19:50:40 EDT 2019
On 2019-09-25 00:09, Cameron Simpson wrote:
> On 24Sep2019 15:55, Mihir Kothari <mihir.kothari at gmail.com> wrote:
>>I am using python 3.4. I have a CSV file as below:
>>
>>ABC,PQR,(TEST1,TEST2)
>>FQW,RTE,MDE
>
> Really? No quotes around the (TEST1,TEST2) column value? I would have
> said this is invalid data, but that does not help you.
>
>>Basically comma-separated rows, where some rows have data in a column that
>>is array-like, i.e. in brackets.
>>So I need to read the file and treat such columns as one i.e. do not
>>separate based on comma if it is inside the bracket.
>>
>>In short I need to read a CSV file where separator inside the brackets
>>needs to be ignored.
>>
>>Output:
>>Column: 1 2 3
>>Row1: ABC PQR (TEST1,TEST2)
>>Row2: FQW RTE MDE
>>
>>Can you please help with the snippet?
>
> I would be reaching for a regular expression. If you partition your
> values into 2 types: those starting and ending in a bracket, and those
> not, you could write a regular expression for the former:
>
> \([^)]*\)
>
> which matches a string like (.....) (with, importantly, no embedded
> brackets, only those at the beginning and end).
>
> And you can write a regular expression like:
>
> [^,]*
>
> for a value containing no commas i.e. all the other values.
>
> Test the bracketed one first, because the second one always matches
> something.
>
> Then you would not use the CSV module (which expects better formed data
> than you have) and instead write a simple parser for a line of text
> which tries to match one of these two expressions repeatedly to consume
> the line. Something like this (UNTESTED):
>
>     import re
>
>     bracketed_re = re.compile(r'\([^)]*\)')
>     no_commas_re = re.compile(r'[^,]*')
>
>     def split_line(line):
>         line = line.rstrip()  # drop trailing whitespace/newline
>         fields = []
>         offset = 0
>         while offset < len(line):
>             # try the bracketed form first; the other pattern always matches
>             m = bracketed_re.match(line, offset)
>             if m:
>                 field = m.group()
>             else:
>                 m = no_commas_re.match(line, offset)  # this always matches
>                 field = m.group()
>             fields.append(field)
>             offset += len(field)
>             if line.startswith(',', offset):
>                 # another column
>                 offset += 1
>             elif offset < len(line):
>                 raise ValueError(
>                     "incomplete parse at offset %d, line=%r" % (offset, line))
>         return fields
>
> Then read the lines of the file and split them into fields:
>
>     rows = []
>     with open(datafilename) as f:
>         for line in f:
>             fields = split_line(line)
>             rows.append(fields)
>
> So basically you're writing a little parser. If you have nested brackets,
> things get harder.
>
You can simplify that somewhat to this:

import re

rows = []

with open(datafilename) as f:
    for line in f:
        rows.append(re.findall(r'(\([^)]*\)|(?=.)[^,\n]*),?', line))
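A quick way to check the findall approach against the two sample rows from the original question is sketched below. It is only an illustration: the names FIELD_RE and split_rows are hypothetical, and io.StringIO stands in for the real data file so the snippet runs on its own. The pattern itself is the one shown above, unchanged.

```python
import io
import re

# Same pattern as in the one-liner: either a bracketed field, or a run of
# non-comma characters (the lookahead stops an empty match at end of line),
# followed by an optional separating comma.
FIELD_RE = re.compile(r'(\([^)]*\)|(?=.)[^,\n]*),?')

def split_rows(lines):
    """Split each line into fields, keeping (...) groups intact.

    Hypothetical helper for illustration; nested brackets are not handled.
    """
    return [FIELD_RE.findall(line) for line in lines]

# Stand-in for open(datafilename), using the sample data from the question.
sample = io.StringIO("ABC,PQR,(TEST1,TEST2)\nFQW,RTE,MDE\n")
rows = split_rows(sample)
for row in rows:
    print(row)
```

An empty field in the middle of a line comes back as an empty string, but a trailing comma at the end of a line does not produce a final empty field, so this may need adjusting if the data can end in an empty column.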