Using csv.DictReader with \r\n in the middle of fields
pstatham at sefas.com
Thu Oct 14 10:48:41 CEST 2010
On Oct 13, 4:01 pm, Neil Cerutti <ne... at norwich.edu> wrote:
> On 2010-10-13, pstatham <pstat... at sefas.com> wrote:
> > Hopefully this will interest some, I have a csv file (can be
> > downloaded from http://www.paulstathamphotography.co.uk/45.txt) which
> > has five fields separated by ~ delimiters. To read this I've been
> > using a csv.DictReader which works in 99% of the cases. Occasionally
> > however the description field has errant \r\n characters in the middle
> > of the record. This causes the reader to assume it's a new record and
> > try to read it.
> Here's an alternative idea. Working with csv module for this job
> is too difficult for me. ;)
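> import re
>
> # Five ~-delimited fields, then a newline, then the rest of the file.
> record_re = "(?P<PROGTITLE>.*?)~(?P<SUBTITLE>.*?)~(?P<EPISODE>.*?)~(?P<DESCRIPTION>.*?)~(?P<DATE>.*?)\n(.*)"
>
> def parse_file(fname):
>     with open(fname) as f:
>         data = f.read()
>     m = re.match(record_re, data, flags=re.M | re.S)
>     while m:
>         yield m.groupdict()
>         # Group 6 is the unparsed remainder; match the next record in it.
>         m = re.match(record_re, m.group(6), flags=re.M | re.S)
>
> for record in parse_file('45.txt'):
>     print(record)  # loop body is missing from the archived post; printing is an assumption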
> Neil Cerutti
Thanks guys, I can't alter the source data.
I wouldn't have considered regex, but it's a good idea, as I can then
define my own record structure instead of the reader dictating to me
what a record is.
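
For what it's worth, here is a minimal sketch of that idea, not tested against the real 45.txt. It assumes that no field ever contains a literal ~ and that only the description field can contain line breaks. The field names are taken from Neil's regex; RECORD_RE, parse_records and parse_file are names made up for this example. It uses re.finditer rather than re-matching the remainder of the string, and also accepts a final record with no trailing newline.

```python
import re

# Assumption: fields never contain "~", and only DESCRIPTION may span lines.
RECORD_RE = re.compile(
    r"(?P<PROGTITLE>[^~\r\n]*)~"
    r"(?P<SUBTITLE>[^~\r\n]*)~"
    r"(?P<EPISODE>[^~\r\n]*)~"
    r"(?P<DESCRIPTION>[^~]*)~"            # may contain embedded \r\n
    r"(?P<DATE>[^~\r\n]*)(?:\r?\n|\Z)"    # record ends at a line break or end of input
)

def parse_records(text):
    """Yield one dict per record found in text."""
    for m in RECORD_RE.finditer(text):
        yield m.groupdict()

def parse_file(fname):
    """Read the whole file and yield its records."""
    # newline="" keeps \r\n exactly as stored in the file (Python 3).
    with open(fname, newline="") as f:
        yield from parse_records(f.read())
```

Because the pattern itself says which fields may span lines, a stray \r\n inside the description no longer starts a new record.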