Using csv.DictReader with \r\n in the middle of fields

pstatham pstatham at sefas.com
Thu Oct 14 10:48:41 CEST 2010


On Oct 13, 4:01 pm, Neil Cerutti <ne... at norwich.edu> wrote:
> On 2010-10-13, pstatham <pstat... at sefas.com> wrote:
>
> > Hopefully this will interest some, I have a csv file (can be
> > downloaded from http://www.paulstathamphotography.co.uk/45.txt) which
> > has five fields separated by ~ delimiters. To read this I've been
> > using a csv.DictReader which works in 99% of the cases. Occasionally
> > however the description field has errant \r\n characters in the middle
> > of the record. This causes the reader to assume it's a new record and
> > try to read it.
>
> Here's an alternative idea. Working with csv module for this job
> is too difficult for me. ;)
>
> import re
>
> record_re = "(?P<PROGTITLE>.*?)~(?P<SUBTITLE>.*?)~(?P<EPISODE>.*?)~(?P<DESCRIPTION>.*?)~(?P<DATE>.*?)\n(.*)"
>
> def parse_file(fname):
>     with open(fname) as f:
>         data = f.read()
>         m = re.match(record_re, data, flags=re.M | re.S)
>         while m:
>             yield m.groupdict()
>             m = re.match(record_re, m.group(6), flags=re.M | re.S)
>
> for record in parse_file('45.txt'):
>     print(record)
>
> --
> Neil Cerutti

Thanks guys, I can't alter the source data.

I wouldn't have considered regex, but it's a good idea, as I can then
define my own record structure instead of the reader dictating to me
what a record is.


