RE Help splitting CVS data

Roy Smith roy at panix.com
Mon Jan 21 01:00:50 CET 2013


In article <3e1e8567-b9f4-446a-8a59-75f45367d2ac at googlegroups.com>,
 Garry <ggkraemer at gmail.com> wrote:

> Actual data:
> [F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
> [F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a
> 
> code snippet follows:
> 
> import os
> import re
> #I'm using the following regex in an attempt to decode the data:

First suggestion: don't try to parse CSV data with a regex.  I'm a huge 
regex fan, but it's just the wrong tool for this job; quoted fields with 
embedded commas, like the place names in your data, are exactly what 
trips it up.  Use the built-in csv module 
(http://docs.python.org/2/library/csv.html).  Or, if you want something 
fancier, read_csv() from pandas (http://tinyurl.com/ajxdxjm).
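
As a rough sketch of what that looks like with your two sample rows 
(written for Python 3; I'm assuming the trailing 0x0a in your post is 
just your way of showing the newline, so I've used a real "\n" instead, 
and "sample" is a name I made up for illustration):

```python
import csv
import io

# The two rows from the post, with the newline written as "\n".
sample = (
    '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,\n'
    '[F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,\n'
)

# csv.reader accepts any iterable of lines; an open file works the same way.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    # The quoted place name comes back as one field, commas and all,
    # with the surrounding quotes already stripped.
    print(row)
```

Each row should come back as a list of seven strings, the last two 
empty.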

Second, when you use regexes, *always* write the pattern as a raw 
string, so that backslashes reach the regex engine untouched:

RegExp2 = r'....'
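
To see why it matters, here's a small illustration (the pattern and the 
test string are made up for the example, not taken from your code).  In 
an ordinary string literal, "\b" is a backspace character; only in a raw 
string does the regex engine see backslash-b, meaning a word boundary:

```python
import re

plain = "\bcat\b"    # Python converts each \b to a backspace character
raw = r"\bcat\b"     # backslashes kept; the re module sees \b (word boundary)

# The two strings are not even the same length.
print(len(plain), len(raw))

text = "a cat sat"
print(re.search(plain, text))   # looks for backspace + "cat" + backspace
print(re.search(raw, text))     # looks for the whole word "cat"
```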

Lastly, take a look at the re.VERBOSE flag.  It lets you split a monster 
regex across several lines and add comments to it.  Together, re.VERBOSE 
and raw strings can make the difference between line noise like this:

> RegExp2 = 
> "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d
> {,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"

and something that mere mortals can understand.
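
For instance, here is one way the pattern above might be laid out.  This 
is a sketch, not a character-for-character translation: I've tightened 
the date to exactly 4-2-2 digits and replaced the greedy .* fields with 
"anything but a comma", both of which are my assumptions about what you 
intended.  The names "record" and "line" are also mine.  (And the csv 
module is still the better tool.)

```python
import re

record = re.compile(r"""
    ^
    (\[[A-Z]\d+\]) ,          # first bracketed ID, e.g. [F0244]
    (\[[A-Z]\d+\]) ,          # second bracketed ID
    (\[[A-Z]\d+\]) ,          # third bracketed ID
    (\d{4}-\d{2}-\d{2}) ,     # date, assumed to be YYYY-MM-DD
    ("[^"]*" | [^,]*) ,       # fifth field: quoted (may hold commas) or bare
    ([^,]*) ,                 # sixth field, possibly empty
    ([^,]*)                   # seventh field, possibly empty
    $
""", re.VERBOSE)

line = '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,'
m = record.match(line)
print(m.groups())
```

Note that, unlike the csv module, this leaves the double quotes on the 
fifth field; you'd have to strip them yourself.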


