fixing an horrific formatted csv file.

MRAB python at mrabarnett.plus.com
Tue Jul 1 16:32:23 CEST 2014


On 2014-07-01 15:04, flebber wrote:
> What I am trying to do is to reformat a csv file into something more usable.
> currently the file has no headers, multiple lines with varying columns that are not related.
>
> This is a sample
>
> Meeting,05/07/14,RHIL,Rosehill Gardens,Weights,TAB,+3m Entire Circuit,          ,
> Race,1,CIVIC STAKES,CIVIC,CIVIC,1350,~         ,3U        ,~         ,QLT       ,54,0,0,5/07/2014,,          ,          ,          ,          ,No class restriction, Quality, For Three-Years-Old and Upwards, No sex restriction, (Listed),Of $100000. First $60000, second $20000, third $10000, fourth $5000, fifth $2000, sixth $1000, seventh $1000, eighth $1000
> Horse,1,Bennetta,0,"Grahame Begg",Randwick,,0,0,16-3-1-3 $390450.00,,0,0,0,,98.00,M,
> Horse,2,Breakfast in Bed,0,"David Vandyke",Warwick Farm,,0,0,20-6-1-5 $201250.00,,0,0,0,,81.00,M,
> Horse,3,Capital Commander,0,"Gerald Ryan",Rosehill,,0,0,43-9-9-3 $438625.00,,0,0,0,,85.00,M,
> Horse,4,Coup Ay Tee (NZ),0,"Chris Waller",Rosehill,,0,0,35-9-6-5 $519811.00,,0,0,0,,101.00,G,
> Horse,5,Generalife,0,"John O'Shea",Warwick Farm,,0,0,19-6-1-3 $235045.00,,0,0,0,,87.00,G,
> Horse,6,He's Your Man (FR),0,"Chris Waller",Rosehill,,0,0,13-2-3-1 $108110.00,,0,0,0,,93.00,G,
> Horse,7,Hidden Kisses,0,"Chris Waller",Rosehill,,0,0,40-8-8-5 $565750.00,,0,0,0,,96.00,M,
> Horse,8,Oakfield Commands,0,"Gerald Ryan",Rosehill,,0,0,22-7-4-6 $269530.00,,0,0,0,,94.00,G,
> Horse,9,Taxmeifyoucan,0,"Gregory Hickman",Warwick Farm,,0,0,18-2-4-4 $539730.00,,0,0,0,,91.00,G,
> Horse,10,The Peak,0,"Bart & James Cummings",Randwick,,0,0,15-6-1-0 $426732.00,,0,0,0,,95.00,G,
> Horse,11,Tougher Than Ever (NZ),0,"Chris Waller",Rosehill,,0,0,17-3-2-3 $321613.00,,0,0,0,,97.00,H,
> Horse,12,TROMSO,0,"Chris Waller",Rosehill,,0,0,47-8-11-2 $622300.00,,0,0,0,,103.00,G,
> Race,2,FLYING WELTER - BENCHMARK 95 HCP,BM95,BM95,1100,BM95      ,3U        ,~         ,HCP       ,54,0,0,5/07/2014,,          ,          ,          ,          ,BenchMark 95, Handicap, For Three-Years-Old and Upwards, No sex restriction,Of $85000. First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000
> Horse,1,Big Bonanza,0,"Don Robb",Wyong,,0,57.5,31-9-4-3 $366860.00,,0,0,0,,92.00,G,
> Horse,2,Casual Choice,0,"Joseph Pride",Warwick Farm,,0,54,8-2-3-0 $105930.00,,0,0,0,
>
> So what I am trying to so is end up with an output like this.
>
> Meeting, Date, Race, Number, Name, Trainer, Location
> Rosehill, 05/07/14, 1, 1,Bennetta,"Grahame Begg",Randwick,
> Rosehill, 05/07/14, 1, 2,Breakfast in Bed,"David Vandyke",Warwick Farm,
>
> So as a start i thought i would try inserting the Meeting and Race number however I am just not getting it right.
>
> import csv
>
> outfile = open("/home/sayth/Scripts/cleancsv.csv", "w")
> with open('/home/sayth/Scripts/test.csv') as f:
>      f_csv = csv.reader(f)
>      headers = next(f_csv)
>      for row in f_csv:
>          meeting = row[3] in row[0] == 'Meeting'
>          new = row.insert(0, meeting)
>          while row[1] in row[0] == 'Race' < 9:  # pref less than next found row[0]
>
>              # grab row[1] as id number
>              id = row[1]
>              # from row[0] and insert it in first position
>              new_lines = new.insert(1, id)
>              outfile.write(new_lines)
>          outfile.close()
>
> How should I go about this?
>
There's no point in reading the first row as the headers because it
clearly doesn't contain just the headings.

First write a row for the header.

Then, for each row:

If the first field is 'Meeting', then remember the meeting, etc.

If the first field is 'Race', then remember the race, etc.

If the first field is 'Horse', then write the row with the additional
fields for race, etc.

And so on.

BTW, the indentation for the 'outfile.close()' line is wrong. It would,
of course, be better to use the 'with' statement for that file too.




More information about the Python-list mailing list