[Tutor] Handling missing fields in a csv file

Dave Angel davea at ieee.org
Tue Sep 29 23:04:57 CEST 2009


Eduardo Vieira wrote:
> Hello, I have a csv file,
a broken csv file
>  using the ";" as a delimiter. This file
> contains addresses. My problem is that some fields are missing in some
> rows and I would like to normalize the rows for a smoother import into
> Excel, for example.
> Here is an example. This is the header:
> Company;Telephone;Address;Prov;PCode
> While most of them have this header, some data would be like this:
> Abc blaba;403-403-4545;MB ---> missing address, city, and postal code
> Acme;123-403-4545;Winnipeg;MB;
> I think a good solution would be to add delimiter to represent empty fields:
> Abc blaba;403-403-4545;;;MB; -->missing address and postal code
> Acme;123-403-4545;;Winnipeg;MB;
>
> Fortunately the source has province names abbreviated (2 letters). I
> could also take into account a postal code, maybe:
> Given I have 2 simple functions:
> isProvince()
> isPostalCode():
> How I would write to the proper fields once that was returned true?
> Province has to go to row[3], and PCode to row[4] right?
>
>
> Eduardo
>
>   
On any problem of this type, the toughest part is coming up with the 
spec.  And sometimes you don't really know it till you've run all 
possible data through a program that analyzes it.  If the raw data is 
available to you, I'd suggest you start there.  And if it was converted 
to this file, and you no longer have the raw data, then at least analyze 
the program that did the faulty conversion.  And if that's not possible, 
at least plan for your conversion program to do enough error analysis to 
detect when the data does not meet the assumptions.


Let me make a guess about the data, and then the program will write itself.

(Guessing)  You have a file consisting of text lines.  Each line has 
between two and five fields, separated by semicolon, with no semicolon 
appearing inside any of the fields.  None of the fields is "quoted" so 
parsing is simply a matter of splitting by the semicolons.

Each field may be blank.  Multiple semicolons indicates a blank field 
between them.  The exhaustive list of fields and missing fields is below.

Company  Telephone     Address   Prov   PCode  (nothing missing)


Company  Telephone     Address   Prov


Company  Telephone     Address

Company  Telephone
Company  Telephone     Prov  PCode


Company  Telephone     PCode

Company  Telephone     Prov  

Company  Telephone     Address   PCode


You have a finite list of valid Prov, so isProvince() is reliable, and 
you have a reliable algorithm for PCode, so isPostalCode() is reliable.  
In other words, no Address will pass isPostalCode(), no PCode will pass 
isProvince(), and so on.

So, your algorithm:   Read the file, one line at a time, and split the 
line into two to five fields, in a list.
If the length of the list is less than 2, report error and quit.  If the 
length is 2, append three blank fields.
If item2 is a province, insert a blank field just before it.  if item3 
is a postalcode, insert a blank field just before it
If the (new) length of the list is 5, output the list and proceed to the 
next line.  Otherwise report an error and quit.

DaveA



More information about the Tutor mailing list