[Tutor] Handling missing fields in a csv file
Dave Angel
davea at ieee.org
Tue Sep 29 23:04:57 CEST 2009
Eduardo Vieira wrote:
> Hello, I have a csv file,
a broken csv file
> using the ";" as a delimiter. This file
> contains addresses. My problem is that some fields are missing in some
> rows and I would like to normalize the rows for a smoother import into
> Excel, for example.
> Here is an example. This is the header:
> Company;Telephone;Address;Prov;PCode
> While most of them have this header, some data would be like this:
> Abc blaba;403-403-4545;MB ---> missing address, city, and postal code
> Acme;123-403-4545;Winnipeg;MB;
> I think a good solution would be to add delimiter to represent empty fields:
> Abc blaba;403-403-4545;;;MB; -->missing address and postal code
> Acme;123-403-4545;;Winnipeg;MB;
>
> Fortunately the source has province names abbreviated (2 letters). I
> could also take into account a postal code, maybe:
> Given I have 2 simple functions:
> isProvince()
> isPostalCode():
> How I would write to the proper fields once that was returned true?
> Province has to go to row[3], and PCode to row[4] right?
>
>
> Eduardo
>
>
On any problem of this type, the toughest part is coming up with the
spec. And sometimes you don't really know it till you've run all
possible data through a program that analyzes it. If the raw data is
available to you, I'd suggest you start there. And if it was converted
to this file, and you no longer have the raw data, then at least analyze
the program that did the faulty conversion. And if that's not possible,
at least plan for your conversion program to do enough error analysis to
detect when the data does not meet the assumptions.
Let me make a guess about the data, and then the program will write itself.
(Guessing) You have a file consisting of text lines. Each line has
between two and five fields, separated by semicolon, with no semicolon
appearing inside any of the fields. None of the fields is "quoted" so
parsing is simply a matter of splitting by the semicolons.
Each field may be blank. Multiple semicolons indicates a blank field
between them. The exhaustive list of fields and missing fields is below.
Company Telephone Address Prov PCode (nothing missing)
Company Telephone Address Prov
Company Telephone Address
Company Telephone
Company Telephone Prov PCode
Company Telephone PCode
Company Telephone Prov
Company Telephone Address PCode
You have a finite list of valid Prov, so isProvince() is reliable, and
you have a reliable algorithm for PCode, so isPostalCode() is reliable.
In other words, no Address will pass isPostalCode(), no PCode will pass
isProvince(), and so on.
So, your algorithm: Read the file, one line at a time, and split the
line into two to five fields, in a list.
If the length of the list is less than 2, report error and quit. If the
length is 2, append three blank fields.
If item2 is a province, insert a blank field just before it. if item3
is a postalcode, insert a blank field just before it
If the (new) length of the list is 5, output the list and proceed to the
next line. Otherwise report an error and quit.
DaveA
More information about the Tutor
mailing list