[Tutor] Handling missing fields in a csv file

Wed Sep 30 22:32:15 CEST 2009

On Tue, Sep 29, 2009 at 3:04 PM, Dave Angel <davea at ieee.org> wrote:
> Eduardo Vieira wrote:
>>
>> Hello, I have a csv file,
>
> a broken csv file
>>
>>  using the ";" as a delimiter. This file
>> contains addresses. My problem is that some fields are missing in some
>> rows and I would like to normalize the rows for a smoother import into
>> Excel, for example.
>> Here is an example. This is the header:
>> Company;Telephone;Address;Prov;PCode
>> While most of them have this header, some data would be like this:
>> Abc blaba;403-403-4545;MB ---> missing address, city, and postal code
>> Acme;123-403-4545;Winnipeg;MB;
>> I think a good solution would be to add delimiter to represent empty
>> fields:
>> Abc blaba;403-403-4545;;;MB; -->missing address and postal code
>> Acme;123-403-4545;;Winnipeg;MB;
>>
>> Fortunately the source has province names abbreviated (2 letters). I
>> could also take into account a postal code, maybe:
>> Given I have 2 simple functions:
>> isProvince()
>> isPostalCode():
>> How I would write to the proper fields once that was returned true?
>> Province has to go to row[3], and PCode to row[4] right?
>>
>>
>> Eduardo
>>
>>
>
> On any problem of this type, the toughest part is coming up with the spec.
>  And sometimes you don't really know it till you've run all possible data
> through a program that analyzes it.  If the raw data is available to you,
> I'd suggest you start there.  And if it was converted to this file, and you
> no longer have the raw data, then at least analyze the program that did the
> faulty conversion.  And if that's not possible, at least plan for your
> conversion program to do enough error analysis to detect when the data does
> not meet the assumptions.
>
>
> Let me make a guess about the data, and then the program will write itself.
>
> (Guessing)  You have a file consisting of text lines.  Each line has between
> two and five fields, separated by semicolon, with no semicolon appearing
> inside any of the fields.  None of the fields is "quoted" so parsing is
> simply a matter of splitting by the semicolons.
>
> Each field may be blank.  Multiple semicolons indicates a blank field
> between them.  The exhaustive list of fields and missing fields is below.
>
> Company  Telephone     Address   Prov   PCode  (nothing missing)
>
>
> Company  Telephone     Address   Prov
>
>
> Company  Telephone     Address
>
> Company  Telephone
> Company  Telephone     Prov  PCode
>
>
> Company  Telephone     PCode
>
> Company  Telephone     Prov
> Company  Telephone     Address   PCode
>
>
> You have a finite list of valid Prov, so isProvince() is reliable, and you
> have a reliable algorithm for PCode, so isPostalCode() is reliable.  In
> other words, no Address will pass isPostalCode(), no PCode will pass
> isProvince(), and so on.
>
> So, your algorithm:   Read the file, one line at a time, and split the line
> into two to five fields, in a list.
> If the length of the list is less than 2, report error and quit.  If the
> length is 2, append three blank fields.
> If item2 is a province, insert a blank field just before it.  if item3 is a
> postalcode, insert a blank field just before it
> If the (new) length of the list is 5, output the list and proceed to the
> next line.  Otherwise report an error and quit.
>
> DaveA
>
>
Thank you for your insight Dave! I will explore your solution.
Eduardo