[Csv] multi-character delimiters, take two
Dave Cole
djc at object-craft.com.au
Sun Feb 9 01:56:46 CET 2003
>>>>> "Andrew" == Andrew McNamara <andrewm at object-craft.com.au> writes:
>> Here's the result. Inputs look like this:
>>
>> "842" "6Feb2003" "16:22:42" "ce0" "log" "drop" "1433"
>> "pD955C67D.dip.t-dialin.net" "stonewall" "2" "" "843" "6Feb2003"
>> "16:25:21" "ce0" "log" "drop" "325" "powered.by.bgames.be"
>> "129.105.117.83" "" " th_flags 14 message_info TCP packet out of
>> state" "844" "6Feb2003" "16:28:13" "ce0" "log" "drop" "nbname"
>> "200.212.86.130" "stonewall" "2" ""
Andrew> Everything is quoted? Then this will work like a charm:
Andrew> line[1:-1].split('" "')
>> It didn't actually skip the space, but the data is fairly regular,
>> so I can live with it.
Andrew> Okay - looks like the skipinitialspace stuff needs more
Andrew> testing - I doubt Dave coded it with delimiter=' ' in mind -
Andrew> it's a pretty pathological case... 8-)
It might be as simple as swapping the following tests:
    case START_FIELD:
        :
        :
        else if (c == self->dialect.delimiter) {
            /* save empty field */
            parse_save_field(self);
        }
        else if (c == ' ' && self->dialect.skipinitialspace)
            /* ignore space at start of field */
            ;
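To see why the order of those two tests matters when delimiter=' ', here
is a toy Python model of just the START_FIELD / IN_FIELD decisions. It is
only a sketch, not the module's parser: split_fields and its
space_test_first flag are made-up names, and quoting and escaping are
left out entirely.

```python
def split_fields(line, delimiter=" ", skipinitialspace=True,
                 space_test_first=False):
    """Toy model of START_FIELD / IN_FIELD only (hypothetical helper).

    space_test_first=False mimics the current order of the tests
    (delimiter first); True mimics the swapped order.
    """
    fields, field, at_start = [], [], True
    for c in line:
        is_delim = c == delimiter
        is_skippable = c == " " and skipinitialspace
        if at_start:
            if space_test_first and is_skippable:
                continue              # ignore space at start of field
            if is_delim:
                fields.append("")     # save empty field
                continue
            if is_skippable:
                continue              # ignore space at start of field
            at_start = False
            field.append(c)
        elif is_delim:
            fields.append("".join(field))
            field, at_start = [], True
        else:
            field.append(c)
    fields.append("".join(field))
    return fields


current_order = split_fields("a  b")
swapped_order = split_fields("a  b", space_test_first=True)
```

With the delimiter test first, the second space in "a  b" is saved as an
empty field; with the space test first it is skipped. For any delimiter
other than a space the two orders behave the same in this model.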
The state machine for handling multi-character delimiters is not
necessarily much more complicated. Instead of switching to a new state
on the basis of a single character, the state machine would have to
introduce transitional states which iterate over the multi-character
delimiter before going to the destination state.
There would have to be some very basic backtracking which allowed the
parser state machine to indicate a false match of delimiter in the
transitional state. This would rewind the input stream (careful about
infinite loops).
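A rough Python sketch of that idea, assuming unquoted fields only:
split_multichar is a hypothetical helper, and the "matched" counter
stands in for a single transitional state. On a false match the first
matched character is kept as data and the input is rewound to the
character after it, so every rewind still consumes one character and
the loop cannot spin forever.

```python
def split_multichar(text, delimiter):
    """Split text on a multi-character delimiter, one character at a time.

    Hypothetical sketch: no quoting or escaping, delimiter must be non-empty.
    """
    fields, field = [], []
    matched = 0   # delimiter characters matched so far (0 = not matching)
    start = 0     # input position where the current partial match began
    i = 0
    while i < len(text):
        c = text[i]
        if matched == 0:
            if c == delimiter[0]:
                start, matched = i, 1       # enter the transitional state
            else:
                field.append(c)
        elif c == delimiter[matched]:
            matched += 1                    # keep iterating over the delimiter
        else:
            # False match: text[start] was data after all. Keep it and
            # rewind so scanning resumes at start + 1.
            field.append(text[start])
            i = start
            matched = 0
        if matched == len(delimiter):
            fields.append("".join(field))   # full delimiter seen: save field
            field, matched = [], 0
        i += 1
    if matched:
        # Input ended in the middle of a partial match: treat it as data.
        field.extend(text[start:start + matched])
    fields.append("".join(field))
    return fields


line = '"842" "6Feb2003" "16:22:42"'
result = split_multichar(line[1:-1], '" "')
```

For these inputs the result agrees with Python's own str.split, which
makes it easy to check the backtracking cases.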
Looking at the state machine for code which reacts to the delimiter,
we would need the following transitional states:
DELIMITER_START_FIELD
DELIMITER_ESCAPED_CHAR
DELIMITER_IN_FIELD
DELIMITER_ESCAPE_IN_QUOTED_FIELD
DELIMITER_QUOTE_IN_QUOTED_FIELD
Mind you, all of this code falls over once you decide to allow multiple
characters in the quotechar as well. What happens when
delimiter = 'DD' and quotechar = 'DQ' (where D and Q are arbitrary
characters)? You start building a partial regex engine.
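A small illustration of the overlap, taking D = '|' and Q = '"' purely
as example characters (token_matches is a made-up helper, not anything
in the module):

```python
def token_matches(text, tokens):
    """Return (position, name) for every place a token could start."""
    hits = []
    for i in range(len(text)):
        for name, tok in tokens.items():
            if text.startswith(tok, i):
                hits.append((i, name))
    return hits


tokens = {"delimiter": "||", "quotechar": '|"'}
hits = token_matches('a||"b', tokens)
```

Here the delimiter matches at position 1 and the quotechar at position
2, so both claim the second '|'. A parser that commits to the delimiter
one character at a time never sees the quotechar, and choosing between
the two readings needs lookahead or backtracking across both tokens.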
- Dave
--
http://www.object-craft.com.au