converting strings to their most efficient types: '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

Paddy paddy3118 at googlemail.com
Sun May 20 06:52:11 EDT 2007


On May 20, 2:16 am, John Machin <sjmac... at lexicon.net> wrote:
> On 19/05/2007 3:14 PM, Paddy wrote:
>
>
>
> > On May 19, 12:07 am, py_genetic <conor.robin... at gmail.com> wrote:
> >> Hello,
>
> >> I'm importing large text files of data using csv.  I would like to add
> >> some more auto-sensing abilities.  I'm considering sampling the data
> >> file and doing some fuzzy-logic scoring on the attributes (columns in
> >> a database/CSV file, e.g. height, weight, income, etc.) to determine
> >> the most efficient 'type' to convert the attribute column into for
> >> further processing and efficient storage...
>
> >> Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
> >> there', '100,000,000,000'], [next row...] ....]
>
> >> Aside from a missing attribute designator, we can assume that the same
> >> type of data continues through a column.  For example, a string, int8,
> >> int16, float, etc.
>
> >> 1. What is the most efficient way in Python to test whether a string
> >> can be converted into a given numeric type, or left alone if it's
> >> really a string like 'A' or 'hello'?  Speed is key.  Any thoughts?
>
> >> 2. Is there anything out there already which deals with this issue?
>
> >> Thanks,
> >> Conor
>
> > You might try investigating what can generate your data. With luck,
> > it could turn out that the data generator is methodical and column
> > data-types are consistent and easily determined by testing the
> > first or second row. At worst, you will get to know how much you
> > must check for human errors.
>
> Here you go, Paddy, the following has been generated very methodically;
> what data type is the first column? What is the value in the first
> column of the 6th row likely to be?
>
> "$39,082.00","$123,456.78"
> "$39,113.00","$124,218.10"
> "$39,141.00","$124,973.76"
> "$39,172.00","$125,806.92"
> "$39,202.00","$126,593.21"
>
> N.B. I've kindly given you five lines instead of one or two :-)
>
> Cheers,
> John

John,
I've had cases where some investigation of the source of the data has
completely removed any ambiguity. I've found that the data was generated
from one or two sources, and I've been able to work out every field type
just by examining a field that I had determined would tell me which
program generated the data.

I have also found that the flow generating some data is subject to
hand editing, so I have had both to put extra checks in my reader and,
on some occasions, to create specific editors that replace raw hand
edits with checked, assisted hand edits.
I stand by my statement: "Know the source of your data"; it's less
likely to bite!
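
For what it's worth, once you do trust the source, the per-column
guessing Conor asked about in question 1 is usually just a matter of
attempting the conversions in order of preference and catching the
exceptions. A rough sketch (the function names below are mine, purely
illustrative):

def convert(s):
    """Return s as an int if possible, else a float, else unchanged."""
    for converter in (int, float):
        try:
            return converter(s)
        except ValueError:
            pass
    return s

def column_type(values):
    """Guess one type for a whole column: str wins over float, and
    float wins over int, so mixed columns degrade gracefully."""
    types = set(type(convert(v)) for v in values)
    if str in types:
        return str
    if float in types:
        return float
    return int

On Conor's sample row this leaves '8' as an int, '2.33' as a float, and
'100,000,000,000' as a string (the commas stop int() and float() from
accepting it unless you strip them out first).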

- Paddy.




