From iermilov at informatik.uni-leipzig.de  Thu Dec  6 00:40:49 2012
From: iermilov at informatik.uni-leipzig.de (Ivan Ermilov)
Date: Wed, 05 Dec 2012 23:40:49 -0000
Subject: [Csv] csv sniffer - incorrect dialect identification
Message-ID: <50BFDA3D.3080106@informatik.uni-leipzig.de>

Hello everybody,

I'm currently working on converting ~9500 CSV files to RDF (a corpus 
extracted from the publicdata.eu portal), and I use the Python csv 
module to extract the header row from each CSV file.
I tried to use the sniff method as in the following example:
>         with open(self.resource_dir + self.filename, 'rU') as csvfile:
>             dialect = csv.Sniffer().sniff(csvfile.read(1024))
>             csvfile.seek(0)
>             reader = csv.reader(csvfile, dialect)
>             try:
>                 for row in reader:
>                     return row
>             except BaseException as e:
>                 print str(e)
>                 return []
But in some cases it fails to detect the comma ',' as the delimiter 
(for instance, it may pick 'i' as the delimiter, which is nonsense in 
real-world data). This is really bad, because the comma is the most 
frequently used delimiter and should be detected reliably.

If I know which delimiters are possible in my corpus, is there a way to 
tell the sniffer to choose only between them?
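Ideally I would like to write something like the following (I am only 
guessing here that sniff() can take a list of candidate delimiters):

```python
import csv

sample = 'name,age,city\r\nIvan,30,Leipzig\r\n'
# Restrict the sniffer to a known set of candidate delimiters
# instead of letting it consider arbitrary characters.
dialect = csv.Sniffer().sniff(sample, delimiters=',;\t')
```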

Kind regards,
Ivan Ermilov.

From tony at tony.gen.nz  Sat Dec 22 05:22:55 2012
From: tony at tony.gen.nz (Tony Wallace)
Date: Sat, 22 Dec 2012 17:22:55 +1300
Subject: [Csv] csv sniffer - incorrect dialect identification
In-Reply-To: <50BFDA3D.3080106@informatik.uni-leipzig.de>
References: <50BFDA3D.3080106@informatik.uni-leipzig.de>
Message-ID: <50D5359F.8040602@tony.gen.nz>

If I were importing 9500 CSV files generated as output from a single 
database I would not even try to use dialect detection.  Better to 
determine what the correct dialect is and parse it with a statically 
assigned dialect.  This dialect could be stored in your application 
metadata or assigned in code.

The reason is that when handling production quantities of data there 
are always a few records that trip up code or detection algorithms. 
Better to find out what the gotchas are and deal with them once and 
for all.

Tony