[Pandas-dev] CSV Sniffer

Tom Augspurger tom.augspurger88 at gmail.com
Tue Mar 31 16:56:54 EDT 2020


Thanks for reaching out,

Pandas keeps its list of required dependencies small, so I don't think we
could use CleverCSV by default. But we do have many optional dependencies. I
imagine that we'd be open to an optional dependency on CleverCSV and an
option for using it instead of `csv.Sniffer`, or users could provide the
sniffer as a keyword argument to read_csv.
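
To make the second idea concrete, here is a rough sketch of what "provide the
sniffer as a keyword argument" could look like. It uses only the standard
library and is not pandas code: `read_rows` and its `sniffer` and
`sample_size` parameters are hypothetical names for illustration, and
read_csv has no such keyword today. The only assumption about the sniffer is
that it has a `sniff(sample)` method returning a csv dialect, which is the
interface `csv.Sniffer` exposes.

```python
import csv
import io


def read_rows(text, sniffer=None, sample_size=1024):
    """Parse CSV text, detecting the dialect with a pluggable sniffer.

    Hypothetical helper, not part of pandas. `sniffer` can be any object
    with a `sniff(sample)` method that returns a csv dialect. It defaults
    to the standard library's csv.Sniffer.
    """
    if sniffer is None:
        sniffer = csv.Sniffer()
    # Detect the dialect from the first `sample_size` characters only.
    dialect = sniffer.sniff(text[:sample_size])
    return list(csv.reader(io.StringIO(text), dialect))


# Default behaviour: csv.Sniffer guesses the delimiter from the sample.
rows = read_rows("a;b;c\n1;2;3\n4;5;6\n")
```

A sniffer from an optional dependency could then be passed in through the
same argument, provided it follows the same `sniff(sample)` interface. The
exact API would still need to be worked out in the design discussion.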

If there aren't any objections raised on the mailing list, then I think the
next thing to do is to open an issue on GitHub with a proposed change to the
API for how users could optionally use CleverCSV, and then a pull request
after the design details have been ironed out.

Tom

On Tue, Mar 31, 2020 at 3:38 PM Gerrit van den Burg
<gvandenburg at turing.ac.uk> wrote:

> Dear pandas developers,
>
> I'm the author of CleverCSV, a drop-in replacement for the Python csv
> module with improved accuracy for dialect detection (compared to
> csv.Sniffer). CleverCSV achieves a 22% increase in accuracy over the
> Python Sniffer for non-standard csv files. See these links for more
> context:
>
> - https://towardsdatascience.com/handling-messy-csv-files-2ef829aa441d
> - https://github.com/alan-turing-institute/CleverCSV
>
> Since the pandas module uses the csv Sniffer to detect the dialect in
> the pd.read_csv function with sep=None, I was wondering whether there
> would be any interest in replacing this with CleverCSV. If so, I'd be
> happy to prepare a pull request.
>
> Kind regards,
>
> Gerrit van den Burg
>
> --
> Gerrit J.J. van den Burg
> The Alan Turing Institute
> gertjanvandenburg.com
>