[scikit-learn] Replacing the Boston Housing Prices dataset

G Reina greina at eng.ucsd.edu
Thu Jul 6 12:05:38 EDT 2017


I'd like to request that the "Boston Housing Prices" dataset in sklearn
(sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices"
dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
willing to submit the code change if the developers agree.

The Boston dataset has the feature "Bk is the proportion of blacks in
town". It is an incredibly racist "feature" to include in any dataset. I
think it is beneath us as data scientists.

I submit that the Ames dataset is a viable alternative for learning
regression. The author of the paper linked above has shown that it is a
more robust replacement for Boston. Ames is a 2011 regression dataset on
housing prices; it has more than 5 times as many training examples and
over 7 times as many features (none of which are morally questionable).

I welcome the community's thoughts on the matter.

Thanks.
-Tony

Here's an article I wrote on the Boston dataset:
https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D