<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Hi Tony.<br>
<br>
I don't think it's a good idea to remove the dataset, given how many
tutorials and examples rely on it.<br>
I also don't think it's a good idea to ignore racial discrimination,
which I guess this feature is trying to capture.<br>
<br>
    I was recently asked to remove an excerpt of a dataset from my
    slides because it was "too racist". It was randomly sampled
    data from the adult census dataset. Unfortunately, the US economy
    is not color-blind (yet), and the reality is racist.<br>
    I haven't done an in-depth analysis of whether this feature is
    actually informative, but I don't think your analysis is conclusive.<br>
<br>
    Including ethnicity in data actually allows us to ensure "fairness"
    in certain decision-making processes.<br>
    Without collecting this data, it would be impossible to ensure that
    automated decisions are not influenced
    by past human biases. Arguably that's not what the authors of this
    dataset are doing.<br>
<br>
Check out <a class="moz-txt-link-freetext" href="http://www.fatml.org/">http://www.fatml.org/</a> for more on fairness in machine
learning and data science.<br>
<br>
Cheers,<br>
Andy<br>
<br>
<br>
<div class="moz-cite-prefix">On 07/06/2017 12:05 PM, G Reina wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAEBTegQhq3x1Jgm5zBY+MQ1kjB6ryfT0X8r5-esk6NP_5WY9rQ@mail.gmail.com">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>I'd like to request that the "Boston Housing
Prices" dataset in sklearn
(sklearn.datasets.load_boston) be replaced with the
"Ames Housing Prices" dataset (<a
href="https://ww2.amstat.org/publications/jse/v19n3/decock.pdf"
moz-do-not-send="true">https://ww2.amstat.org/publications/jse/v19n3/decock.pdf</a>).
I am willing to submit the code change if the
developers agree.<br>
<br>
</div>
The Boston dataset has the feature "Bk is the proportion
of blacks in town". It is an incredibly racist "feature"
              to include in any dataset. I think it is beneath us as data
              scientists.<br>
<br>
</div>
I submit that the Ames dataset is a viable alternative for
learning regression. The author has shown that the dataset
            is a more robust replacement for Boston. Ames is a 2011
            regression dataset on housing prices and has more than 5
            times the number of training examples, with over 7 times as
            many features (none of which are morally questionable). <br>
<br>
</div>
I welcome the community's thoughts on the matter.<br>
<br>
</div>
Thanks.<br>
</div>
-Tony<br>
<br>
Here's an article I wrote on the Boston dataset:<br>
<a
href="https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D"
moz-do-not-send="true">https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D</a><br>
<br>
</div>
<br>
</blockquote>
<br>
</body>
</html>