[scikit-learn] Replacing the Boston Housing Prices dataset

Thu Jul 6 16:34:51 EDT 2017

Hi Tony

As others have pointed out, I think that you may be misunderstanding the
purpose of that "feature." We are in agreement that discrimination against
protected classes is not OK, and that even outside complying with the law
one should avoid discrimination, in model building or elsewhere. However, I
disagree that one does this by eliminating from all datasets any feature
that may allude to these protected classes. As Andreas pointed out, there
is a growing effort to ensure that machine learning models are fair and
benefit the common good (FATML, DSSG, etc.), and as I understand it, the
general consensus is that simply eliminating the feature is not
necessarily sufficient. I think we agree that naively learning a model
over a feature set containing questionable features and calling it a day
is not okay, but as others have pointed out, having these features present
and handling them appropriately can help guard against the model
implicitly learning unfair biases (even if it is not explicitly exposed
to the feature).
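
To make the last point concrete, here is a minimal sketch on purely
synthetic data (nothing here comes from the Boston dataset; the names
`group`, `proxy` and `target` are made up for illustration). The model is
fit on a correlated proxy feature only and never sees the protected
attribute, yet its predictions still differ between the two groups.
Keeping the attribute around is what makes this kind of audit possible.

```python
import random


def fit_line(x, y):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def group_gap(pred, group):
    """Mean prediction for group 1 minus mean prediction for group 0."""
    g1 = [p for p, g in zip(pred, group) if g == 1]
    g0 = [p for p, g in zip(pred, group) if g == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 2000

# Hypothetical binary protected attribute (assumption, synthetic).
group = [rng.randint(0, 1) for _ in range(n)]
# A feature that is correlated with the attribute (a "proxy").
proxy = [g + rng.gauss(0, 0.5) for g in group]
# An outcome that depends on the attribute in the historical data.
target = [2.0 * g + rng.gauss(0, 0.5) for g in group]

# Fit on the proxy alone: the protected attribute is "removed".
slope, intercept = fit_line(proxy, target)
pred = [slope * z + intercept for z in proxy]

# Audit: this is only possible because we kept `group` around.
gap = group_gap(pred, group)
print("gap in mean prediction between groups:", round(gap, 3))
```

This is only a toy: a real audit would use proper fairness metrics and a
held-out set, but the mechanism is the same.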

I would welcome the addition of the Ames dataset to the ones supported by
sklearn, but I'm not convinced that the Boston dataset should be removed.
As Andreas pointed out, there is a benefit to having canonical examples
present so that beginners can easily follow along with the many tutorials
that have been written using them. As Sean points out, the paper itself is
trying to pull out the connection between house price and clean air in the
presence of possible confounding variables. More generally, arguing that
a feature shouldn't be there because a simple linear regression is
unaffected by it is a bit odd: datasets very commonly include irrelevant
features, and handling them appropriately is important. In addition, one
could argue that having this type of issue
arise in a toy dataset has a benefit because it exposes these types of
issues to those learning data science earlier on and allows them to keep
these issues in mind in the future when the data is more serious.

It is important for us all to keep issues of fairness in mind when it comes
to data science. I'm glad that you're speaking out in favor of fairness and
trying to bring attention to it.

Jacob

On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante <sean.violante at gmail.com>
wrote:

> G Reina
> you make a bizarre argument. You argue that you should not even check
> racism as a possible factor in house prices?
>
> But then you yourself check whether it's relevant.
> Then you say
>
> "but I'd argue that it's more due to the location (near water, near
> businesses, near restaurants, near parks and recreation) than to the ethnic
> makeup"
>
> Which was basically what the original authors wanted to show too,
>
> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean
> air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
>
> but unless you measure ethnic make-up you cannot show that it is not a
> confounder.
>
> The term "white flight" refers to affluent white families moving to the
> suburbs. A clear question is how much of that was due to racism and how
> much to avoiding air pollution.
>
> On 6 Jul 2017 6:10 pm, "G Reina" <greina at eng.ucsd.edu> wrote:
>
>> I'd like to request that the "Boston Housing Prices" dataset in sklearn
>> (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices"
>> dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
>> willing to submit the code change if the developers agree.
>>
>> The Boston dataset has the feature "Bk is the proportion of blacks in
>> town". It is an incredibly racist "feature" to include in any dataset. I
>> think it is beneath us as data scientists.
>>
>> I submit that the Ames dataset is a viable alternative for learning
>> regression. The author has shown that the dataset is a more robust
>> replacement for Boston. Ames is a 2011 regression dataset on housing prices
>> and has more than 5 times the number of training examples with over 7 times
>> as many features (none of which are morally questionable).
>>
>> I welcome the community's thoughts on the matter.
>>
>> Thanks.
>> -Tony
>>
>> Here's an article I wrote on the Boston dataset:
>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>