[scikit-learn] Replacing the Boston Housing Prices dataset
Bill Ross
ross at cgl.ucsf.edu
Sun Jul 9 20:13:47 EDT 2017
Possibly of interest:
Race and ethnicity Imputation from Disease history with Deep LEarning
https://github.com/jisungk/riddle
Bill
On 7/6/17 6:00 PM, Bill Ross wrote:
> Unless the data concretely promotes discrimination, it seems
> discriminatory to exclude it.
>
> Bill
>
> On 7/6/17 5:39 PM, Sebastian Raschka wrote:
>> I think there can be some middle ground. I.e., adding a new, simple
>> dataset to demonstrate regression (maybe auto-mpg, wine quality, or sth
>> like that) and use that for the scikit-learn examples in the main
>> documentation etc but leave the boston dataset in the code base for
>> now. Whether it's a weak argument or not, it would be quite
>> destructive to remove the dataset altogether in the next version or
>> so, not only because old tutorials use it but many unit tests in many
>> different projects depend on it. I think it might be better to phase
>> it out by having a good alternative first, and I am sure that the
>> scikit-learn maintainers wouldn't have anything against it if someone
>> would update the examples/tutorials with the use of different datasets.
>>
>> Best,
>> Sebastian
>>
>>> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias <jni.soma at gmail.com>
>>> wrote:
>>>
>>> For what it's worth: I'm sympathetic to the argument that you can't
>>> fix the problem if you don't measure it, but I agree with Tony that
>>> "many tutorials use it" is an extremely weak argument. We removed
>>> Lena from scikit-image because it was the right thing to do. I very
>>> much doubt that Boston house prices is in more widespread use than
>>> Lena was in image processing.
>>>
>>> You can argue about whether or not it's morally right or wrong to
>>> include the dataset. I see merit to both arguments. But "too many
>>> tutorials use it" is very similar in flavour to "the economy of the
>>> South would collapse without slavery."
>>>
>>> Regarding fair uses of the feature, I would hope that all sklearn
>>> tutorials using the dataset mention such uses. The potential for
>>> abuse and misinterpretation is enormous.
>>>
>>> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber
>>> <jmschreiber91 at gmail.com>, wrote:
>>>> Hi Tony
>>>>
>>>> As others have pointed out, I think that you may be
>>>> misunderstanding the purpose of that "feature." We are in agreement
>>>> that discrimination against protected classes is not OK, and that
>>>> even outside complying with the law one should avoid
>>>> discrimination, in model building or elsewhere. However, I disagree
>>>> that one does this by eliminating from all datasets any feature
>>>> that may allude to these protected classes. As Andreas pointed out,
>>>> there is a growing effort to ensure that machine learning models
>>>> are fair and benefit the common good (such as FATML, DSSG, etc.),
>>>> and from my understanding the general consensus isn't necessarily
>>>> that simply eliminating the feature is sufficient. I think we are
>>>> in agreement that naively learning a model over a feature set
>>>> containing questionable features and calling it a day is not okay,
>>>> but as others have pointed out, having these features present and
>>>> handling them appropriately can help guard against the model
>>>> implicitly learning unfair biases (even if they are not explicitly
>>>> exposed to the feature).
>>>>
>>>> I would welcome the addition of the Ames dataset to the ones
>>>> supported by sklearn, but I'm not convinced that the Boston dataset
>>>> should be removed. As Andreas pointed out, there is a benefit to
>>>> having canonical examples present so that beginners can easily
>>>> follow along with the many tutorials that have been written using
>>>> them. As Sean points out, the paper itself is trying to pull out
>>>> the connection between house price and clean air in the presence of
>>>> possible confounding variables. In a more general sense, saying
>>>> that a feature shouldn't be there because a simple linear
>>>> regression is unaffected by the results is a bit odd because it is
>>>> very common for datasets to include irrelevant features, and
>>>> handling them appropriately is important. In addition, one could
>>>> argue that having this type of issue arise in a toy dataset has a
>>>> benefit because it exposes these types of issues to those learning
>>>> data science earlier on and allows them to keep these issues in
>>>> mind in the future when the data is more serious.
>>>>
>>>> It is important for us all to keep issues of fairness in mind when
>>>> it comes to data science. I'm glad that you're speaking out in
>>>> favor of fairness and trying to bring attention to it.
>>>>
>>>> Jacob
>>>>
>>>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante
>>>> <sean.violante at gmail.com> wrote:
>>>> G Reina
>>>> you make a bizarre argument. You argue that you should not even
>>>> check racism as a possible factor in house prices?
>>>>
>>>> But then you yourself check whether it's relevant.
>>>> Then you say:
>>>>
>>>> "but I'd argue that it's more due to the location (near water, near
>>>> businesses, near restaurants, near parks and recreation) than to
>>>> the ethnic makeup"
>>>>
>>>> Which was basically what the original authors wanted to show too,
>>>>
>>>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for
>>>> clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
>>>>
>>>> but unless you measure ethnic make-up you cannot show that it is
>>>> not a confounder.
>>>>
>>>> The term "white flight" refers to affluent white families moving to
>>>> the suburbs. And clearly a question is whether, and how much, this
>>>> was due to racism or to avoiding air pollution.
>>>>
>>>> On 6 Jul 2017 6:10 pm, "G Reina" <greina at eng.ucsd.edu> wrote:
>>>> I'd like to request that the "Boston Housing Prices" dataset in
>>>> sklearn (sklearn.datasets.load_boston) be replaced with the "Ames
>>>> Housing Prices" dataset
>>>> (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
>>>> willing to submit the code change if the developers agree.
>>>>
>>>> The Boston dataset has the feature "Bk is the proportion of blacks
>>>> in town". It is an incredibly racist "feature" to include in any
>>>> dataset. I think it is beneath us as data scientists.
>>>>
>>>> I submit that the Ames dataset is a viable alternative for learning
>>>> regression. The author has shown that the dataset is a more robust
>>>> replacement for Boston. Ames is a 2011 regression dataset on
>>>> housing prices and has more than 5 times the number of training
>>>> examples with over 7 times as many features (none of which are
>>>> morally questionable).
>>>>
>>>> I welcome the community's thoughts on the matter.
>>>>
>>>> Thanks.
>>>> -Tony
>>>>
>>>> Here's an article I wrote on the Boston dataset:
>>>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn