[scikit-learn] Replacing the Boston Housing Prices dataset

Bill Ross ross at cgl.ucsf.edu
Sun Jul 9 20:13:47 EDT 2017


Possibly of interest:

Race and ethnicity Imputation from Disease history with Deep LEarning

https://github.com/jisungk/riddle

Bill

On 7/6/17 6:00 PM, Bill Ross wrote:
> Unless the data concretely promotes discrimination, it seems 
> discriminatory to exclude it.
>
> Bill
>
> On 7/6/17 5:39 PM, Sebastian Raschka wrote:
>> I think there can be some middle ground, i.e., adding a new, simple
>> dataset to demonstrate regression (maybe auto-mpg, wine quality, or
>> something like that) and using that for the scikit-learn examples in
>> the main documentation etc., but leaving the Boston dataset in the
>> code base for now. Whether it's a weak argument or not, it would be
>> quite destructive to remove the dataset altogether in the next version
>> or so, not only because old tutorials use it but also because many
>> unit tests in many different projects depend on it. I think it might
>> be better to phase it out by having a good alternative first, and I am
>> sure that the scikit-learn maintainers wouldn't have anything against
>> it if someone would update the examples/tutorials to use different
>> datasets.
>>
>> Best,
>> Sebastian
>>
>>> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias <jni.soma at gmail.com> 
>>> wrote:
>>>
>>> For what it's worth: I'm sympathetic to the argument that you can't 
>>> fix the problem if you don't measure it, but I agree with Tony that 
>>> "many tutorials use it" is an extremely weak argument. We removed 
>>> Lena from scikit-image because it was the right thing to do. I very 
>>> much doubt that Boston house prices is in more widespread use than 
>>> Lena was in image processing.
>>>
>>> You can argue about whether or not it's morally right or wrong to 
>>> include the dataset. I see merit to both arguments. But "too many 
>>> tutorials use it" is very similar in flavour to "the economy of the 
>>> South would collapse without slavery."
>>>
>>> Regarding fair uses of the feature, I would hope that all sklearn 
>>> tutorials using the dataset mention such uses. The potential for 
>>> abuse and misinterpretation is enormous.
>>>
>>> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber 
>>> <jmschreiber91 at gmail.com>, wrote:
>>>> Hi Tony
>>>>
>>>> As others have pointed out, I think that you may be 
>>>> misunderstanding the purpose of that "feature." We are in agreement 
>>>> that discrimination against protected classes is not OK, and that 
>>>> even outside complying with the law one should avoid 
>>>> discrimination, in model building or elsewhere. However, I disagree 
>>>> that one does this by eliminating from all datasets any feature 
>>>> that may allude to these protected classes. As Andreas pointed out, 
>>>> there is a growing effort to ensure that machine learning models 
>>>> are fair and benefit the common good (such as FATML, DSSG, etc.), 
>>>> and from my understanding the general consensus isn't necessarily 
>>>> that simply eliminating the feature is sufficient. I think we are 
>>>> in agreement that naively learning a model over a feature set 
>>>> containing questionable features and calling it a day is not okay, 
>>>> but as others have pointed out, having these features present and 
>>>> handling them appropriately can help guard against the model 
>>>> implicitly learning unfair biases (even if they are not explicitly
>>>> exposed to the feature).
>>>>
>>>> I would welcome the addition of the Ames dataset to the ones 
>>>> supported by sklearn, but I'm not convinced that the Boston dataset 
>>>> should be removed. As Andreas pointed out, there is a benefit to 
>>>> having canonical examples present so that beginners can easily 
>>>> follow along with the many tutorials that have been written using 
>>>> them. As Sean points out, the paper itself is trying to pull out 
>>>> the connection between house price and clean air in the presence of 
>>>> possible confounding variables. In a more general sense, saying 
>>>> that a feature shouldn't be there because a simple linear 
>>>> regression is unaffected by the results is a bit odd because it is 
>>>> very common for datasets to include irrelevant features, and 
>>>> handling them appropriately is important. In addition, one could 
>>>> argue that having this type of issue arise in a toy dataset has a 
>>>> benefit because it exposes these types of issues to those learning 
>>>> data science earlier on and allows them to keep these issues in 
>>>> mind in the future when the data is more serious.
>>>>
>>>> It is important for us all to keep issues of fairness in mind when 
>>>> it comes to data science. I'm glad that you're speaking out in 
>>>> favor of fairness and trying to bring attention to it.
>>>>
>>>> Jacob
>>>>
>>>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante 
>>>> <sean.violante at gmail.com> wrote:
>>>> G Reina
>>>> you make a bizarre argument. You argue that you should not even 
>>>> check racism as a possible factor in house prices?
>>>>
>>>> But then you yourself check whether it's relevant.
>>>> Then you say
>>>>
>>>> "but I'd argue that it's more due to the location (near water, near 
>>>> businesses, near restaurants, near parks and recreation) than to 
>>>> the ethnic makeup"
>>>>
>>>> Which was basically what the original authors wanted to show too,
>>>>
>>>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for 
>>>> clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
>>>>
>>>> but unless you measure ethnic make-up you cannot show that it is 
>>>> not a confounder.
>>>>
>>>> The term "white flight" refers to affluent white families moving to 
>>>> the suburbs. And clearly a question is whether/how much was racism 
>>>> or avoiding air pollution.
>>>>
>>>> On 6 Jul 2017 6:10 pm, "G Reina" <greina at eng.ucsd.edu> wrote:
>>>> I'd like to request that the "Boston Housing Prices" dataset in 
>>>> sklearn (sklearn.datasets.load_boston) be replaced with the "Ames 
>>>> Housing Prices" dataset 
>>>> (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am 
>>>> willing to submit the code change if the developers agree.
>>>>
>>>> The Boston dataset has the feature "Bk is the proportion of blacks 
>>>> in town". It is an incredibly racist "feature" to include in any 
>>>> dataset. I think it is beneath us as data scientists.
>>>>
>>>> I submit that the Ames dataset is a viable alternative for learning 
>>>> regression. The author has shown that the dataset is a more robust 
>>>> replacement for Boston. Ames is a 2011 regression dataset on 
>>>> housing prices and has more than 5 times the amount of training 
>>>> examples with over 7 times as many features (none of which are 
>>>> morally questionable).
>>>>
>>>> I welcome the community's thoughts on the matter.
>>>>
>>>> Thanks.
>>>> -Tony
>>>>
>>>> Here's an article I wrote on the Boston dataset:
>>>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D 
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
