[scikit-learn] Replacing the Boston Housing Prices dataset

Sun Jul 9 20:53:02 EDT 2017

And, more to the point, the discussion on Reddit:

https://www.reddit.com/r/MachineLearning/comments/6m8tp0/p_deep_learning_for_estimating_race_and_ethnicity/

Bill

On 7/9/17 5:13 PM, Bill Ross wrote:
>
> Possibly of interest:
>
> Race and ethnicity Imputation from Disease history with Deep LEarning
>
> https://github.com/jisungk/riddle
>
> Bill
>
> On 7/6/17 6:00 PM, Bill Ross wrote:
>> Unless the data concretely promotes discrimination, it seems 
>> discriminatory to exclude it.
>>
>> Bill
>>
>> On 7/6/17 5:39 PM, Sebastian Raschka wrote:
>>> I think there can be some middle ground. I.e., adding a new, simple 
>>> dataset to demonstrate regression (maybe auto-mpg, wine quality, or
>>> something like that) and use that for the scikit-learn examples in the
>>> main documentation etc but leave the boston dataset in the code base 
>>> for now. Whether it's a weak argument or not, it would be quite 
>>> destructive to remove the dataset altogether in the next version or 
>>> so, not only because old tutorials use it but many unit tests in 
>>> many different projects depend on it. I think it might be better to 
>>> phase it out by having a good alternative first, and I am sure that 
>>> the scikit-learn maintainers wouldn't have anything against it if 
>>> someone would update the examples/tutorials with the use of 
>>> different datasets.
>>>
>>> Best,
>>> Sebastian
>>>
>>>> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias 
>>>> <jni.soma at gmail.com> wrote:
>>>>
>>>> For what it's worth: I'm sympathetic to the argument that you can't 
>>>> fix the problem if you don't measure it, but I agree with Tony that 
>>>> "many tutorials use it" is an extremely weak argument. We removed 
>>>> Lena from scikit-image because it was the right thing to do. I very 
>>>> much doubt that Boston house prices is in more widespread use than 
>>>> Lena was in image processing.
>>>>
>>>> You can argue about whether or not it's morally right or wrong to 
>>>> include the dataset. I see merit to both arguments. But "too many 
>>>> tutorials use it" is very similar in flavour to "the economy of the 
>>>> South would collapse without slavery."
>>>>
>>>> Regarding fair uses of the feature, I would hope that all sklearn 
>>>> tutorials using the dataset mention such uses. The potential for 
>>>> abuse and misinterpretation is enormous.
>>>>
>>>> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber 
>>>> <jmschreiber91 at gmail.com>, wrote:
>>>>> Hi Tony
>>>>>
>>>>> As others have pointed out, I think that you may be 
>>>>> misunderstanding the purpose of that "feature." We are in 
>>>>> agreement that discrimination against protected classes is not OK, 
>>>>> and that even outside complying with the law one should avoid 
>>>>> discrimination, in model building or elsewhere. However, I 
>>>>> disagree that one does this by eliminating from all datasets any 
>>>>> feature that may allude to these protected classes. As Andreas 
>>>>> pointed out, there is a growing effort to ensure that machine 
>>>>> learning models are fair and benefit the common good (such as 
>>>>> FATML, DSSG, etc.), and from my understanding the general
>>>>> consensus isn't necessarily that simply eliminating the feature is 
>>>>> sufficient. I think we are in agreement that naively learning a 
>>>>> model over a feature set containing questionable features and 
>>>>> calling it a day is not okay, but as others have pointed out, 
>>>>> having these features present and handling them appropriately can 
>>>>> help guard against the model implicitly learning unfair biases
>>>>> (even if they are not explicitly exposed to the feature).
>>>>>
>>>>> I would welcome the addition of the Ames dataset to the ones 
>>>>> supported by sklearn, but I'm not convinced that the Boston 
>>>>> dataset should be removed. As Andreas pointed out, there is a 
>>>>> benefit to having canonical examples present so that beginners can 
>>>>> easily follow along with the many tutorials that have been written 
>>>>> using them. As Sean points out, the paper itself is trying to pull 
>>>>> out the connection between house price and clean air in the 
>>>>> presence of possible confounding variables. In a more general 
>>>>> sense, saying that a feature shouldn't be there because a simple 
>>>>> linear regression is unaffected by the results is a bit odd 
>>>>> because it is very common for datasets to include irrelevant 
>>>>> features, and handling them appropriately is important. In 
>>>>> addition, one could argue that having this type of issue arise in 
>>>>> a toy dataset has a benefit because it exposes these types of 
>>>>> issues to those learning data science earlier on and allows them 
>>>>> to keep these issues in mind in the future when the data is more
>>>>> serious.
>>>>>
>>>>> It is important for us all to keep issues of fairness in mind when 
>>>>> it comes to data science. I'm glad that you're speaking out in 
>>>>> favor of fairness and trying to bring attention to it.
>>>>>
>>>>> Jacob
>>>>>
>>>>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante 
>>>>> <sean.violante at gmail.com> wrote:
>>>>> G Reina
>>>>> you make a bizarre argument. You argue that you should not even 
>>>>> check racism as a possible factor in house prices?
>>>>>
>>>>> But then you yourself check whether it's relevant.
>>>>> Then you say:
>>>>>
>>>>> "but I'd argue that it's more due to the location (near water, 
>>>>> near businesses, near restaurants, near parks and recreation) than 
>>>>> to the ethnic makeup"
>>>>>
>>>>> Which was basically what the original authors wanted to show too,
>>>>>
>>>>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand 
>>>>> for clean air', J. Environ. Economics & Management, vol.5, 81-102, 
>>>>> 1978.
>>>>>
>>>>> but unless you measure ethnic make-up you cannot show that it is
>>>>> not a confounder.
>>>>>
>>>>> The term "white flight" refers to affluent white families moving 
>>>>> to the suburbs. And clearly a question is how much of that was
>>>>> racism and how much was avoiding air pollution.
>>>>>
>>>>> On 6 Jul 2017 6:10 pm, "G Reina" <greina at eng.ucsd.edu> wrote:
>>>>> I'd like to request that the "Boston Housing Prices" dataset in 
>>>>> sklearn (sklearn.datasets.load_boston) be replaced with the "Ames 
>>>>> Housing Prices" dataset 
>>>>> (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am 
>>>>> willing to submit the code change if the developers agree.
>>>>>
>>>>> The Boston dataset has the feature "Bk is the proportion of blacks 
>>>>> in town". It is an incredibly racist "feature" to include in any 
>>>>> dataset. I think it is beneath us as data scientists.
>>>>>
>>>>> I submit that the Ames dataset is a viable alternative for 
>>>>> learning regression. The author has shown that the dataset is a 
>>>>> more robust replacement for Boston. Ames is a 2011 regression 
>>>>> dataset on housing prices and has more than 5 times the amount of 
>>>>> training examples with over 7 times as many features (none of 
>>>>> which are morally questionable).
>>>>>
>>>>> I welcome the community's thoughts on the matter.
>>>>>
>>>>> Thanks.
>>>>> -Tony
>>>>>
>>>>> Here's an article I wrote on the Boston dataset:
>>>>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D 
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>
