[scikit-learn] Replacing the Boston Housing Prices dataset
Bill Ross
ross at cgl.ucsf.edu
Sun Jul 9 20:53:02 EDT 2017
And, more to the point, the discussion on Reddit:
https://www.reddit.com/r/MachineLearning/comments/6m8tp0/p_deep_learning_for_estimating_race_and_ethnicity/
Bill
On 7/9/17 5:13 PM, Bill Ross wrote:
>
> Possibly of interest:
>
> Race and ethnicity Imputation from Disease history with Deep LEarning
>
> https://github.com/jisungk/riddle
>
> Bill
>
> On 7/6/17 6:00 PM, Bill Ross wrote:
>> Unless the data concretely promotes discrimination, it seems
>> discriminatory to exclude it.
>>
>> Bill
>>
>> On 7/6/17 5:39 PM, Sebastian Raschka wrote:
>>> I think there can be some middle ground, i.e., adding a new, simple
>>> dataset to demonstrate regression (maybe auto-mpg, wine quality, or
>>> something like that) and using that for the scikit-learn examples in
>>> the main documentation etc., but leaving the Boston dataset in the
>>> code base for now. Whether it's a weak argument or not, it would be
>>> quite destructive to remove the dataset altogether in the next
>>> version or so, not only because old tutorials use it but also because
>>> many unit tests in many different projects depend on it. I think it
>>> might be better to phase it out by having a good alternative first,
>>> and I am sure that the scikit-learn maintainers wouldn't have
>>> anything against it if someone updated the examples/tutorials to use
>>> different datasets.
>>>
>>> Best,
>>> Sebastian
>>>
>>>> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias
>>>> <jni.soma at gmail.com> wrote:
>>>>
>>>> For what it's worth: I'm sympathetic to the argument that you can't
>>>> fix the problem if you don't measure it, but I agree with Tony that
>>>> "many tutorials use it" is an extremely weak argument. We removed
>>>> Lena from scikit-image because it was the right thing to do. I very
>>>> much doubt that Boston house prices is in more widespread use than
>>>> Lena was in image processing.
>>>>
>>>> You can argue about whether or not it's morally right or wrong to
>>>> include the dataset. I see merit to both arguments. But "too many
>>>> tutorials use it" is very similar in flavour to "the economy of the
>>>> South would collapse without slavery."
>>>>
>>>> Regarding fair uses of the feature, I would hope that all sklearn
>>>> tutorials using the dataset mention such uses. The potential for
>>>> abuse and misinterpretation is enormous.
>>>>
>>>> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber
>>>> <jmschreiber91 at gmail.com>, wrote:
>>>>> Hi Tony
>>>>>
>>>>> As others have pointed out, I think that you may be
>>>>> misunderstanding the purpose of that "feature." We are in
>>>>> agreement that discrimination against protected classes is not OK,
>>>>> and that even outside complying with the law one should avoid
>>>>> discrimination, in model building or elsewhere. However, I
>>>>> disagree that one does this by eliminating from all datasets any
>>>>> feature that may allude to these protected classes. As Andreas
>>>>> pointed out, there is a growing effort to ensure that machine
>>>>> learning models are fair and benefit the common good (such as
>>>>> FATML, DSSG, etc.), and from my understanding the general
>>>>> consensus isn't necessarily that simply eliminating the feature is
>>>>> sufficient. I think we are in agreement that naively learning a
>>>>> model over a feature set containing questionable features and
>>>>> calling it a day is not okay, but as others have pointed out,
>>>>> having these features present and handling them appropriately can
>>>>> help guard against the model implicitly learning unfair biases
>>>>> (even if they are not explicitly exposed to the feature).
>>>>>
>>>>> I would welcome the addition of the Ames dataset to the ones
>>>>> supported by sklearn, but I'm not convinced that the Boston
>>>>> dataset should be removed. As Andreas pointed out, there is a
>>>>> benefit to having canonical examples present so that beginners can
>>>>> easily follow along with the many tutorials that have been written
>>>>> using them. As Sean points out, the paper itself is trying to pull
>>>>> out the connection between house price and clean air in the
>>>>> presence of possible confounding variables. In a more general
>>>>> sense, saying that a feature shouldn't be there because a simple
>>>>> linear regression is unaffected by the results is a bit odd
>>>>> because it is very common for datasets to include irrelevant
>>>>> features, and handling them appropriately is important. In
>>>>> addition, one could argue that having this type of issue arise in
>>>>> a toy dataset has a benefit because it exposes these types of
>>>>> issues to those learning data science earlier on and allows them
>>>>> to keep these issues in mind in the future when the data is more
>>>>> serious.
>>>>>
>>>>> It is important for us all to keep issues of fairness in mind when
>>>>> it comes to data science. I'm glad that you're speaking out in
>>>>> favor of fairness and trying to bring attention to it.
>>>>>
>>>>> Jacob
>>>>>
>>>>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante
>>>>> <sean.violante at gmail.com> wrote:
>>>>> G Reina,
>>>>> You make a bizarre argument. You argue that you should not even
>>>>> check racism as a possible factor in house prices?
>>>>>
>>>>> But then you yourself check whether it's relevant.
>>>>> Then you say:
>>>>>
>>>>> "but I'd argue that it's more due to the location (near water,
>>>>> near businesses, near restaurants, near parks and recreation) than
>>>>> to the ethnic makeup"
>>>>>
>>>>> Which was basically what the original authors wanted to show too:
>>>>>
>>>>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand
>>>>> for clean air', J. Environ. Economics & Management, vol.5, 81-102,
>>>>> 1978.
>>>>>
>>>>> but unless you measure ethnic make-up you cannot show that it is
>>>>> not a confounder.
>>>>>
>>>>> The term "white flight" refers to affluent white families moving
>>>>> to the suburbs. And clearly a question is whether, and how much, this
>>>>> was due to racism or to avoiding air pollution.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 6 Jul 2017 6:10 pm, "G Reina" <greina at eng.ucsd.edu> wrote:
>>>>> I'd like to request that the "Boston Housing Prices" dataset in
>>>>> sklearn (sklearn.datasets.load_boston) be replaced with the "Ames
>>>>> Housing Prices" dataset
>>>>> (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
>>>>> willing to submit the code change if the developers agree.
>>>>>
>>>>> The Boston dataset has the feature "Bk is the proportion of blacks
>>>>> in town". It is an incredibly racist "feature" to include in any
>>>>> dataset. I think it is beneath us as data scientists.
>>>>>
>>>>> I submit that the Ames dataset is a viable alternative for
>>>>> learning regression. The author has shown that the dataset is a
>>>>> more robust replacement for Boston. Ames is a 2011 regression
>>>>> dataset on housing prices and has more than 5 times the number of
>>>>> training examples with over 7 times as many features (none of
>>>>> which are morally questionable).
>>>>>
>>>>> I welcome the community's thoughts on the matter.
>>>>>
>>>>> Thanks.
>>>>> -Tony
>>>>>
>>>>> Here's an article I wrote on the Boston dataset:
>>>>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn