[scikit-learn] Replacing the Boston Housing Prices dataset
Valia Rodriguez
valia.rodriguez at gmail.com
Sat Jul 8 07:00:56 EDT 2017
Hello everybody,
I just subscribed to this list to let you know what I think about this
topic, as a black woman. My husband, who is on this list, told me
about the discussion going on, and I wanted to share my thoughts with
all of you:
There is nothing wrong or racist in counting how many black people
there are in a given population, just as it is not racist to count
how many Asian or white people there are.
First, in many epidemiologic, demographic and sociologic studies we
need to take into account - and count on the basis of - ethnicity,
skin color or race, depending on where in the world we are doing the
study and on the population we are counting. There is no other way to
address these topics than to count how many blacks, whites, Asians
and so on there are. Any teaching should simulate real conditions, so
a dataset including this is fine.
It is valid to count on the basis of skin color because if we don't,
how can we then study the distribution of wealth, or even racism itself?
Second: there is nothing wrong with the word 'black'. That word should
not raise a flag. I am black, and it is fine for me and for any other
person like me to be called black, because we are - depending on the
context, of course. Just as there is nothing wrong with being white
and being counted under 'number of whites' for a specific study. It
would be very bad, however, if the dataset said 'number of coloured
people' to refer to black people; that would be very racist.
Valia
On Sat, Jul 8, 2017 at 10:31 AM, Matthew Brett <matthew.brett at gmail.com> wrote:
>
> Forwarded conversation
> Subject: [scikit-learn] Replacing the Boston Housing Prices dataset
> ------------------------
>
> From: G Reina <greina at eng.ucsd.edu>
> Date: Thu, Jul 6, 2017 at 5:05 PM
> To: scikit-learn at python.org
>
>
> I'd like to request that the "Boston Housing Prices" dataset in sklearn
> (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices"
> dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
> willing to submit the code change if the developers agree.
>
> The Boston dataset has the feature "Bk is the proportion of blacks in town".
> It is an incredibly racist "feature" to include in any dataset. I think it is
> beneath us as data scientists.
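>
> For anyone who wants to look at the column being discussed, a minimal,
> illustrative sketch using the existing loader (nothing here beyond the
> public load_boston API):
>
>     from sklearn.datasets import load_boston
>
>     boston = load_boston()
>     print(boston.feature_names)   # includes the 'B' column under discussion
>     print(boston.DESCR)           # per-attribute description shipped with the data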
>
> I submit that the Ames dataset is a viable alternative for learning
> regression. The author has shown that the dataset is a more robust
> replacement for Boston. Ames is a 2011 regression dataset on housing prices
> and has more than 5 times as many training examples and over 7 times
> as many features (none of which are morally questionable).
>
> I welcome the community's thoughts on the matter.
>
> Thanks.
> -Tony
>
> Here's an article I wrote on the Boston dataset:
> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
>
>
>
> ----------
> From: Andreas Mueller <t3kcit at gmail.com>
> Date: Thu, Jul 6, 2017 at 5:31 PM
> To: scikit-learn at python.org
>
>
> Hi Tony.
>
> I don't think it's a good idea to remove the dataset, given how many
> tutorials and examples rely on it.
> I also don't think it's a good idea to ignore racial discrimination, which I
> guess this feature is trying to capture.
>
> I was recently asked to remove an excerpt from a dataset from my slide, as
> it was "too racist". It was randomly sampled
> data from the adult census dataset. Unfortunately, economics in the US are
> not color blind (yet), and the reality is racist.
> I haven't done an in-depth analysis on whether this feature is actually
> informative, but I don't think your analysis is conclusive.
>
> Including ethnicity in data actually allows us to ensure "fairness" in
> certain decision making processes.
> Without collecting this data, it would be impossible to ensure automatic
> decisions are not influenced
> by past human biases. Arguably that's not what the authors of this dataset
> are doing.
>
> Check out http://www.fatml.org/ for more on fairness in machine learning and
> data science.
>
> Cheers,
> Andy
>
>
> ----------
> From: G Reina <greina at eng.ucsd.edu>
> Date: Thu, Jul 6, 2017 at 5:41 PM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> Wow. I completely disagree.
>
> The fact that too many tutorials and examples rely on it is not a reason to
> keep the dataset. New tutorials are written all the time. And, as sklearn
> evolves some of the existing tutorials will need to be updated anyway to
> keep up with the changes.
>
> Including "ethnicity" is completely illegal in making business decisions in
> the United States. For example, credit scoring systems bend over backward to
> expunge even proxy features that could be highly correlated with race (for
> example, they can't include neighborhood, but can include entire counties).
>
> Let's leave the studying of racism to actual scientists who study racism.
> Not to toy datasets that we use to teach our students about a completely
> unrelated matter like regression.
>
> -Tony
>
>
>
> ----------
> From: Andrew Holmes <andrewholmes82 at icloud.com>
> Date: Thu, Jul 6, 2017 at 5:19 PM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> But how do social scientists do research into racism without including
> ethnicity as a feature in the data?
>
> Best wishes
> Andrew
>
>
>
>
> ----------
> From: jma <jeffrey.m.allard at gmail.com>
> Date: Thu, Jul 6, 2017 at 6:38 PM
> To: scikit-learn at python.org
>
>
> I work in the financial services industry and build machine learning models
> for marketing applications. We put an enormous effort (multiple layers of
> oversight and governance) into ensuring that our models are free of bias
> against protected classes etc. Having data describing race and ethnicity
> (among others) is extremely important to validate this is indeed the case.
> Without it, you have no such assurance.
>
>
>
> ----------
> From: Andreas Mueller <t3kcit at gmail.com>
> Date: Thu, Jul 6, 2017 at 7:09 PM
> To: scikit-learn at python.org
>
>
>
>
> On 07/06/2017 12:41 PM, G Reina wrote:
>>
>>
>> The fact that too many tutorials and examples rely on it is not a reason
>> to keep the dataset. New tutorials are written all the time. And, as sklearn
>> evolves some of the existing tutorials will need to be updated anyway to
>> keep up with the changes.
>
> No, we try to avoid that as much as possible.
> Old examples should work for as long as possible, and we actively avoid
> breaking API unnecessarily. It's one of the core principles of scikit-learn
> development.
>
> And new tutorials can use any dataset they choose. We are working on
> including an openml fetcher, which allows using more datasets more easily.
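>
> A rough sketch of how such a fetcher might be used once it exists (the
> function name and arguments below are hypothetical, not a current
> scikit-learn API):
>
>     from sklearn.datasets import fetch_openml  # hypothetical fetcher
>
>     # e.g. pull the Ames housing data from OpenML by its dataset name
>     ames = fetch_openml(name="house_prices")
>     X, y = ames.data, ames.target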
>
> ----------
> From: Sean Violante <sean.violante at gmail.com>
> Date: Thu, Jul 6, 2017 at 8:08 PM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> G Reina
> you make a bizarre argument. You argue that you should not even check racism
> as a possible factor in house prices?
>
> But then you yourself check whether it's relevant.
> Then you say:
>
> "but I'd argue that it's more due to the location (near water, near
> businesses, near restaurants, near parks and recreation) than to the ethnic
> makeup"
>
> Which was basically what the original authors wanted to show too,
>
> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean
> air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
>
> but unless you measure ethnic make-up you cannot show that it is not a
> confounder.
>
> The term "white flight" refers to affluent white families moving to the
> suburbs.. And clearly a question is whether/how much was racism or avoiding
> air pollution.
>
>
>
>
>
>
> ----------
> From: Jacob Schreiber <jmschreiber91 at gmail.com>
> Date: Thu, Jul 6, 2017 at 9:34 PM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> Hi Tony
>
> As others have pointed out, I think that you may be misunderstanding the
> purpose of that "feature." We are in agreement that discrimination against
> protected classes is not OK, and that even outside complying with the law
> one should avoid discrimination, in model building or elsewhere. However, I
> disagree that one does this by eliminating from all datasets any feature
> that may allude to these protected classes. As Andreas pointed out, there is
> a growing effort to ensure that machine learning models are fair and benefit
> the common good (such as FATML, DSSG, etc.), and from my understanding the
> general consensus isn't necessarily that simply eliminating the feature is
> sufficient. I think we are in agreement that naively learning a model over a
> feature set containing questionable features and calling it a day is not
> okay, but as others have pointed out, having these features present and
> handling them appropriately can help guard against the model implicitly
> learning unfair biases (even if they are not explicitly exposed to the
> feature).
>
> I would welcome the addition of the Ames dataset to the ones supported by
> sklearn, but I'm not convinced that the Boston dataset should be removed. As
> Andreas pointed out, there is a benefit to having canonical examples present
> so that beginners can easily follow along with the many tutorials that have
> been written using them. As Sean points out, the paper itself is trying to
> pull out the connection between house price and clean air in the presence of
> possible confounding variables. In a more general sense, saying that a
> feature shouldn't be there because a simple linear regression is unaffected
> by the results is a bit odd because it is very common for datasets to
> include irrelevant features, and handling them appropriately is important.
> In addition, one could argue that having this type of issue arise in a toy
> dataset has a benefit because it exposes these types of issues to those
> learning data science earlier on and allows them to keep these issues in
> mind in the future when the data is more serious.
>
> It is important for us all to keep issues of fairness in mind when it comes
> to data science. I'm glad that you're speaking out in favor of fairness and
> trying to bring attention to it.
>
> Jacob
>
>
> ----------
> From: Juan Nunez-Iglesias <jni.soma at gmail.com>
> Date: Fri, Jul 7, 2017 at 12:36 AM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> For what it's worth: I'm sympathetic to the argument that you can't fix the
> problem if you don't measure it, but I agree with Tony that "many tutorials
> use it" is an extremely weak argument. We removed Lena from scikit-image
> because it was the right thing to do. I very much doubt that Boston house
> prices is in more widespread use than Lena was in image processing.
>
> You can argue about whether or not it's morally right or wrong to include
> the dataset. I see merit to both arguments. But "too many tutorials use it"
> is very similar in flavour to "the economy of the South would collapse
> without slavery."
>
> Regarding fair uses of the feature, I would hope that all sklearn tutorials
> using the dataset mention such uses. The potential for abuse and
> misinterpretation is enormous.
>
>
> ----------
> From: Sebastian Raschka <se.raschka at gmail.com>
> Date: Fri, Jul 7, 2017 at 1:39 AM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> I think there can be some middle ground. I.e., adding a new, simple dataset
> to demonstrate regression (maybe auto-mpg, wine quality, or something like
> that) and using that for the scikit-learn examples in the main documentation
> etc., but leaving the Boston dataset in the code base for now. Whether it's
> a weak argument or not, it would be quite disruptive to remove the dataset
> altogether in the next version or so, not only because old tutorials use it
> but also because many unit tests in many different projects depend on it. I
> think it might be better to phase it out by having a good alternative first,
> and I am sure that the scikit-learn maintainers wouldn't have anything
> against it if someone updated the examples/tutorials to use different
> datasets.
>
> Best,
> Sebastian
>
> ----------
> From: Bill Ross <ross at cgl.ucsf.edu>
> Date: Fri, Jul 7, 2017 at 2:00 AM
> To: scikit-learn at python.org
>
>
> Unless the data concretely promotes discrimination, it seems discriminatory
> to exclude it.
>
> Bill
>
>
> ----------
> From: Gael Varoquaux <gael.varoquaux at normalesup.org>
> Date: Fri, Jul 7, 2017 at 6:35 AM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> Many people gave great points in this thread, in particular Jacob's well
> written email.
>
> Andy's point about tutorials is an important one. I don't resonate at
> all with Juan's message. Breaking people's code, even if it is the notes
> that they use to give a lecture, is a real cost for them. The cost varies
> on a case to case basis. But there are still books printed out there
> that demo image processing on Lena, and these will be out for decades.
> More importantly, the replacement of Lena used in scipy (the raccoon)
> does not allow one to demonstrate denoising properly (Lena has smooth regions
> with details in the middle: the eyes), or segmentation. In effect, it has
> made the examples for the ecosystem less convincing.
>
>
> Of course, by definition, refusing to change anything implies that
> unfortunate situations, such as discriminatory biases, cannot be fixed.
> This is why changes should be considered on a case-to-case basis.
>
> The problem that we are facing here is that a dataset about society, the
> Boston housing dataset, can reveal discrimination. However, this is true
> of every data about society. The classic adult data (extracted from the
> American census) easily reveals income discrimination. I teach statistics
> with an IQ dataset where it is easy to show a male vs female IQ
> difference. This difference disappears after controlling for education
> (and the purpose of my course is to teach people to control for
> confounding effects).
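>
> As an illustration of that last point, a minimal sketch of controlling for
> a confounder with ordinary least squares; the data below is synthetic and
> purely illustrative, not the IQ dataset itself:
>
>     import numpy as np
>     from sklearn.linear_model import LinearRegression
>
>     rng = np.random.RandomState(0)
>     n = 10000
>     group = rng.binomial(1, 0.5, size=n)            # two groups being compared
>     education = rng.normal(size=n) + 0.5 * group    # the groups differ in education
>     outcome = 2.0 * education + rng.normal(size=n)  # outcome depends on education only
>
>     # Naive model (outcome ~ group): shows an apparent group effect
>     naive = LinearRegression().fit(group.reshape(-1, 1), outcome)
>     # Adjusted model (outcome ~ group + education): the group effect vanishes
>     adjusted = LinearRegression().fit(np.column_stack([group, education]), outcome)
>     print(naive.coef_[0], adjusted.coef_[0])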
>
> Data about society reveals its inequalities. Not working on such data is
> hiding problems, not fixing them. It is true that misuse of such data can
> attempt to establish inequalities as facts of life and get them accepted.
> When discussing these issues, we need to educate people about how to run
> and interpret analyses.
>
>
> No, the Boston data will not go. No, it is not a good thing to pretend that
> social problems do not exist.
>
>
> Gaël
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
>
> ----------
> From: Juan Nunez-Iglesias <jni.soma at gmail.com>
> Date: Sat, Jul 8, 2017 at 3:44 AM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> Just to clarify a couple of things about my position.
>
> First, thanks Gaël for a thoughtful response. I fully respect your decision
> to keep the Boston dataset, and I agree that it can be a useful "teaching
> moment." (As I suggested in my earlier post.)
>
> With regards to breaking tutorials, however, I totally disagree. The whole
> value of tutorials is that they teach general principles, not analysis of
> specific datasets. Changing a tutorial dataset is thus different from
> changing an API. This isn't the right forum for a discussion about the
> ethics of the Lena image, so I won't go into that, but to suggest that it is
> a uniquely effective picture, the natural image equivalent of a standard
> test pattern, is ludicrous. Maybe the replacement wasn't as good, but that
> is a criticism of the choice of replacement, not of the decision to replace
> it. There clearly exist millions or billions of images with similarly good
> teaching characteristics.
>
> Finally, yes, removing and deprecating datasets incurs (and inflicts) a real
> cost, but cost should be at best a minor consideration when dealing with
> ethical questions. History, and daily life, are replete with unethical
> decisions made under the excuse that it would cost too much to do what's
> right. Ultimately the costs are usually found to have been exaggerated.
>
> With regards to this dataset, I cede the argument to maintainers,
> contributors, and users of the dataset, but I will point out that none of
> the existing tutorials in the library mention this feature, let alone
> address the ethics of it. The DESCR field mentions it entirely
> nonchalantly, like it is a natural thing to want to measure if one wants to
> predict house prices. I think I would certainly have a WTF moment, at least,
> if I were a black student reading through that description.
>
> Juan.
>
>
> ----------
> From: Jacob Schreiber <jmschreiber91 at gmail.com>
> Date: Sat, Jul 8, 2017 at 5:26 AM
> To: Scikit-learn user and developer mailing list <scikit-learn at python.org>
>
>
> We would welcome a pull request amending the documentation to include a
> neutral discussion of the issues you've brought up. Optimally, it would
> include many of the points brought up in this discussion as to why it was
> ultimately kept despite the issues being raised.
>
>
>
>
--
Valia Rodriguez, MD PhD
Neurophysiology Lecturer. School of Life and Health Sciences, Aston University
Professor of Clinical Neurophysiology, Cuban Neuroscience Center