<div dir="ltr"><div>We would welcome a pull request amending the documentation to include a neutral discussion of the issues you've raised. Ideally, it would cover many of the points made in this thread about why the dataset was ultimately kept despite those issues.<br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 7, 2017 at 7:44 PM, Juan Nunez-Iglesias <span dir="ltr">&lt;<a href="mailto:jni.soma@gmail.com" target="_blank">jni.soma@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div>

<div name="messageBodySection" style="font-size:14px;font-family:-apple-system,BlinkMacSystemFont,sans-serif">Just to clarify a couple of things about my position.

<div><br></div>

<div>First, thanks Gaël for a thoughtful response. I fully respect your decision to keep the Boston dataset, and I agree that it can be a useful "teaching moment." (As I suggested in my earlier post.)</div>

<div><br></div>

<div>With regard to breaking tutorials, however, I totally disagree. The whole value of tutorials is that they teach general principles, not analysis of specific datasets. Changing a tutorial dataset is thus different from changing an API. This isn't the right forum for a discussion about the ethics of the Lena image, so I won't go into that, but to suggest that it is a uniquely effective picture, the natural image equivalent of a standard test pattern, is ludicrous. Maybe the replacement wasn't as good, but that is a criticism of the choice of replacement, not of the decision to replace it. There clearly exist millions or billions of images with similarly good teaching characteristics.</div>

<div><br></div>

<div>Finally, yes, removing and deprecating datasets incurs (and inflicts) a real cost, but cost should be at most a minor consideration when dealing with ethical questions. History, and daily life, are replete with unethical decisions made under the excuse that it would cost too much to do what's right. Ultimately the costs are usually found to have been exaggerated.</div>

<div><br></div>

<div>With regard to this dataset, I cede the argument to maintainers, contributors, and users of the dataset, but I will point out that none of the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html" target="_blank">existing tutorials</a> in the library mention this feature, let alone address the ethics of it. The DESCR field mentions it entirely nonchalantly, as if it were a natural thing to measure when predicting house prices. I think I would certainly have a WTF moment, at least, if I were a black student reading through that description.</div><span class="HOEnZb"><font color="#888888">

<div><br></div>

<div>Juan.</div>

</font></span></div><div><div class="h5">

<div name="messageReplySection" style="font-size:14px;font-family:-apple-system,BlinkMacSystemFont,sans-serif"><br>

On 7 Jul 2017, 3:36 PM +1000, Gael Varoquaux &lt;<a href="mailto:gael.varoquaux@normalesup.org" target="_blank">gael.varoquaux@normalesup.org</a><wbr>&gt;, wrote:<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #1abc9c">Many people made great points in this thread, in particular Jacob's<br>

well-written email.<br>

<br>

Andy's point about tutorials is an important one. I don't resonate at<br>

all with Juan's message. Breaking people's code, even if it is the notes<br>

that they use to give a lecture, is a real cost for them. The cost varies<br>

on a case-by-case basis. But there are still books printed out there<br>

that demo image processing on Lena, and these will be out for decades.<br>

More importantly, the replacement for Lena used in scipy (the raccoon)<br>

does not allow denoising to be demonstrated properly (Lena has smooth regions<br>

with details in the middle: the eyes), or segmentation. In effect, it has<br>

made the examples for the ecosystem less convincing.<br>

<br>

<br>

Of course, by definition, refusing to change anything implies that<br>

unfortunate situations, such as discriminatory biases, cannot be fixed.<br>

This is why changes should be considered on a case-by-case basis.<br>

<br>

The problem that we are facing here is that a dataset about society, the<br>

Boston housing dataset, can reveal discrimination. However, this is true<br>

of all data about society. The classic Adult dataset (extracted from the<br>

American census) easily reveals income discrimination. I teach statistics<br>

with an IQ dataset where it is easy to show a male vs female IQ<br>

difference. This difference disappears after controlling for education<br>

(and the purpose of my course is to teach people to control for<br>

confounding effects).<br>

<br>

Data about society reveals its inequalities. Not working on such data is<br>

hiding problems, not fixing them. It is true that misuse of such data can<br>

attempt to establish inequalities as facts of life and get them accepted.<br>

When discussing these issues, we need to educate people about how to run<br>

and interpret analyses.<br>

<br>

<br>

No, the Boston data will not go. No, it is not a good thing to pretend that<br>

social problems do not exist.<br>

<br>

<br>

Gaël<br>

<br>

On Fri, Jul 07, 2017 at 09:36:41AM +1000, Juan Nunez-Iglesias wrote:<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">For what it's worth: I'm sympathetic to the argument that you can't fix the<br>

problem if you don't measure it, but I agree with Tony that "many tutorials use<br>

it" is an extremely weak argument. We removed Lena from scikit-image because it<br>

was the right thing to do. I very much doubt that Boston house prices is in<br>

more widespread use than Lena was in image processing.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">You can argue about whether or not it's morally right or wrong to include the<br>

dataset. I see merit to both arguments. But "too many tutorials use it" is very<br>

similar in flavour to "the economy of the South would collapse without<br>

slavery."<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">Regarding fair uses of the feature, I would hope that all sklearn tutorials<br>

using the dataset mention such uses. The potential for abuse and<br>

misinterpretation is enormous.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber &lt;<a href="mailto:jmschreiber91@gmail.com" target="_blank">jmschreiber91@gmail.com</a>&gt;, wrote:<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">Hi Tony<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">As others have pointed out, I think that you may be misunderstanding the<br>

purpose of that "feature." We are in agreement that discrimination against<br>

protected classes is not OK, and that even outside complying with the law<br>

one should avoid discrimination, in model building or elsewhere. However, I<br>

disagree that one does this by eliminating from all datasets any feature<br>

that may allude to these protected classes. As Andreas pointed out, there<br>

is a growing effort to ensure that machine learning models are fair and<br>

benefit the common good (such as FATML, DSSG, etc.), and from my<br>

understanding the general consensus isn't necessarily that simply<br>

eliminating the feature is sufficient. I think we are in agreement that<br>

naively learning a model over a feature set containing questionable<br>

features and calling it a day is not okay, but as others have pointed out,<br>

having these features present and handling them appropriately can help<br>

guard against the model implicitly learning unfair biases (even if they are<br>

not explicitly exposed to the feature).<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">I would welcome the addition of the Ames dataset to the ones supported by<br>

sklearn, but I'm not convinced that the Boston dataset should be removed.<br>

As Andreas pointed out, there is a benefit to having canonical examples<br>

present so that beginners can easily follow along with the many tutorials<br>

that have been written using them. As Sean points out, the paper itself is<br>

trying to pull out the connection between house price and clean air in the<br>

presence of possible confounding variables. In a more general sense, saying<br>

that a feature shouldn't be there because a simple linear regression is<br>

unaffected by the results is a bit odd because it is very common for<br>

datasets to include irrelevant features, and handling them appropriately is<br>

important. In addition, one could argue that having this type of issue<br>

arise in a toy dataset has a benefit because it exposes these types of<br>

issues to those learning data science earlier on and allows them to keep<br>

these issues in mind in the future when the data is more serious.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">It is important for us all to keep issues of fairness in mind when it comes<br>

to data science. I'm glad that you're speaking out in favor of fairness and<br>

trying to bring attention to it.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">Jacob<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante &lt;<a href="mailto:sean.violante@gmail.com" target="_blank">sean.violante@gmail.com</a>&gt;<br>

wrote:<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">G Reina<br>

you make a bizarre argument. You argue that you should not even check<br>

racism as a possible factor in house prices?<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">But then you yourself check whether it's relevant<br>

Then you say<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">"but I'd argue that it's more due to the location (near water, near<br>

businesses, near restaurants, near parks and recreation) than to the<br>

ethnic makeup"<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">Which was basically what the original authors wanted to show too,<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for<br>

clean air', J. Environ. Economics &amp; Management, vol.5, 81-102, 1978.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">but unless you measure ethnic make-up you cannot show that it is not a<br>

confounder.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">The term "white flight" refers to affluent white families moving to the<br>

suburbs. And clearly a question is whether/how much was racism or<br>

avoiding air pollution.<br></blockquote>

<br>

<br>

<br>

<br>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">On 6 Jul 2017 6:10 pm, "G Reina" &lt;<a href="mailto:greina@eng.ucsd.edu" target="_blank">greina@eng.ucsd.edu</a>&gt; wrote:<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">I'd like to request that the "Boston Housing Prices" dataset in<br>

sklearn (sklearn.datasets.load_boston) be replaced with the "Ames<br>

Housing Prices" dataset (<a href="https://ww2.amstat.org/publications/jse/v19n3/decock.pdf" target="_blank">https://ww2.amstat.org/<wbr>publications/jse/v19n3/<wbr>decock.pdf</a>).<br>

I am willing to submit the code change if the<br>

developers agree.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">The Boston dataset has the feature "Bk is the proportion of blacks<br>

in town". It is an incredibly racist "feature" to include in any<br>

dataset. I think it is beneath us as data scientists.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">I submit that the Ames dataset is a viable alternative for learning<br>

regression. The author has shown that the dataset is a more robust<br>

replacement for Boston. Ames is a 2011 regression dataset on<br>

housing prices and has more than 5 times the number of training<br>

examples with over 7 times as many features (none of which are<br>

morally questionable).<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">I welcome the community's thoughts on the matter.<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">Thanks.<br>

-Tony<br></blockquote>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">Here's an article I wrote on the Boston dataset:<br>

<a href="https://www.linkedin.com/pulse/hidden-racism-data-science-g-" target="_blank">https://www.linkedin.com/<wbr>pulse/hidden-racism-data-<wbr>science-g-</a><br>

anthony-reina?trk=v-feed&lipi=<wbr>urn%3Ali%3Apage%3Ad_flagship3_<br>

feed%3Bmu67f2GSzj5xHMpSD6M00A%<wbr>3D%3D<br></blockquote>

<br>

<br>

<blockquote type="cite" style="margin:5px 5px;padding-left:10px;border-left:thin solid #e67e22">______________________________<wbr>_________________<br>

scikit-learn mailing list<br>

<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-learn" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/scikit-learn</a><br></blockquote>

<br>

<br>

<br>


<br>

<br>

<br>


<br>

<br>


<br>

<br>

--<br>

Gael Varoquaux<br>

Researcher, INRIA Parietal<br>

NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France<br>

Phone: <a href="tel:+33%201%2069%2008%2079%2068" value="+33169087968" target="_blank">++ 33-1-69-08-79-68</a><br>

<a href="http://gael-varoquaux.info" target="_blank">http://gael-varoquaux.info</a> <a href="http://twitter.com/GaelVaroquaux" target="_blank">http://twitter.com/<wbr>GaelVaroquaux</a><br>

</blockquote>

</div>

</div></div></div>


<br></blockquote></div><br></div>