<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Possibly of interest:</p>
    <p>Race and ethnicity Imputation from Disease history with Deep
      LEarning</p>
    <p><a class="moz-txt-link-freetext" href="https://github.com/jisungk/riddle">https://github.com/jisungk/riddle</a><br>
    </p>
    Bill<br>
    <br>
    <div class="moz-cite-prefix">On 7/6/17 6:00 PM, Bill Ross wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu">Unless
      the data concretely promotes discrimination, it seems
      discriminatory to exclude it.
      <br>
      <br>
      Bill
      <br>
      <br>
      On 7/6/17 5:39 PM, Sebastian Raschka wrote:
      <br>
      <blockquote type="cite">I think there can be some middle ground.
        For example, we could add a new, simple dataset to demonstrate
        regression (maybe auto-mpg, wine quality, or something like that)
        and use it for the scikit-learn examples in the main documentation,
        but leave the Boston dataset in the code base for now. Whether it's
        a weak argument or not, it would be quite destructive to remove
        the dataset altogether in the next version or so, not only
        because old tutorials use it but many unit tests in many
        different projects depend on it. I think it might be better to
        phase it out by having a good alternative first, and I am sure
        that the scikit-learn maintainers wouldn't have anything against
        it if someone updated the examples/tutorials to use
        different datasets.
        <br>
        <br>
        Best,
        <br>
        Sebastian
        <br>
        <br>
        <blockquote type="cite">On Jul 6, 2017, at 7:36 PM, Juan
          Nunez-Iglesias <a class="moz-txt-link-rfc2396E" href="mailto:jni.soma@gmail.com">&lt;jni.soma@gmail.com&gt;</a> wrote:
          <br>
          <br>
          For what it's worth: I'm sympathetic to the argument that you
          can't fix the problem if you don't measure it, but I agree
          with Tony that "many tutorials use it" is an extremely weak
          argument. We removed Lena from scikit-image because it was the
          right thing to do. I very much doubt that Boston house prices
          is in more widespread use than Lena was in image processing.
          <br>
          <br>
          You can argue about whether or not it's morally right or wrong
          to include the dataset. I see merit to both arguments. But
          "too many tutorials use it" is very similar in flavour to "the
          economy of the South would collapse without slavery."
          <br>
          <br>
          Regarding fair uses of the feature, I would hope that all
          sklearn tutorials using the dataset mention such uses. The
          potential for abuse and misinterpretation is enormous.
          <br>
          <br>
          On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber
          <a class="moz-txt-link-rfc2396E" href="mailto:jmschreiber91@gmail.com">&lt;jmschreiber91@gmail.com&gt;</a>, wrote:
          <br>
          <blockquote type="cite">Hi Tony
            <br>
            <br>
            As others have pointed out, I think that you may be
            misunderstanding the purpose of that "feature." We are in
            agreement that discrimination against protected classes is
            not OK, and that even outside complying with the law one
            should avoid discrimination, in model building or elsewhere.
            However, I disagree that one does this by eliminating from
            all datasets any feature that may allude to these protected
            classes. As Andreas pointed out, there is a growing effort
            to ensure that machine learning models are fair and benefit
            the common good (such as FATML, DSSG, etc.), and from my
            understanding the general consensus isn't necessarily that
            simply eliminating the feature is sufficient. I think we are
            in agreement that naively learning a model over a feature
            set containing questionable features and calling it a day is
            not okay, but as others have pointed out, having these
            features present and handling them appropriately can help
            guard against the model implicitly learning unfair biases
            (even if they are not explicitly exposed to the feature).
            <br>
          </blockquote>
        </blockquote>
      </blockquote>
      <blockquote type="cite">
        <blockquote type="cite">
          <blockquote type="cite">I would welcome the addition of the
            Ames dataset to the ones supported by sklearn, but I'm not
            convinced that the Boston dataset should be removed. As
            Andreas pointed out, there is a benefit to having canonical
            examples present so that beginners can easily follow along
            with the many tutorials that have been written using them.
            As Sean points out, the paper itself is trying to pull out
            the connection between house price and clean air in the
            presence of possible confounding variables. In a more
            general sense, saying that a feature shouldn't be there
            because a simple linear regression is unaffected by the
            results is a bit odd because it is very common for datasets
            to include irrelevant features, and handling them
            appropriately is important. In addition, one could argue
            that having this type of issue arise in a toy dataset has a
            benefit because it exposes these types of issues to those
            learning data science earlier on and allows them to keep
            these issues in mind in the future when the data is more
            serious.
            <br>
          </blockquote>
        </blockquote>
      </blockquote>
      <blockquote type="cite">
        <blockquote type="cite">
          <blockquote type="cite">It is important for us all to keep
            issues of fairness in mind when it comes to data science.
            I'm glad that you're speaking out in favor of fairness and
            trying to bring attention to it.
            <br>
            <br>
            Jacob
            <br>
            <br>
            On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante
            <a class="moz-txt-link-rfc2396E" href="mailto:sean.violante@gmail.com">&lt;sean.violante@gmail.com&gt;</a> wrote:
            <br>
            G Reina
            <br>
            You make a bizarre argument. You argue that you should not
            even check racism as a possible factor in house prices?
            <br>
            <br>
            But then you yourself check whether it's relevant.
            <br>
            Then you say
            <br>
            <br>
            "but I'd argue that it's more due to the location (near
            water, near businesses, near restaurants, near parks and
            recreation) than to the ethnic makeup"
            <br>
            <br>
            Which was basically what the original authors wanted to
            show too,
            <br>
            <br>
            Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the
            demand for clean air', J. Environ. Economics &amp;
            Management, vol.5, 81-102, 1978.
            <br>
            <br>
            but unless you measure ethnic make-up you cannot show that
            it is not a confounder.
            <br>
            <br>
            The term "white flight" refers to affluent white families
            moving to the suburbs. And clearly a question is
            whether/how much of it was racism or avoiding air pollution.
            <br>
            <br>
            On 6 Jul 2017 6:10 pm, "G Reina" <a class="moz-txt-link-rfc2396E" href="mailto:greina@eng.ucsd.edu">&lt;greina@eng.ucsd.edu&gt;</a>
            wrote:
            <br>
            I'd like to request that the "Boston Housing Prices" dataset
            in sklearn (sklearn.datasets.load_boston) be replaced with
            the "Ames Housing Prices" dataset
            (<a class="moz-txt-link-freetext" href="https://ww2.amstat.org/publications/jse/v19n3/decock.pdf">https://ww2.amstat.org/publications/jse/v19n3/decock.pdf</a>).
            I am willing to submit the code change if the developers
            agree.
            <br>
            <br>
            The Boston dataset has the feature "Bk is the proportion of
            blacks in town". It is an incredibly racist "feature" to
            include in any dataset. I think it is beneath us as data
            scientists.
            <br>
            <br>
            I submit that the Ames dataset is a viable alternative for
            learning regression. The author has shown that the dataset
            is a more robust replacement for Boston. Ames is a 2011
            regression dataset on housing prices and has more than 5
            times the number of training examples and over 7 times as
            many features (none of which are morally questionable).
            <br>
            <br>
            I welcome the community's thoughts on the matter.
            <br>
            <br>
            Thanks.
            <br>
            -Tony
            <br>
            <br>
            Here's an article I wrote on the Boston dataset:
            <br>
<a class="moz-txt-link-freetext" href="https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D">https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D</a>
            <br>
            <br>
            <br>
            _______________________________________________
            <br>
            scikit-learn mailing list
            <br>
            <a class="moz-txt-link-abbreviated" href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>
            <br>
            <a class="moz-txt-link-freetext" href="https://mail.python.org/mailman/listinfo/scikit-learn">https://mail.python.org/mailman/listinfo/scikit-learn</a>
          </blockquote>
        </blockquote>
      </blockquote>
    </blockquote>
    <br>
  </body>
</html>