<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>And, more to the point, the discussion on Reddit:</p>

    <p>

<a class="moz-txt-link-freetext" href="https://www.reddit.com/r/MachineLearning/comments/6m8tp0/p_deep_learning_for_estimating_race_and_ethnicity/">https://www.reddit.com/r/MachineLearning/comments/6m8tp0/p_deep_learning_for_estimating_race_and_ethnicity/</a><br>

    </p>

    Bill<br>

    <br>

    <div class="moz-cite-prefix">On 7/9/17 5:13 PM, Bill Ross wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:65531062-c9d6-ce7f-b712-0a1abd3cd935@cgl.ucsf.edu">


      <p>Possibly of interest:</p>

      <p>Race and ethnicity Imputation from Disease history with Deep
        LEarning</p>

      <p><a class="moz-txt-link-freetext"

          href="https://github.com/jisungk/riddle"

          moz-do-not-send="true">https://github.com/jisungk/riddle</a><br>

      </p>

      Bill<br>

      <br>

      <div class="moz-cite-prefix">On 7/6/17 6:00 PM, Bill Ross wrote:<br>

      </div>

      <blockquote type="cite"

        cite="mid:32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu">Unless

        the data concretely promotes discrimination, it seems

        discriminatory to exclude it. <br>

        <br>

        Bill <br>

        <br>

        On 7/6/17 5:39 PM, Sebastian Raschka wrote: <br>

        <blockquote type="cite">I think there can be some middle ground.

          For example, we could add a new, simple dataset to demonstrate
          regression (maybe auto-mpg, wine quality, or something like
          that) and use it for the scikit-learn examples in the main
          documentation, but leave the Boston dataset in the code base
          for now. Whether it's a weak argument or not, it would be quite
          destructive to remove the dataset altogether in the next
          version or so, not only because old tutorials use it but also
          because many unit tests in many different projects depend on
          it. I think it would be better to phase it out by having a good
          alternative first, and I am sure that the scikit-learn
          maintainers wouldn't object if someone updated the
          examples/tutorials to use different datasets. <br>

          <br>

          Best, <br>

          Sebastian <br>

          <br>

          <blockquote type="cite">On Jul 6, 2017, at 7:36 PM, Juan

            Nunez-Iglesias <a class="moz-txt-link-rfc2396E"

              href="mailto:jni.soma@gmail.com" moz-do-not-send="true">&lt;jni.soma@gmail.com&gt;</a>

            wrote: <br>

            <br>

            For what it's worth: I'm sympathetic to the argument that

            you can't fix the problem if you don't measure it, but I

            agree with Tony that "many tutorials use it" is an extremely

            weak argument. We removed Lena from scikit-image because it

            was the right thing to do. I very much doubt that Boston

            house prices is in more widespread use than Lena was in

            image processing. <br>

            <br>

            You can argue about whether or not it's morally right or

            wrong to include the dataset. I see merit to both arguments.

            But "too many tutorials use it" is very similar in flavour

            to "the economy of the South would collapse without

            slavery." <br>

            <br>

            Regarding fair uses of the feature, I would hope that all

            sklearn tutorials using the dataset mention such uses. The

            potential for abuse and misinterpretation is enormous. <br>

            <br>

            On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber <a

              class="moz-txt-link-rfc2396E"

              href="mailto:jmschreiber91@gmail.com"

              moz-do-not-send="true">&lt;jmschreiber91@gmail.com&gt;</a>

            wrote: <br>

            <blockquote type="cite">Hi Tony <br>

              <br>

              As others have pointed out, I think that you may be

              misunderstanding the purpose of that "feature." We are in

              agreement that discrimination against protected classes is

              not OK, and that even outside complying with the law one

              should avoid discrimination, in model building or

              elsewhere. However, I disagree that one does this by

              eliminating from all datasets any feature that may allude

              to these protected classes. As Andreas pointed out, there

              is a growing effort to ensure that machine learning models

              are fair and benefit the common good (such as FATML, DSSG,

              etc.), and from my understanding the general consensus

              isn't necessarily that simply eliminating the feature is

              sufficient. I think we are in agreement that naively

              learning a model over a feature set containing

              questionable features and calling it a day is not okay,

              but as others have pointed out, having these features

              present and handling them appropriately can help guard

              against the model implicitly learning unfair biases (even
              if they are not explicitly exposed to the feature). <br>
              <br>
              I would welcome the addition of the

              Ames dataset to the ones supported by sklearn, but I'm not

              convinced that the Boston dataset should be removed. As

              Andreas pointed out, there is a benefit to having

              canonical examples present so that beginners can easily

              follow along with the many tutorials that have been

              written using them. As Sean points out, the paper itself

              is trying to pull out the connection between house price

              and clean air in the presence of possible confounding

              variables. In a more general sense, saying that a feature

              shouldn't be there because a simple linear regression is

              unaffected by the results is a bit odd because it is very

              common for datasets to include irrelevant features, and

              handling them appropriately is important. In addition, one

              could argue that having this type of issue arise in a toy

              dataset has a benefit because it exposes these types of

              issues to those learning data science earlier on and

              allows them to keep these issues in mind in the future
              when the data is more serious. <br>
              <br>
              It is important for us all to keep

              issues of fairness in mind when it comes to data science.

              I'm glad that you're speaking out in favor of fairness and

              trying to bring attention to it. <br>

              <br>

              Jacob <br>

              <br>

              On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante <a

                class="moz-txt-link-rfc2396E"

                href="mailto:sean.violante@gmail.com"

                moz-do-not-send="true">&lt;sean.violante@gmail.com&gt;</a>

              wrote: <br>

              G Reina, <br>
              you make a bizarre argument. You argue that you should not
              even check racism as a possible factor in house prices? <br>
              <br>
              But then you yourself check whether it's relevant. <br>
              Then you say: <br>
              <br>
              "but I'd argue that it's more due to the location (near
              water, near businesses, near restaurants, near parks and
              recreation) than to the ethnic makeup" <br>
              <br>
              Which was basically what the original authors wanted to
              show too: <br>
              <br>
              Harrison, D. and Rubinfeld, D.L. "Hedonic prices and the
              demand for clean air", J. Environ. Economics &amp;
              Management, vol. 5, 81-102, 1978. <br>
              <br>
              But unless you measure ethnic make-up you cannot show
              that it is not a confounder. <br>
              <br>
              The term "white flight" refers to affluent white families
              moving to the suburbs. And clearly a question is whether,
              and how much, that was due to racism or to avoiding air
              pollution. <br>

              <br>

              <br>

              <br>

              <br>

              <br>

              On 6 Jul 2017 6:10 pm, "G Reina" <a

                class="moz-txt-link-rfc2396E"

                href="mailto:greina@eng.ucsd.edu" moz-do-not-send="true">&lt;greina@eng.ucsd.edu&gt;</a>

              wrote: <br>

              I'd like to request that the "Boston Housing Prices"

              dataset in sklearn (sklearn.datasets.load_boston) be

              replaced with the "Ames Housing Prices" dataset (<a

                class="moz-txt-link-freetext"

                href="https://ww2.amstat.org/publications/jse/v19n3/decock.pdf"

                moz-do-not-send="true">https://ww2.amstat.org/publications/jse/v19n3/decock.pdf</a>).

              I am willing to submit the code change if the developers

              agree. <br>

              <br>

              The Boston dataset has the feature "Bk is the proportion

              of blacks in town". It is an incredibly racist "feature"

              to include in any dataset. I think it is beneath us as data

              scientists. <br>

              <br>

              I submit that the Ames dataset is a viable alternative for

              learning regression. The author has shown that the dataset

              is a more robust replacement for Boston. Ames is a 2011

              regression dataset on housing prices and has more than 5
              times the number of training examples and over 7 times as

              many features (none of which are morally questionable). <br>

              <br>

              I welcome the community's thoughts on the matter. <br>

              <br>

              Thanks. <br>

              -Tony <br>

              <br>

              Here's an article I wrote on the Boston dataset: <br>

              <a class="moz-txt-link-freetext"

href="https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D"

                moz-do-not-send="true">https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D</a>

              <br>

              <br>

              <br>

              _______________________________________________ <br>

              scikit-learn mailing list <br>

              <a class="moz-txt-link-abbreviated"

                href="mailto:scikit-learn@python.org"

                moz-do-not-send="true">scikit-learn@python.org</a> <br>

              <a class="moz-txt-link-freetext"

                href="https://mail.python.org/mailman/listinfo/scikit-learn"

                moz-do-not-send="true">https://mail.python.org/mailman/listinfo/scikit-learn</a>

              <br>


            </blockquote>


          </blockquote>


        </blockquote>


      </blockquote>

      <br>

    </blockquote>

    <br>

  </body>

</html>