Abstract: Artificial intelligence (AI) is automated decision-making, and it builds on quantitative methods which have been pervasive in our society for at least a hundred years. This essay reviews the historical record of quantitative and automated decision-making in three areas of our lives: access to consumer financial credit, sentencing and parole guidelines, and college admissions. In all cases, so-called “scientific” or “empirical” approaches have been in use for decades or longer. Only in recent years have we as a society recognized that these “objective” approaches reinforce and perpetuate injustices from the past into the future. Use of AI poses new challenges, but we now have new cultural and technical tools to combat old ways of thinking.
Introduction
Recently, concerns about the use of Artificial Intelligence (AI) have taken center stage. Many are worried about the impact of AI on our society.
AI is the subject of much science fiction and fantasy, but simply put, AI is automated decision-making. A bunch of inputs go into an AI system, and the AI algorithm declares an answer, judgment, or result.
This seems new, but quantitative and automated decision-making has been part of our culture for a long time: 100 years or more. While it may seem surprising now, the original intent in many cases was to eliminate human bias and create opportunities for disenfranchised groups. Only recently are we recognizing that these “objective” and “scientific” methods in fact reinforce the structural barriers that underrepresented groups face.
This essay reviews our history in three areas in which automated decision-making has been pervasive for many years: decisions for awarding consumer credit, recommendations for sentencing or parole in criminal cases, and college admissions decisions.
Consumer credit
The Equal Credit Opportunity Act, passed by the U.S. Congress in 1974, made it unlawful for any creditor to discriminate against any applicant on the basis of “race, color, religion, national origin, sex, marital status, or age” (ECOA 1974).
As described by Capon (1982), “The federal legislation was directed largely at abuses in judgmental methods of granting credit. However, at that time judgmental methods that involved the exercise of individual judgment by a credit officer on a case-by-case basis were increasingly being replaced by a new methodology, credit scoring.”
As recounted by Capon, credit scoring systems were first introduced in the 1930s to extend credit to customers as part of the burgeoning mail order industry. With the availability of computers in the 1960s, these quantitative approaches accelerated. The “credit scoring systems” used anywhere from 50 to 300 “predictor characteristics,” including features such as the applicant’s zip code of residence, status as a homeowner or renter, length of time at present address, occupation, and duration of employment. The features were processed using state-of-the-art statistical techniques to optimize their predictive power, and make go/no-go decisions on offering credit.
As Capon explains, in the years immediately after passage of the ECOA, creditors successfully argued to Congress that “adherence to the law would be improved” if these credit scoring systems were used. They contended that “credit decisions in judgmental systems were subject to arbitrary and capricious decisions” whereas decisions made with a credit scoring system were “objective and free from such problems.”
As a result, Congress amended the law with “Regulation B,” which allowed the use of credit scoring systems on the condition that they were “statistically sound and empirically derived.”
This endorsed companies’ existing use of actuarial practices to indicate which predictor characteristics had predictive power in determining credit risk. Per Capon: “For example, although age is a proscribed characteristic under the Act, if the system is statistically sound and empirically derived, it can be used as a predictive characteristic.” Similarly, zip code, a strong proxy for race and ethnicity, could also be used in credit scoring systems.
In essence, the law of the United States ratified the use of credit scoring algorithms that discriminated, so long as the algorithms were “empirically derived and statistically sound,” subverting the original intent of the 1974 ECOA. You can read the details yourself; it does actually say this (ECOA Regulation B, Part 1002, 1977).
Of course, denying credit, or offering only expensive credit, to groups that historically have had trouble obtaining credit is a sure way to propagate the past into the future.
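To make the proxy effect concrete, here is a minimal sketch of a point-based credit score. Everything in it is an assumption made for illustration: the zip codes, the point values, the cutoff, and the eight applicants are synthetic and do not come from Capon or from any real scoring system. The score never reads the applicant's group, yet because zip code is correlated with group in this toy data, approval rates differ by group.

```python
from collections import defaultdict

# Hypothetical point values and cutoff; not taken from any real scorecard.
ZIP_POINTS = {"11111": 30, "22222": 0}
HOMEOWNER_POINTS = 20
POINTS_PER_YEAR_EMPLOYED = 5  # counted for at most 5 years
CUTOFF = 45


def credit_score(applicant):
    """Add up the points for each predictor characteristic."""
    score = ZIP_POINTS[applicant["zip"]]
    if applicant["homeowner"]:
        score += HOMEOWNER_POINTS
    score += POINTS_PER_YEAR_EMPLOYED * min(applicant["years_employed"], 5)
    return score


def decide(applicant):
    """Go/no-go decision: approve when the score reaches the cutoff."""
    return credit_score(applicant) >= CUTOFF


# Synthetic applicants. "group" is kept only so outcomes can be audited;
# credit_score never looks at it.
APPLICANTS = [
    {"group": "A", "zip": "11111", "homeowner": True, "years_employed": 2},
    {"group": "A", "zip": "11111", "homeowner": False, "years_employed": 4},
    {"group": "A", "zip": "11111", "homeowner": False, "years_employed": 1},
    {"group": "A", "zip": "22222", "homeowner": True, "years_employed": 5},
    {"group": "B", "zip": "22222", "homeowner": True, "years_employed": 2},
    {"group": "B", "zip": "22222", "homeowner": False, "years_employed": 4},
    {"group": "B", "zip": "22222", "homeowner": False, "years_employed": 1},
    {"group": "B", "zip": "11111", "homeowner": True, "years_employed": 5},
]


def approval_rate_by_group(applicants):
    """Share of applicants approved within each audit group."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for applicant in applicants:
        total[applicant["group"]] += 1
        approved[applicant["group"]] += decide(applicant)
    return {group: approved[group] / total[group] for group in total}


print(approval_rate_by_group(APPLICANTS))
```

In this toy data, group A is approved 75% of the time and group B 25% of the time. The first applicant in each group has the same homeownership status and employment history; only the zip code differs, and that alone moves the decision from approve to deny.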
Recommendations for sentencing and parole
In a deeply troubling, in-depth analysis, ProPublica, an investigative research organization, showed that a commercial, proprietary software system used to make parole recommendations to judges about persons who have been arrested is biased (Angwin et al., 2016).
As ProPublica reported, even though a person’s race/ethnicity is not part of the inputs provided to the software, the commercial software (called COMPAS, as part of the Northpointe suite) is more likely to predict a high risk of recidivism for black people. In a less well-publicized finding, their work also found that COMPAS was more likely to over-predict recidivism for women than men.
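One way to make “over-predicting recidivism” concrete is to compare false positive rates across groups: among people who did not go on to reoffend, what share had been labeled high risk? The sketch below shows only that calculation. The records are synthetic, the group labels X and Y are placeholders, and the function name is hypothetical; it is not ProPublica's code or data, and their published analysis is considerably more detailed.

```python
def false_positive_rates(records):
    """Per-group share of non-reoffenders who were labeled high risk.

    records: iterable of (group, predicted_high_risk, reoffended) tuples.
    """
    false_positives = {}
    non_reoffenders = {}
    for group, predicted_high_risk, reoffended in records:
        if reoffended:
            continue  # false positives are defined only for non-reoffenders
        non_reoffenders[group] = non_reoffenders.get(group, 0) + 1
        if predicted_high_risk:
            false_positives[group] = false_positives.get(group, 0) + 1
    return {
        group: false_positives.get(group, 0) / count
        for group, count in non_reoffenders.items()
    }


# Synthetic records: (group, predicted_high_risk, reoffended).
RECORDS = [
    ("X", True, False), ("X", True, False), ("X", False, False),
    ("X", False, False), ("X", True, True), ("X", False, True),
    ("Y", True, False), ("Y", False, False), ("Y", False, False),
    ("Y", False, False), ("Y", True, True), ("Y", False, True),
]

print(false_positive_rates(RECORDS))
```

In this toy data, half of the non-reoffenders in group X were labeled high risk, compared with a quarter in group Y. The audit needs the group label even though the scoring system itself never received it.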
What was not evident in the press surrounding ProPublica’s work is that the US has been using standardized algorithms to make predictions on recidivism for nearly a century. According to Frank (1970), an early and classic work is a 1931 study by G. B. Vold, which “isolated those factors whose presence or absence defined a group of releasees with a high (or low) recidivism rate.”
Contemporary instruments include the Post Conviction Risk Assessment, which is “a scientifically based instrument developed by the Administrative Office of the U.S. Courts to improve the effectiveness and efficiency of post-conviction supervision” (PCRA, 2018); the Level of Service (LS) scales, which “have become the most frequently used risk assessment tools on the planet” (Olver et al., 2013); and Static-99, “the most commonly used risk tool with adult sexual offenders” (Hanson and Morton-Bourgon, 2009).
These instruments have undergone substantial and ongoing research and development, with their efficacy and limitations studied and reported in the research literature. It is profoundly disturbing that commercial software that is closed, proprietary, and not based on peer-reviewed studies is now in widespread use.
It is important to note that Equivant, the company behind COMPAS, published a technical rebuttal of ProPublica’s findings, raising issues with their assumptions and methodology. According to their report, “We strongly reject the conclusion that the COMPAS risk scales are racially biased against blacks” (Dieterich et al., 2016).
Wherever the truth may lie, the fact that the COMPAS software is closed source prevents an unbiased review, and this is a problem.
College admissions decisions
At nearly one hundred years old, the SAT exam (originally known as the “Scholastic Aptitude Test”) is a de facto national exam in the United States used for college admission decisions. In short, it “automates” some (or much) of the college admissions process.
What is less well-known is that the original developers of the exam intended it to “level the playing field”:
When the test was introduced in 1926, proponents maintained that requiring the exam would level the playing field and reduce the importance of social origins for access to college. Its creators saw it as a tool for elite colleges such as Harvard to use in selecting deserving students, regardless of ascribed characteristics and family background (Buchmann et al., 2010).
Of course, we all know what happened. Families with access to financial resources hired tutors to prep their children for the SAT, and a whole industry of test prep centers was born. The College Board (publisher of the SAT) responded in 1990 by renaming the test the Scholastic Assessment Test, reflecting the growing consensus that “aptitude” is not innate, but something that can be developed with practice. Now, the test is simply called the SAT, a change which the New York Times reported on with the headline “Insisting it’s nothing” (Applebome, 1997).
Meanwhile, contemporary research continues to demonstrate that children’s SAT scores correlate tightly with their parents’ socioeconomic status and education levels (“These four charts show how the SAT favors rich, educated families,” Goldfarb, 2014).
The good news is that many universities now allow students to apply for admission as “test-optional”; that is, without needing to submit SAT scores or those from similar standardized tests. Students are evaluated using other metrics, like high school GPA, and a portfolio of their accomplishments. This approach allows universities to admit a more diverse set of students while still evaluating whether they are academically qualified and college-ready.
What are the takeaways?
There are three main lessons here:
1. Automated decision-making has been part of our society for a long time, under the guise of it being a “scientific” and “empirical” method that produces “rational” decisions.
It’s only recently that we are recognizing that this approach does not produce fair outcomes. Quite to the contrary: these approaches perpetuate historical inequities.
2. Thus today’s use of AI is a natural evolution of our cultural proclivities to believe that actuarial systems are inherently fair. But there are differences: (a) AI systems are becoming pervasive in all aspects of decision-making; (b) AI systems use machine learning to evolve their models (decision-making algorithms), and if those decision-making systems are seeded with historical data, the result will necessarily be to reinforce the structural inequities of the past; and (c) many or most AI models are opaque—we can’t see the logic inside of them used to generate decisions.
It’s not that people are intentionally designing AI algorithms to be biased. Instead, it’s a predictable outcome of any model that’s trained on historical data.
3. Now that we are realizing this, we can have an intentional conversation about the impact of automated decision-making. We can create explicit definitions of fairness—ones that don’t blindly extend past injustices into the future.
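The second and third lessons can be sketched together in a few lines. The “model” below is the simplest one imaginable: for each combination of inputs, it repeats the majority decision found in the historical record. One explicit fairness definition, often called demographic parity, then asks whether approval rates are equal across groups. All data, neighborhood names, and function names here are assumptions made for illustration. Real machine learning models are far more complex, and demographic parity is only one of several competing definitions of fairness.

```python
from collections import Counter, defaultdict

# Synthetic historical decisions: (neighborhood, income_band, approved).
HISTORY = [
    ("north", "mid", True), ("north", "mid", True), ("north", "mid", False),
    ("north", "high", True), ("north", "high", True),
    ("south", "mid", False), ("south", "mid", False), ("south", "mid", True),
    ("south", "high", True), ("south", "high", False), ("south", "high", True),
]


def train(history):
    """Learn the majority past decision for each combination of inputs."""
    votes = defaultdict(Counter)
    for neighborhood, income_band, approved in history:
        votes[(neighborhood, income_band)][approved] += 1
    return {key: counts[True] > counts[False] for key, counts in votes.items()}


def approval_rates(model, applicants):
    """Share of new applicants approved by the model in each neighborhood."""
    approved = Counter()
    total = Counter()
    for neighborhood, income_band in applicants:
        total[neighborhood] += 1
        approved[neighborhood] += model[(neighborhood, income_band)]
    return {name: approved[name] / total[name] for name in total}


def demographic_parity_gap(rates):
    """Largest difference in approval rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# New applicants with identical income profiles in both neighborhoods.
NEW_APPLICANTS = [
    ("north", "mid"), ("north", "mid"), ("north", "high"), ("north", "high"),
    ("south", "mid"), ("south", "mid"), ("south", "high"), ("south", "high"),
]

model = train(HISTORY)
rates = approval_rates(model, NEW_APPLICANTS)
print(rates, demographic_parity_gap(rates))
```

In this toy data, the new applicants in the two neighborhoods have the same income profile, yet the learned rule approves all of the north applicants and only half of the south applicants, a gap of 0.5. Nobody wrote a rule that says “treat the south differently”; the disparity was simply carried over from the historical decisions. Writing the gap down as a number is what makes it possible to set a target for it.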
In general, I am an optimist. Broadly, technology has vastly improved our world and lifted many millions of people out of poverty. Artificial Intelligence is presently being used in many ways that create profound social good. Real-world AI systems perform early, non-invasive detection of cancer, improve crop yields, achieve substantial savings of energy, and many other wonderful things.
There are many initiatives underway to address fairness in AI systems. With continued social pressure, we will develop technologies and a social contract that together create the world we want to live in.
Acknowledgments: I am part of the AI4K12 Initiative (ai4k12.org), a joint project of the Association for the Advancement of Artificial Intelligence (AAAI) and the Computer Science Teachers Association (CSTA), and funded by National Science Foundation award DRL-1846073. We are developing guidelines for teaching artificial intelligence in K-12. With my collaborators, I have had many conversations that have contributed to my understanding of this field. I most especially thank David Touretzky, Christina Gardner-McCune, Deborah Seehorn, Irene Lee, and Hal Abelson, and all members of our team. Thank you to Irene and Hal for feedback on a draft of this essay. Any errors in this essay are mine alone.
References
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias. ProPublica, May 23. Retrieved from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Applebome, P. (1997). Insisting it’s nothing, creator says SAT, not S.A.T. The New York Times, April 2. Retrieved from https://www.nytimes.com/1997/04/02/us/insisting-it-s-nothing-creator-says-sat-not-sat.html.
Buchmann, C., Condron, D. J., & Roscigno, V. J. (2010). Shadow education, American style: Test preparation, the SAT and college enrollment. Social Forces, 89(2), 435–461.
Capon, N. (1982). Credit scoring systems: A critical analysis. Journal of Marketing, 46(2), 82–91.
Datta, A., Tschantz, M. C., & Datta, A. (2015). Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies, 2015(1), 92–112.
Dieterich, W., Mendoza, C., & Brennan, T. (2016). COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc. Retrieved from http://go.volarisgroup.com/rs/430-MBX-989/images/ProPublica_Commentary_Final_070616.pdf.
ECOA (1974). Equal Credit Opportunity Act, 15 U.S. Code § 1691. Retrieved from https://www.law.cornell.edu/uscode/text/15/1691.
Frank, C. H. (1970). The prediction of recidivism among young adult offenders by the recidivism-rehabilitation scale and index (Doctoral dissertation, The University of Oklahoma).
Goldfarb, Z. A. (2014). These four charts show how the SAT favors rich, educated families. The Washington Post, March 5. Retrieved from https://www.washingtonpost.com/news/wonk/wp/2014/03/05/these-four-charts-show-how-the-sat-favors-the-rich-educated-families/.
Hanson, R. K., & Morton-Bourgon, K. E. (2009). The accuracy of recidivism risk assessments for sexual offenders: A meta-analysis of 118 prediction studies. Psychological Assessment, 21(1), 1.
PCRA (2018). Post Conviction Risk Assessment. Retrieved from https://www.uscourts.gov/services-forms/probation-and-pretrial-services/supervision/post-conviction-risk-assessment.