By:
Andreas Stefik, Ph.D.
Assistant Professor, Department of Computer Science
University of Nevada, Las Vegas
Computer programming in K-12 education has become increasingly mainstream in the U.S., especially with the excellent work being conducted by groups like code.org, Project Lead the Way, CSTA, and many others. These trends have been happening for many reasons, not the least of which being a growing awareness of competition from abroad and the well known shortage of professionals in the field (a problem we are acutely aware of in Las Vegas). In this blog post, I will discuss work on the Quorum programming language, the first, so-called, evidence-oriented programming language. By our current count, Quorum is taught in approximately 34 K-12 schools in the U.S. From surveys, we estimate over 1,800 students will be taught Quorum next year.
In this blog, I’ll outline the origins of the Quorum project and evidence-oriented programming in general. More information can be found about Quorum at our website
(http://quorumlanguage.com/). Quorum has, to our knowledge, the first Hour of Code that is accessible through a screen reading device and a talk on the language was recently presented at Strange Loop and is available here: https://www.youtube.com/watch?v=uEFrE6cgVNY&feature=youtu.be&list=PLcGKfGEEONaCIl5eU53uPBnRJ9rbIH32R
Research on blindness and the origin of Evidence-Oriented Programming
The Quorum project began in approximately 2006 with an exploration I was conducting into the experiences of computer programmers who are blind or visually impaired. Initial interviews and discussions with professional blind programmers made it relatively obvious how many challenges the community faced. In many cases, mainstream tools (e.g., Visual Studio, NetBeans) were not particularly well suited, or in some cases did not work, for this community. More crucially, after discussing the issue with blind professionals, I suspected the challenges for K-12 students were likely worse, as learning Braille or a screen reader can be challenging for blind children. Consider for a moment why learning a programming language like C, on top of learning screen reading skills, might be difficult. While most individuals might see syntax like: for(int i = 0; i < 10; i++) {}
and be fine with it, this is typically translated into speech for the blind. While there is no perfect way to write this, readers are encouraged to say out loud “for left paren int i equals zero semicolon i less than ten semicolon i plus plus right paren left brace right brace.” Similarly, compiler errors when translated to audio can be difficult to understand. A trivial error (e.g., missing a variable definition in C++), might take a minute or two to “listen” to. Mainstream students have difficulty with these things too, but they are exacerbated in students with disabilities.
After making these observations, I concluded that one obvious way to approach making programming easier was to improve the tools (e.g., talking debuggers, accessible editor hints, accessible completion, accessible code folding). I invented all of those things (most are in Sodbeans, a tool for the blind used throughout the U.S. today), and they do seem to help, but C itself is still hard to “listen to.” Ultimately I couldn’t get this nagging question out of my head, “What was the evidence for the design of C to begin with?” So I began to investigate, but weirdly … I could find little human factors evidence in the academic literature.
Historical Context on Evidence Standards:
A slight detour is necessary to help readers who are not familiar with historical evidence gathering techniques. Giving credit where due, I highly recommend reading a paper by Ted Kaptchuk on the history of blind assessment, which influenced my thinking: http://media.virbcdn.com/files/5f/FileItem-260254-Kaptchuk_IntentIgnor_BulHisMed1998.pdf
Let’s briefly discuss this history. First, it was not always the case that medical practitioners, psychologists, and others used experiments in science. In medicine, the techniques started largely in the late 18th century after Louis XVI created a commission to study the bogus theory of animal magnetism (mesmerism). Benjamin Franklin, the head of the commission, used the idea of “sham treatments” to uncover problems with this (bogus) theory.
As science progressed, techniques became increasingly rigorous. Scientists used primitive placebo tests as early as 1834, although they still had no concept of experimental design. By the late 19th century, psychology was developing and we start to see the use of randomization, at first to explore sensory perception and to evaluate (bogus) supernatural claims like those made by psychics (e.g., talking to the dead).
Experimentation moved forward again in pharmacology, in part because of Charles Edouard Brown-Sequard’s claims that testicular extract from guinea pigs would “rejuvenate mental and physical health.” The scientific community, now increasingly skeptical of claims made without evidence, started still rather primitive assessments within a few months. Finally, by the mid-1930s, we obtained what many scientists think of today as the randomized controlled trial, made possible largely by Ronald Fisher in his famous, “The Design of Experiments.” While not mentioned by Kaptchuk, experimental design today has made other important leaps forward, most of which are unfortunately not internalized in computer science like they are in other disciplines (e.g., studies on replication, registered randomized controlled trials). Walter Tichy has probably discussed this more than most.
Evidence-Oriented Programming
Back to blindness and programming, consider observations that children who are blind had extraordinary difficulties with programming languages, and also that the academic literature was sparse with human factors data on programming languages. Certainly, there exists some papers on the topic (e.g., ESP, PPIG, Plateau, a small number of trials in the 70s and 80s), but the number of randomized controlled trials is small. Programming languages make up the entire foundation of modern software — effectively every computing device that everyone owns is built with one. Yet, the top conferences in programming languages used rigorous proofs to determine if features worked, and had significant empirical data to evaluate performance, but used anecdotes for human factors. I didn’t understand why this would be (yet), but by 2009, we started investigating ourselves, largely in this paper:
Stefik and E. Gellenbeck. Empirical studies on programming language stimuli. Software Quality Journal, 19(1):65-99, 2011. 10.1007/s11219-010-9106-7.
Some problems became obvious. Surveys showed words like “for, while, or foreach” were, in a bizarre and unexpected twist of irony, the three least intuitive choices for people in our sample (which we later replicated). This is ironic because these choices are common across a large number of programming languages.
Surveys only tell us so much though, so we went further, which is to take from the medical playbook, using sham (or dummy) treatments. In the first test, we used what I called a “Placebo Language,” basically a randomly designed programming language where symbols are chosen from the ASCII table (one could imagine other placebo languages). We sent the first of these tests to a small workshop called Plateau in 2011:
Andreas Stefik, Susanna Siebert, Melissa Stefik, Kim Slattery. An Empirical Comparison of the Accuracy Rates of Novices using the Quorum, Perl, and Randomo Programming Languages. Workshop on the Evaluation and Usability of Programming Languages and Tools (PLATEAU 2011). Portland, OR, October 24th, 2011.
The result, at least to me, was an eye opener. Perl, despite the fact that it was a very popular programming language with significant community support, was apparently not detectably easier to use for novices than a language that my student at the time, Susanna Kiwala (formerly Siebert), created by essentially rolling dice and picking (ridiculous) symbols at random. That result was astonishing and I frankly wasn’t sure I believed it. So, we ran a replication with three additional languages. The result was the same, replicating with extreme accuracy on a new sample (Perl within less than a percent). That work was published here:
Andreas Stefik and Susanna Siebert. 2013. An Empirical Investigation into Programming Language Syntax. ACM Transactions on Computing Education 13, 4, Article 19 (November 2013), 40 pages.
Interestingly, in this new study, the same result we saw with Perl we also observed with Java, a programming language so popular it is used in the Computer Science A – AP test in high school. Ultimately, using a technique we call Token Accuracy Mapping, which is basically a way to figure out which tokens may have caused the problems, it appeared that C-style syntax was plausibly the culprit. At the same time, using this new technique, we found a variety of problems with Quorum and borrowed from languages where the evidence showed they had a better design (e.g., Ruby’s if statement design, with the exception of the equality syntax ==, was integrated into Quorum 1.7).
Again, this finding surprised us. At this point, while there is a lot more to learn, we knew there was a problem and our strong suspicion was that the language community had not been gathering rigorous evidence on human factors, basically in its entire history, which we have since confirmed is an unfortunate truth. We hypothesize that this lack of evidence may be one of the leading causes of the so-called “programming language wars,” which we discuss at length in the literature:
Andreas Stefik and Stefan Hanenberg. 2014. The Programming Language Wars: Questions and Responsibilities for the Programming Language Community. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward! 2014). ACM, New York, NY, USA, 283-299.
In other words, if language designers are not even gathering evidence on human factors, then it’s not surprising that they seem to disagree about their designs or that students might needlessly struggle. In essence, language designers prove that their features work, and show evidence that they are “fast,” but they never seem to ask, in a scientifically defensible way, “What is the impact on various kinds of people?”
Creating Quorum:
As these studies progressed over time, and as we worked more with various communities, my wife and I decided that if we wanted to program using a language that used human factors evidence as a guide, we would have to build it ourselves. So, we developed the first Evidence-Oriented Programming language, which we called Quorum. Here are the basic ideas of this paradigm as I see it today:
- Design features of a language must have corresponding evidence from appropriately constructed randomized controlled trials for as much as is possible
- The language authors must allow the external community to suggest changes according to the scientific method, inspired by lessons from history
- Evidence from other techniques (e.g., software repository mining, educational data) if gathered from humans and rigorous is allowed
Now to be clear, Evidence-Oriented Programming is a paradigm, not a language, and the scholar Stefan Hanenberg in Germany thought it up at basically the same time. He came to the same conclusions I did for different reasons, with different experiments, on a different part of the globe (his story on the topic is very different from mine and fascinating). Others are fleshing it out better than I have, especially the scholar Antti-Juhani Kaijanaho, who has created what is perhaps the most systematic review of the history of the programming language wars ever.
More crucially, it is a paradigm because the evidence on various design decisions (which I won’t get into here, but have written about extensively) is tricky to understand and requires significant statistical expertise to grasp. It also appears that some design decisions have trade-offs, the most well known of which to-date is that whether functions in programming contain type information (e.g., an integer, a number) appears to improve productivity of those past the third year of experience in college, but has a small, but non-zero, negative impact on novices at around a freshman year of college (depending on where in a program it is).
In other words, I can imagine one language designer using evidence to optimize human factors for one group, while another optimizing for another, with both taking all known controlled trials into account and therefore having some similarities in their design. As a simple example, a block language designer hypothetically designing for a child’s toy or learning might use the word “repeat” for iteration, which the evidence supports, yet a text-based language designer might use the same word when designing for industrial scale robotics or a NASA rocket where blocks might be impractical and where efficiency of the language is crucial (or some other reason). This type of thinking could provide increased consistency amongst language designs in the years to come, with an increasingly stronger foundation of evidence.
In any case, evidence-oriented programming is not a panacea, nor does it imply we will eventually get to the unlikely idea of “one language to rule them all.” Plausibly, what we are observing today is a paradigm shift in language design, going from next to no evidence at all on human factors toward an unwillingness of young scholars like myself to accept anecdotes or prestige as fact, for a basically similar reason Benjamin Franklin didn’t in the 18th century. No one knows where the evidence-oriented movement will end up, perhaps with some complex hybrid of blocks or visualization, or maybe with different domains having various kinds of approaches, or maybe even partial standardization of some aspects of language design within a century (or two). Wherever it ends up, the goal is to follow the evidence wherever it leads in order to make the next generation of programming technologies just as practical at industry scale, while also being easier to understand and use.
While Quorum began as a toy exclusively to help blind children learn computer science, it has grown in unexpected ways. While it is still used across the U.S. at around half of schools for the blind and visually impaired, about half of teachers that came to the 6th annual Experience Programming in Quorum workshop in Vancouver, WA represented mainstream schools. Quorum is a Java Virtual Machine Language, so it will run on most machines, and has a variety of libraries, including those for creating computer games, LEGO robotics support, and more. Quorum is accessible for people with disabilities, including those with visual impairments, and supports a robust and accessible development environment based on Oracle’s NetBeans. Since the language is built on industrial strength technologies, it can be used not just in K-12, but potentially in college or industry as well, for effectively any general purpose application. Finally, Quorum is under the BSD license (free for any use), and all curriculum is free under the Creative Commons License. Our curriculum goes through continuous improvement by dedicated K-12 teachers and is being mapped to Computer Science Principles and, hopefully soon, the common core. More information can be found at: http://quorumlanguage.com/
Pingback: Four short links: 2 February 2016 - O'Reilly Radar
C wasn’t designed as a teaching/learning language. It was designed by experts as a replacement for assembly language at a time when computer resources were costly (memory, secondary storage, CPU cycles) and input methods were crude (Teletype terminals). More recent languages adopted much of its syntax and its keywords because they were familiar to the established corps of programmers. Recent language design has focused on provable correctness rather than human factors, which I’ve thought was putting the cart before the horse. Have you looked at the design history of the Smalltalk language? Its designers started out with the aim of creating a language suitable for teaching children how to program, although I don’t know what experiments they conducted and/or how rigorously designed they were, to arrive at the features of Smalltalk