Something Old, Something New

I’ve been programming since 1972 (junior year of high school, long before programming in high school was considered a normal course of events). Over the years I’ve programmed in BASIC, FORTRAN, LISP, PL/I, Scheme, Prolog, C, C++, Pascal, Python, and various other languages here and there when I took Programming Languages and AI Programming (CLU, anyone?). But even with that long laundry list of languages, reasons still arise to learn more.
So I have spent the past few weeks learning R. If you aren’t familiar with R, it’s a language designed for statistical computing and graphics. Why, you might ask, would I bother? I’m working with some of my faculty colleagues from Economics and Political Science. We have purchased a very large database of Chinese customs data: millions of records of trade transactions at the level of “on such and such a day 25 pairs of sneakers were exported by this company through that port on their way to this other company in this other country.” Oh yeah, we’ve got imports too. Of course, my colleagues want all kinds of aggregation. It seems that customs data comes with things called HS codes (Harmonized System codes for commodity classification). The first two digits get you into high-level categories like Foodstuffs or Textiles. But then you can dig deeper. Was that shipment wool, or was it fine or coarse animal hair?
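The digit pairs nest: the 2-digit prefix is the chapter, and the 4-digit prefix is a heading within it. A tiny illustrative snippet in R makes the idea concrete (the particular code value here is just a sample; within chapter 51, heading 5101 is wool and heading 5102 is fine or coarse animal hair):

    hs <- "510111"        # a sample 6-digit HS code
    substr(hs, 1, 2)      # "51": the chapter (wool and animal hair)
    substr(hs, 1, 4)      # "5101": the heading within it (wool)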
I could see how I’d compute in Python what my colleagues wanted, using tons of loops and complicated data structures. But that seemed unwieldy considering that they are interested in 2-digit and 4-digit HS codes, so I could be dealing with 1000 buckets of aggregation. R, by contrast, is designed from the ground up for dealing with data frames and data imported from CSV files. It can do quickly, with one line of code, the kinds of things that I would otherwise have to write a lot of loops for. On the other hand, getting started is not for the faint of heart. I made some quick headway, for sure, enough that my colleagues could use preliminary results to better articulate what they really wanted. And then I realized that my baby skills were not adequate, that I was going to have to do some serious learning.
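To give a flavor of that one-line style, here is a minimal sketch under invented names: the file trade.csv and the columns hs_code and value are placeholders of mine, not the actual schema of the customs database.

    # Read the data; keep HS codes as text so leading zeros survive.
    trade <- read.csv("trade.csv", colClasses = c(hs_code = "character"))

    # Derive the 2-digit chapter from each full HS code.
    trade$hs2 <- substr(trade$hs_code, 1, 2)

    # One line in place of all the Python loops: total value per chapter.
    totals2 <- aggregate(value ~ hs2, data = trade, FUN = sum)

The formula interface to aggregate is base R’s built-in group-by; packages like dplyr and data.table offer the same operation in their own one-liners.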
Two books, plenty of Googling for examples, some downloaded PDFs. But I also told myself what I always tell my students: the joy of working in an interpreted language is that you can try things out, get quick feedback, and move on accordingly. So just when frustration was reaching high levels, I started going step by careful step. After a lot of trial and error, I came up with a new approach for the part that had me stumped, and then I wrote a little inline code to check the general idea behind it. That worked. Then I moved that code into a function. Then I made the function a little more complicated, and a little more complicated, until eventually it fully executed step one of my new approach. Then I wrote the code that completed the rest of what I needed. Little step by little step. Finally, I figured out how to write the results out to a new CSV file so my colleagues and their students can review them (both the function and the CSV step are sketched below).

I’m still waiting to hear back, since I don’t want to write any more code until I’m sure I’m delivering the correct things to them. But I was every bit as joyous when my code ran correctly as my intro students are when they get code to run. And it served as a reminder of the value of thoughtful, painstaking problem solving and thoughtful, painstaking implementation. As I said to one of my colleagues, looking at the code, which is just over 50 executable lines, you’d never guess how long it took to write. But if I had written it faster, it would have been three times longer and probably ten times less efficient. Now that I have it running on 10,000 data records, I can’t wait to see how it fares on the larger data sets!
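For concreteness, here is a hypothetical sketch of where those two steps might end up, using the same invented names as above; hs_totals is my placeholder, not the actual function.

    # The inline check, grown into a function: one bucket column,
    # one aggregation, parameterized by how many digits to keep.
    hs_totals <- function(df, digits) {
      df$bucket <- substr(df$hs_code, 1, digits)
      aggregate(value ~ bucket, data = df, FUN = sum)
    }

    # Write the 4-digit results to a new CSV file for review.
    write.csv(hs_totals(trade, 4), "hs4_totals.csv", row.names = FALSE)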
Valerie Barr
Computational Thinking Task Force chair