Course website:
Show (and discuss) the “plumbing”
David Donoho “50 years of data science”
Donoho reminds us that all that is new is old again (or vice versa?).
Quoting Tukey:
For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. … All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.
And, also:
Four major influences act on data analysis today:
1. The formal theories of statistics
2. Accelerating developments in computers and display devices
3. The challenge, in many fields, of more and ever larger bodies of data
4. The emphasis on quantification in an ever wider variety of disciplines
This idea had a huge influence. We’ll come back to it when we discuss machine learning in a few weeks.
The common task framework (CTF; Liberman 2010)
Netflix prize
ImageNet
DARPA Grand Challenges
BraTS
FiberCup phantom
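To make the mechanics of a common task concrete, here is a minimal sketch of the usual ingredients: a shared training set, competing submissions, and a referee who scores each submission on held-out data. The toy data, the two submissions, and names such as `referee_score` are hypothetical illustrations, not taken from any of the challenges listed above.

```python
import random


def make_dataset(n, seed):
    """Toy data: the label is 1 when x > 0.5, with about 10% of labels flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data


train = make_dataset(200, seed=0)  # released to every competitor
test = make_dataset(100, seed=1)   # kept private by the referee


def referee_score(predict):
    """Referee: accuracy of a submitted predictor on the hidden test set."""
    correct = sum(predict(x) == y for x, y in test)
    return correct / len(test)


def always_one(x):
    """Hypothetical baseline submission: ignore the input."""
    return 1


def fit_threshold(data):
    """Hypothetical competitor: pick the cutoff with the best training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in (i / 100 for i in range(101)):
        acc = sum(int(x > t) == y for x, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


t = fit_threshold(train)
submissions = {
    "always_one": always_one,
    "threshold": lambda x: int(x > t),
}

# The leaderboard ranks submissions by the referee's score, best first.
leaderboard = sorted(
    ((referee_score(f), name) for name, f in submissions.items()),
    reverse=True,
)
for score, name in leaderboard:
    print(f"{name}: {score:.2f}")
```

The point of the design is that competitors only ever see `train`; the single agreed-upon score on hidden data is what makes results from different groups directly comparable.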
Halevy, Norvig and Pereira (2009), “The unreasonable effectiveness of data”; all three are Google researchers with backgrounds in AI research
In analogy to “The unreasonable effectiveness of mathematics in the natural sciences” (Wigner, 1960)
Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
The paradigmatic example: learning the implicit structure of language from large corpora.
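As a toy illustration of learning structure from text alone, the sketch below counts which word follows which (a bigram model) and uses the counts to guess the next word. The three-sentence corpus and the helper `most_likely_next` are made up for this example; real systems rely on vastly larger corpora and richer models.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus; each string is one sentence.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# follows[w] counts the words observed immediately after w.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1


def most_likely_next(word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(most_likely_next("sat"))   # the only word ever seen after "sat"
print(dict(follows["the"]))      # all observed followers of "the"
```

No grammar rules are written down anywhere; regularities such as "sat" being followed by "on" emerge from the counts, and they sharpen as the corpus grows.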
In short, information technology skills are at the heart of the qualifications needed to work in predictive modeling. These skills are analogous to the laboratory skills that a wet-lab scientist needs to carry out experiments. No math required.
See also Cleveland (2001) “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics”.