As I think and read about various educational approaches (in the process of preparing for the course on math and writing that I’m teaching in the fall), I sometimes see studies with conclusions like this:
Students who were taught using method A significantly outperformed students taught using method B.
Sometimes, method A is the new and exciting approach to education, and method B is the old approach. Other times, it’s switched, with a contrarian study showing that the newfangled approach works less well than the traditional method.1 In either case, the implication of the study is clear: method A is much better than method B.
But there’s a problem here. All this study shows is that one specific instantiation of teaching in method A worked better than one specific instantiation of method B.
Yes, controlled scientific studies are important.2 But in most scientific research, the variable you’re testing is much more straightforward.
Let’s consider a scientific study that tests the effect of some drug on human memory. For a first attempt at testing this scientifically, I’d take my subjects, split them randomly into two pools, and give one group the drug and the other group a placebo. Then I’d look at how they do on memory tests.
If the group with the drug does better than the control, then I can conclude that this drug, at the dosage given, improves human memory. And if that group doesn’t do better, all I can conclude is that the drug at that dosage doesn’t improve human memory (at least as I measure it). But it’s possible that the dosage was the problem: I might run further studies, and eventually find that the right dosage does have a positive effect.
What does my variable look like? My experimental variable is amount of drug. Let’s assume, for now, that the drug is always given once a day, at the same time. Then this variable is one-dimensional: it consists of a single number, the dosage. To test the variable, we must test each dosage level. Of course, there are infinitely many possible doses, but we can make this finite by (a) assuming that all doses are between 0 and, say, 100mg (or whatever a reasonable upper limit of the drug would be); and (b) discretizing the possible doses, by considering only, for example, 0, 10, 20, …, 100. If we do this, there are only 11 possible doses.3
That’s already a lot of separate trials to run. But what if we also varied the frequency of the doses? Some subjects receive the full amount once a day; others receive the same daily amount split into 2, 4, or 8 smaller doses spread over the day. Well, now we have a two-dimensional space of possibilities: there are four different frequency possibilities for every dosage amount. So now we have 4 x 11 = 44 things to test.
Now imagine if we had even more possibilities: What if the drug could be given either in pill form or by intravenous injection? Then there are twice as many possibilities to consider: 88.
And so on. What is happening is that the variable is becoming higher-dimensional. When this happens, to make a robust statement about the variable, we have to test many more things.
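To make the counting concrete, here’s a small Python sketch that enumerates the conditions a fully thorough trial would have to cover. It uses only the numbers assumed above; it’s a toy tally, not a real experimental design.

```python
# Enumerate the drug example's design space: dose (0-100 mg in 10 mg steps),
# dosing frequency (1, 2, 4, or 8 times a day), and delivery route (pill or IV).
from itertools import product

doses = range(0, 101, 10)       # 11 dose levels, counting both endpoints
frequencies = [1, 2, 4, 8]      # doses per day
deliveries = ["pill", "IV"]

conditions = list(product(doses, frequencies, deliveries))
print(len(conditions))          # 11 * 4 * 2 = 88 distinct conditions
```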
Now let’s suppose we have two different drugs, A and B, and we want to see which works better. We know that both drugs have different effects depending on dosage, frequency, and pill/IV delivery. We run a single experiment, giving randomly assigned groups one drug or the other. Let’s say that subjects receiving drug A do better.
What does this mean? It means only that A at a certain dosage, frequency, and delivery is better than B at a certain dosage, frequency, and delivery.
If we want to figure out whether A is better than B in general, then we have to run many more tests: We have to consider all 88 possibilities for A, and all 88 possibilities for B. And of course for each of these trials, we have to run it on many subjects.
All of a sudden, these trials become prohibitively expensive.
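To put a rough number on “prohibitively expensive”: suppose each condition needed, say, 100 subjects. That figure is invented purely for illustration (a real power calculation would set it), but the arithmetic makes the point.

```python
# Rough cost of an exhaustive A-vs-B comparison. The 88 conditions per drug
# come from the enumeration above; 100 subjects per condition is an invented
# placeholder, not the result of any power calculation.
conditions_per_drug = 11 * 4 * 2        # dose levels x frequencies x delivery routes
subjects_per_condition = 100            # assumption for illustration only

total_arms = 2 * conditions_per_drug    # 176 trial arms
total_subjects = total_arms * subjects_per_condition
print(total_arms, total_subjects)       # 176 arms, 17600 subjects
```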
In real life, of course, you don’t have to do all these studies. If a drug works at 50mg and 80mg, it probably works fine at 60mg and 70mg. It’s probably also fine just to run a few tests comparing pill versus IV; we don’t need to test every single combination of delivery and dosage level. In other words, there are patterns among these possibilities, which adds structure to this higher-dimensional space, allowing us to reduce the number of studies we need to do. But this is still a simple example: I could add many more complications just in the biomedical setting. (For example, for how long do the subjects use the drug? What if they have to use several drugs in combination? What times of day do they take the drugs?)
Now let’s go to the messy real world. Let’s say you want to test whether traditional lectures or flipped classrooms work better for teaching students math. (And let’s simplify matters tremendously by assuming that there is a single objective test for determining how well students learn the material.) This is analogous to the situation with drugs A and B.
How many dimensions do we have here? The number is vast beyond all human comprehension. Let me start just by listing some of the simplest dimensions:
- How many students are in the class.
- What order to teach the topics in.
- Whether to use calculators or not.
- How long to meet each week.
- How much homework to assign students.
- Which book to use (say, from a list of 10 books).
- Whether to focus on proofs or applications or examples (or some combination).
- Whether to have quizzes or not.
- Etc.
But these aren’t even the tip of the iceberg. Who are your instructors going to be? The possible characteristics of a given teaching style are essentially infinite. Even if you want to reduce them to some finite set of characteristics—analogous to how the Myers-Briggs test reduces the manifold varieties of human personality to just four dimensions—you have something incredibly vast.
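To get a feel for how quickly this blows up, here’s a deliberately crude sketch: give each dimension from the list above a handful of options and multiply them out. Every option count below is invented for illustration, not taken from any study.

```python
# A crude tally of the classroom design space. Every option count is invented
# purely to show the combinatorial growth; none comes from actual data.
from math import prod

options_per_dimension = {
    "class size":           5,   # e.g. 10, 20, 40, 80, 160 students
    "topic order":          6,   # a few plausible orderings
    "calculators":          2,   # yes / no
    "weekly contact hours": 4,
    "homework load":        4,
    "textbook":            10,   # the list of 10 books mentioned above
    "proofs/applications":  4,   # proofs, applications, examples, or a mix
    "quizzes":              2,   # yes / no
}

print(prod(options_per_dimension.values()))  # 76,800 combinations, before we
                                             # even touch instructor personality
```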
So what are you going to conclude when you run a study and discover, say, that flipped classrooms work better than lectures? What if the specific instructors in your study have personalities better adapted to teaching in flipped classrooms? (Or perhaps they’re much more motivated to try harder with this new educational technique.) What if you use calculators for both classes, but it turns out that calculators work well only with one method? What if the class size you’re using plays a big role? And so on.
What I’m saying is this: these studies are rarely robust. Because of the vast number of possibilities you’d have to test, it’s effectively impossible to conclude scientifically that one approach is necessarily better than another.
What then? Should we give up on these studies? Is it all futile?
Certainly not! First, there’s no doubt that it’s important that we continue experimenting with different techniques. And then, yes, it is crucial that we consider the evidence for the efficacy of these techniques—especially if we can do controlled, quantitative experiments. (Don’t get me started on poorly designed correlational studies, which commit far more egregious sins in drawing conclusions from their data!)
But to conclude “it’s been shown scientifically that students learn better when they {use groupwork, aren’t taught grammar, are taught grammar, use calculators, don’t use calculators, learn in flipped classrooms, etc.}” is to misunderstand science.4
Instead of inappropriately invoking the name of science, we need to draw upon our very humanistic capabilities of judgment, evaluation, analysis, wisdom. We need the qualitative as much as the quantitative. No, don’t throw away the quantitative—certainly don’t think we can analyze these issues without also understanding the quantitative, both its powers and its limitations. Yes, we should do these studies. But we need to examine them closely: what are they saying? Are they robust? Do the results coincide with our intuitions? What do our qualitative observations show? (For, yes, we need to do qualitative studies, too—studies which must be done by the right people.)
And we must also face the reality that judgment and wisdom are not as easily measured as quantitative prowess. There’s no test to determine who can read Shakespeare most incisively, who can offer profound wisdom on the human condition, who can offer the best commentary on current affairs. No, instead we have to hope for a society—in academia, in the world of arts and letters, may we even dare to dream of it in the world of public policy and politics?—where the wise can rise to the top, and we can trust them.
Postscript: My comments here are focused on pedagogy research, but they of course apply more widely in the social sciences. Also, there are many more issues to discuss, most notably the fact that different subsets of the population might respond better to different techniques. (This is a huge issue in biomedical research, as well!)
- Let me note that I’m going to stay agnostic on all of these educational approaches. It so happens that many of the studies tend to be in the former camp, and so it may seem that I’m criticizing these innovations. That is not what I’m doing. As my post suggests, we can evaluate these approaches only by careful analysis; few of us have done enough of that analysis to make the sweeping, definitive conclusions one hears about. ↩
- Here I’m assuming that they’ve actually designed things right, randomly assigning students to the different pools and using a large enough sample size. (Note that you have to think about several different categories where you need large enough sample sizes: not just students but classes, teachers, schools, etc. After all, if a class of 20 students has one troublemaker, that affects the other students, so their results are not independent.) I’m sure many studies don’t even meet these criteria. ↩
- We get 11 because we’re counting both endpoints, 0 and 100. ↩
- There’s one case where it might be okay to make such a statement. If you’re teaching with one of the methods, then lying to your students—as I’m arguing here, it is really lying—about the science behind the method might be a justifiable pedagogical technique. For example, when I teach standard calculus courses at Michigan, where we use groupwork in class, I’ll say, “We’re doing lots of groupwork because it’s been scientifically shown that groupwork helps you learn better.” This seems okay because it will increase my students’ buy-in to the method; that justifies the white lie in my mind. (It so happens that I do think that groupwork is well-suited to this style of calculus class, certainly better than straight lecturing—although I’d argue rather vehemently that this isn’t the type of math course we should be teaching freshmen in the first place.) ↩