I’ve been neurotically interested in psychology research reform for the past few years. Recently, I’ve become very curious about some of the proposals about reporting statistics. Cruising through my Twitter feed I stumbled upon Geoff Cumming’s article The New Statistics: Why and How. The article is a great summary of the guidelines proposed to help better our science.
I don’t plan to detail the article point by point, but I do want to talk about one of his graphs and then explain how his illustration was useful for my own research questions.
Figure 1. Simulation graph from Cumming (2013) that I don’t have permission to reproduce.
In the section that includes this figure, Dr. Cumming quickly explains why Null Hypothesis Significance Testing (NHST) should not be used to confirm research findings. To illustrate this, he created a simulation consisting of 25 repetitions of a simple experiment with 2 independent samples. The difference between the populations is represented by the line, the circles represent different samples, and the p-values are on the left of the figure. As you can see, the circles and the confidence intervals dance around, and p-values are rarely significant. For my purposes, I want you to eyeball how often a 95% confidence interval captures the mean difference of the experiment just below it (its replication), and then compare that count to how often a significant p-value is followed by another significant p-value.
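If you can't see the figure, here is a rough sketch of that kind of simulation. This is not Dr. Cumming's code: the sample size per group, the population difference, the standard deviation and the seed are all values I assumed for illustration. It runs 25 two-group experiments, then counts how often a 95% confidence interval captures the next experiment's mean difference and how often a significant p-value is followed by another one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)  # arbitrary seed so the run is repeatable

# All of these values are assumptions for illustration.
N_PER_GROUP = 32      # participants per group
TRUE_DIFF = 10.0      # population mean difference
SIGMA = 20.0          # population standard deviation
N_EXPERIMENTS = 25    # number of repetitions


def run_experiment():
    """Simulate one two-group experiment; return (diff, ci_low, ci_high, p)."""
    control = rng.normal(50.0, SIGMA, N_PER_GROUP)
    treated = rng.normal(50.0 + TRUE_DIFF, SIGMA, N_PER_GROUP)
    diff = treated.mean() - control.mean()
    p = stats.ttest_ind(treated, control).pvalue  # Student's t, equal variances
    # 95% CI for the difference, using the pooled variance (equal group sizes).
    pooled_var = (control.var(ddof=1) + treated.var(ddof=1)) / 2
    se = np.sqrt(pooled_var * 2 / N_PER_GROUP)
    half_width = stats.t.ppf(0.975, 2 * N_PER_GROUP - 2) * se
    return diff, diff - half_width, diff + half_width, p


results = [run_experiment() for _ in range(N_EXPERIMENTS)]
pairs = list(zip(results, results[1:]))  # each experiment with its replication

# How often does an interval capture the next experiment's mean difference?
ci_captures = sum(lo <= nxt[0] <= hi for (_, lo, hi, _), nxt in pairs)
# How often is a significant result followed by another significant result?
sig_first = [(a, b) for a, b in pairs if a[3] < 0.05]
sig_repeats = sum(b[3] < 0.05 for _, b in sig_first)

for diff, lo, hi, p in results:
    print(f"diff = {diff:6.2f}  95% CI [{lo:6.2f}, {hi:6.2f}]  p = {p:.3f}")
print(f"CI captured the next difference in {ci_captures} of {len(pairs)} pairs")
print(f"significant p followed by significant p in {sig_repeats} of {len(sig_first)} cases")
```

Change the seed and the whole picture reshuffles, which is more or less the point of the figure.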
To introduce my problem, I will only describe the design of a thrice replicated experiment because the research is not mine to share. In each experiment, participants received 1 of 2 treatments, or no treatment, and then they responded to a 1-item scale from 1-100 that would serve as a manipulation check for the treatments. Though the treatments were different, each was predicted to produce lower responses on the scale compared to those responses following the neutral condition. An Analysis of Variance revealed a significant difference in responses between conditions in the first experiment, but not in the second experiment. This is the problem that I want to discuss.
When I inspected the means of all 3 conditions in each experiment, I could see that they were different and why (I think) the results did not “reach significance.” My first question was, “Well, are the means really, significantly different from their corresponding means from the first experiment, even if they aren’t significantly different from each other in the second experiment?” In other words, is the mean response for the neutral condition in experiment 2 significantly different from the mean response for the neutral condition in experiment 1? What about the treatment condition means between experiments?
To test this, I took the means, standard deviations and sample sizes from the 3 conditions in both experiments and ran t-tests comparing each condition across experiments. To make extra damn sure, I made the sample sizes equal by drawing a random subsample from each larger group that matched the size of the smallest group. All the p-values were too big. Have I answered my question? Are the means statistically different? By convention, I failed to reject the null hypothesis that there is no difference between the means. Yes, I have an answer.
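For anyone who wants to try this kind of comparison, here is a minimal sketch of a t-test run from summary statistics alone, using SciPy's `ttest_ind_from_stats`. The means, standard deviations and sample sizes below are made up (the real data aren't mine to share), and the `compare` helper is just a name I invented for illustration. I also show Welch's version, which doesn't assume equal variances. That's an alternative to my subsampling trick, not what I actually did.

```python
from scipy import stats

# Hypothetical summary statistics for the same condition in two experiments.
exp1 = {"mean": 62.0, "sd": 21.0, "n": 40}
exp2 = {"mean": 57.5, "sd": 23.5, "n": 33}


def compare(a, b, equal_var=True):
    """Independent-samples t-test computed from means, SDs and sample sizes."""
    return stats.ttest_ind_from_stats(
        a["mean"], a["sd"], a["n"],
        b["mean"], b["sd"], b["n"],
        equal_var=equal_var,
    )


student = compare(exp1, exp2)                  # classic Student's t-test
welch = compare(exp1, exp2, equal_var=False)   # Welch's t-test

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

With these invented numbers, neither test comes anywhere near p < .05, which is the same situation I found myself in.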
By this point, I felt a little better that the means weren’t statistically different, but this led me to ask, “So does this imply that these means are expected from these populations?” In other words, if we had the grant money to run this experiment over and over, to the tune of, say, thousands of times, would we expect the means from experiments 1 & 2 to show up in one of our replications?
To visualize this, I created 95% confidence intervals around the means of all 3 conditions from not only the 2 experiments that I ran, but a third experiment run by another researcher.
Figure 2. Control, Treatment 1 and Treatment 2 means from 3 experiments (mine are the 2nd and 3rd of each set)
You can see that confidence intervals from previous experiments capture the means in the experiments that follow them. All is right with the world because most of the time (a little over 80% according to Cumming) a confidence interval from the previous experiment will capture the mean in the next. If I run a version of this experiment again, I’m sure that my statistical fishing nets you see here will be wide enough to snag the new means. What I hope I’ve demonstrated here is that if experiment 2 (the third mean from each condition on the graph above) had been my first experiment, and if I had relied on significance to make a conclusion about the differences between these conditions, then I might not have run another experiment. I might have concluded that these conditions do not produce meaningfully different responses on my scale, when maybe they do.
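That "a little over 80%" figure is easy to check by brute force. The sketch below assumes a normal population with a mean, standard deviation and sample size that I picked arbitrarily. It simulates many original/replication pairs and counts how often the original 95% interval contains the replication's mean. For comparison, it also counts how often the interval contains the true population mean, which should sit near 95%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(83)  # arbitrary seed

# Assumed values for illustration only.
N = 30            # participants per experiment
MU = 50.0         # population mean
SIGMA = 20.0      # population standard deviation
N_PAIRS = 20_000  # number of original/replication pairs

original = rng.normal(MU, SIGMA, (N_PAIRS, N))
replication = rng.normal(MU, SIGMA, (N_PAIRS, N))

# 95% CI half-width around each original experiment's mean.
orig_mean = original.mean(axis=1)
half_width = stats.t.ppf(0.975, N - 1) * original.std(axis=1, ddof=1) / np.sqrt(N)

# Does the original interval contain the replication mean? The true mean?
rep_mean = replication.mean(axis=1)
capture_rate = np.mean(np.abs(rep_mean - orig_mean) <= half_width)
coverage = np.mean(np.abs(MU - orig_mean) <= half_width)

print(f"captures the replication mean: {capture_rate:.1%}")
print(f"captures the population mean:  {coverage:.1%}")
```

The capture rate comes out lower than 95% because the replication mean bounces around too, so the interval is chasing a moving target.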
Social science has made great leaps and bounds over the past century, but it isn’t perfect. As a freshly minted undergraduate, I can assure you that students will be taught the dogma of NHST for years to come. This should change. If, as scientists, we pride ourselves on our methods for discovering truth, we should take opportunities sooner rather than later to refine those methods. Otherwise, we’re just sitting in armchairs, I guess.