I want to clarify and expand upon the main point of my last blog post, “New” Statistics. This blog is a place for science almost as much as it’s a place for me to practice more concise writing.
So in my post, I talked about how I was worried that the results of a second experiment were “different” from the results of the first. I was worried because the p-value wasn’t sufficiently small. I addressed my worries in two ways that yielded, in effect, the same result. First, I ran t-tests on the differences between the means of the two experiments. Since the p-values were not significant, I concluded that the means weren’t different. Second, I used a bootstrapping technique to show that the means from both experiments fell within 95% confidence intervals. My second analysis was, in effect, the same thing as the first because, by definition, a mean falling outside a confidence interval is what it means to reach significance.
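The bootstrap procedure is easy to sketch. Here is a minimal version using only the standard library; the two samples are hypothetical stand-ins on a 100-point scale, not the actual data from the earlier post:

```python
import random
import statistics

random.seed(0)

# Hypothetical scores for the two experiments (stand-in data).
exp1 = [random.gauss(60, 15) for _ in range(30)]
exp2 = [random.gauss(55, 15) for _ in range(30)]

def bootstrap_mean_ci(sample, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the sample mean:
    resample with replacement, record each resample's mean, and take
    the central (1 - alpha) span of those means."""
    boot_means = sorted(
        statistics.mean(random.choices(sample, k=len(sample)))
        for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo1, hi1 = bootstrap_mean_ci(exp1)
# The check from the post: does the other experiment's mean fall
# inside this experiment's 95% interval?
print(lo1 <= statistics.mean(exp2) <= hi1)
```

The same check run in the other direction (bootstrapping `exp2` and locating `exp1`’s mean) completes the comparison described above.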
My goal was to illustrate a difference in the framing of analyses and research results. The current frame in psychology is concerned with small p-values, whereas the proposed frame is concerned with the magnitude of standardized, estimated differences (effect sizes) and the accuracy of those estimates. It seems trivial, but the distinction is important. A good example comes from how we interpret polls. Take, for example, the current approval rating of the Affordable Care Act. As of November 10th, 55% of Americans disapprove of the new healthcare law and 40% approve, +/- 4 percentage points. This margin of error is based on a 95% confidence interval. It tells us that 51% might be the true disapproval rating, or maybe 44% is the true approval rating. This is fairly useful information, not only because it gives us a range, but because even when we assume the true values fall on the end points, we can be confident saying that more people disapprove than approve of the ACA. Now what if this same information were framed like a lot of psychological research reports? How would we use “more than 50% disapprove, p < 0.00000001”? It could be that 51% disapprove, or 60%. Even if the effect size were reported, it still wouldn’t be clear how accurate that number is.
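A poll’s margin of error is just the half-width of a 95% confidence interval for a proportion. A quick sketch of the arithmetic (the sample size of 600 is my guess, chosen only because it roughly reproduces the +/- 4-point margin reported above):

```python
import math

def moe_95(p, n):
    """95% margin of error for a sample proportion p with n respondents:
    1.96 standard errors of the proportion."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

p_disapprove = 0.55  # poll result from the text
n = 600              # assumed sample size, not reported in the poll
moe = moe_95(p_disapprove, n)
print(f"{p_disapprove:.0%} +/- {moe * 100:.1f} points")
```

Reading off the interval, 55% +/- 4 points means the true disapproval rating plausibly sits anywhere from 51% to 59%, which is exactly the range the p-value framing hides.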
To be fair, polls have the luxury of large, representative samples. With N = 1000, you can calculate a point estimate within a margin of error of 3 or 4 points. By comparison, in the first experiment from my last post, the control condition had a margin of error close to 8 points on a 100-point scale; the treatment was closer to 9 points. Since in psychology we like mean bars, using confidence intervals instead of standard error bars would appear to eat up an entire mean. I guess we have to ask ourselves what we care about. Do we care about honestly representing differences, or do we care about little p-values?
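The visual point about mean bars comes down to the fact that a 95% confidence interval is roughly twice the standard error. A small sketch with hypothetical control-condition scores (stand-ins on a 100-point scale, not the data from the post):

```python
import math
import statistics

# Hypothetical control-condition scores (stand-in data).
scores = [62, 71, 48, 55, 80, 66, 59, 73, 51, 68, 44, 77, 60, 65, 52, 70]

mean = statistics.mean(scores)
se = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error bar
ci_half = 1.96 * se  # ~95% CI half-width: about double the SE bar

print(f"mean = {mean:.1f}, SE bar = +/-{se:.1f}, 95% CI = +/-{ci_half:.1f}")
```

Swapping SE bars for CI bars on a plot doubles the whiskers, which is why honest intervals can look like they swallow the mean.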