Even as scientists consciously rejected religion as a basis of natural knowledge, they held on to certain cultural presumptions about what kind of person had access to reliable knowledge. One of these presumptions involved the value of ascetic practices. Nowadays scientists do not live monastic lives, but they do practice a form of self-denial, denying themselves the right to believe anything that has not passed very high intellectual hurdles.
Naomi Oreskes, Playing Dumb on Climate Change
One escape, which is admittedly difficult given how deeply ingrained the logic of hypothesis testing is in the consciousness of scientists, is to acknowledge the uncertainty inherent in our estimates as communicated through confidence intervals. The fact that a confidence interval for an effect contains zero does not mean the effect is zero. It merely means that zero is in the realm of possibility, or that one cannot say with certainty what the direction of the effect is.
Andrew F. Hayes (2012) in the context of a discussion about mediational models
I’ve been some kind of research assistant for a few years now, so I’ve learned a trick or two about making graphs in Excel. However, I’ve had a lot of trouble trying to figure out how to graph two independent sample means and their differences each with 95% confidence intervals. In the spirit of my last few blogs on the shift from using means and p-values (i.e., NHST) to using point estimates, confidence intervals and effect sizes (ESCI), I want to try to use the same data to make a graph from each approach.
I’ll use data from, “an experiment to test whether directed reading activities in the classroom help elementary school students improve aspects of their reading ability. A treatment class of 21 third-grade students participated in these activities for eight weeks, and a control class of 23 third-graders followed the same curriculum without the activities. After the eight-week period, students in both classes took a Degree of Reading Power (DRP) test which measures the aspects of reading ability that the treatment is designed to improve.” I obtained these data from the Data and Story Library. I expect the treatment group to score higher on the DRP than the control group.
Results of a one-tailed, independent samples t-test with Welch correction for unequal variances reveal that students in the treatment group (M=51.5, SD=11.0) scored significantly higher, t(37)=2.31, p=0.013, on the Degree of Reading Power test than students in the control group (M=41.5, SD=17.1).
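For anyone who wants to check a report like this without the raw data, here is a minimal sketch that recomputes the Welch t statistic, the Welch-Satterthwaite degrees of freedom and a one-tailed p-value from the rounded summary statistics above. The function names are mine, and the p-value comes from a simple numerical integration of the t density rather than from a statistics package, so treat it as an approximation. Because the inputs are rounded, the output can differ from the reported values in the last digit.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch t statistic and Welch-Satterthwaite df from summary statistics."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the Student t density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps                             # steps must be even
    area = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - area * h / 3                 # half the mass lies above zero

# Treatment: M=51.5, SD=11.0, n=21. Control: M=41.5, SD=17.1, n=23.
t, df = welch_t(51.5, 11.0, 21, 41.5, 17.1, 23)
p_one_tailed = t_upper_tail(t, df)
print(f"t({df:.1f}) = {t:.2f}, one-tailed p = {p_one_tailed:.3f}")
```

The one-tailed p-value only makes sense here because the direction (treatment higher than control) was predicted in advance.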
Figure 1. Mean scores on Degree of Reading Power (DRP) by group. Error bars represent 68% confidence intervals by convention because it looks nicer.
I found a 9.95 [1.1, 18.8] point difference, d=0.7, between the scores of those students in the treatment group, 51.5 [46.5, 56.5] and those in the control group, 41.5 [34.1, 48.9].
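The intervals and the effect size in that sentence can be rebuilt from the same summary statistics. The sketch below is one way to do it, not necessarily how any particular package does it: the t critical value is found by bisection on a numerically integrated t density, the difference uses the Welch standard error and degrees of freedom, and d uses the pooled standard deviation. All function names are my own, and since the means and SDs are rounded, the results may be off from the reported ones by a tenth or two.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the Student t density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps
    area = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - area * h / 3

def t_crit(df, conf=0.95):
    """Two-sided critical value of t, found by bisection on the upper tail."""
    target = (1 - conf) / 2
    lo, hi = 0.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > target:
            lo = mid                          # tail still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2

def mean_ci(m, s, n, conf=0.95):
    """Confidence interval for a single mean."""
    half = t_crit(n - 1, conf) * s / math.sqrt(n)
    return m - half, m + half

def welch_diff_ci(m1, s1, n1, m2, s2, n2, conf=0.95):
    """Confidence interval for a difference in means, unequal variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    half = t_crit(df, conf) * math.sqrt(v1 + v2)
    return (m1 - m2) - half, (m1 - m2) + half

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

treat_ci = mean_ci(51.5, 11.0, 21)
ctrl_ci = mean_ci(41.5, 17.1, 23)
diff_ci = welch_diff_ci(51.5, 11.0, 21, 41.5, 17.1, 23)
d = cohens_d(51.5, 11.0, 21, 41.5, 17.1, 23)
print(treat_ci, ctrl_ci, diff_ci, round(d, 2))
```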
Figure 2. Means of scores by group and the mean difference between groups on Degree of Reading Power (DRP). Error bars represent 95% confidence intervals so you can actually interpret the accuracy of the means.
So there you have it. This is what the future of psychological science could look like, or close to it. I don’t know about you, but I think that the second report of the results tells me more about what matters in the experiment: how effective is the treatment condition for improving reading power? It appears we can safely conclude that the treatment improves reading power compared to a control group, but the estimate of how much improvement isn’t very precise. The 95% confidence interval spans almost 18 points, and its lower bound sits barely above 1.
As for my technical issues, I couldn’t figure out how to place the mean difference data point on the right of the graph next to the secondary axis. If anyone has any suggestions as to how this might be done in Excel, please write in the comments. Thank you.
The GRE is a strong, reliable predictor of success in graduate school. The words reliable and predictor are awesome words in research, basic and applied. If these words honestly describe some variable, then we know that not only does this variable tell us something about some outcome, but repeated measurement of this variable will not change its ability to inform us about that outcome.
What about strong predictor? What about success?
Figure 1. Adapted from the online material from Kuncel & Hezlett (2007)
I’m purposefully leaving these as open questions. I’d like to know what others think about the meaning of strength and success in research. I’ve framed them with research on the GRE, but any comments that touch on the meanings (expand on definitions) of these words are welcome.
Kuncel, N. R., & Hezlett, S. A. (2007). Standardized tests predict graduate students’ success. Science.
I want to clarify and expand upon my main point from my last blog post “New” Statistics. This blog is a place for science almost as much as it’s a place for me to practice more concise writing.
So in my post, I talked about how I was worried that the results of a second experiment were “different” from the results of the first experiment. I was worried because the p-value wasn’t sufficiently small. I addressed my worries in two ways that yielded, in effect, the same results. First, I ran t-tests of the differences between the means of the two experiments. Since the p-values were nonsignificant, I concluded that the means weren’t different. Second, I used a bootstrapping technique to demonstrate that the means from both experiments fell within 95% confidence intervals. In effect, I did the same thing in my second analysis because, by definition, a mean that falls outside a 95% confidence interval differs significantly at the .05 level.
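Since I can’t share the data, here is a minimal sketch of the kind of bootstrap I mean, run on made-up numbers. It resamples one experiment’s responses with replacement, takes the middle 95% of the resampled means (a percentile interval), and then checks whether another experiment’s mean lands inside. The data, the function name and the comparison mean of 60 are all hypothetical.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_boot=5000, conf=0.95, seed=1):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(data, k=n))
                   for _ in range(n_boot))
    alpha = (1 - conf) / 2
    lo = means[int(alpha * n_boot)]
    hi = means[int((1 - alpha) * n_boot) - 1]
    return lo, hi

# Hypothetical responses on a 1-100 scale (not the real data).
rng = random.Random(0)
experiment_1 = [min(100.0, max(1.0, rng.gauss(55, 20))) for _ in range(30)]
experiment_2_mean = 60.0  # hypothetical mean from a second experiment

lo, hi = bootstrap_mean_ci(experiment_1)
print(f"95% bootstrap CI: [{lo:.1f}, {hi:.1f}]")
print("Second mean inside the interval:", lo <= experiment_2_mean <= hi)
```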
My goal was to illustrate a difference in the framing of analyses and research results. The current frame in psychology is concerned with small p-values, whereas the proposed frame is concerned with the magnitude of standardized, estimated differences (Effect Sizes) and the accuracy of those estimates. It seems trivial, but the distinction is important. A good example comes from how we interpret polls. Take for example the current approval rating of the Affordable Care Act. As of November 10th, 55% of Americans disapprove of the new healthcare law, and 40% approve, +/- 4 percentage points. This is a margin of error based on a 95% confidence interval. It tells us that 51% might be the true disapproval rating, or maybe 44% is the true approval rating. This is fairly useful information not only because it gives us a range, but because even when we assume the true values fall on the end points, we can be confident saying that more people disapprove than approve of the ACA. Now what if this same information were framed like a lot of psychological research reports? How would we use “more than 50% disapprove, p<0.00000001?” It could be that 51% or 60% disapprove. Even if the effect size is reported, it still wouldn’t be clear how accurate this number is.
To be fair, polls have the luxury of large, representative samples. With N = 1000, you can calculate a point estimate within a margin of error of 3 or 4 points. By comparison, in the first experiment from my last post the control condition had a margin of error close to 8 points on a 100-point scale; the treatment was closer to 9 points. Since in psychology we like mean bars, using a confidence interval instead of standard error bars would appear to eat up an entire mean. I guess we have to ask ourselves what we care about. Do we care about honestly representing differences, or do we care about little p-values?
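The polling arithmetic is easy to sketch. Assuming a simple random sample and the usual normal approximation (real polls weight their samples, so their published margins run a bit wider), the margin of error for a proportion is z times the square root of p(1 - p)/n. The function name below is my own.

```python
import math
from statistics import NormalDist

def margin_of_error(p, n, conf=0.95):
    """Normal-approximation margin of error for a sample proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%
    return z * math.sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) with N = 1000 respondents.
moe = margin_of_error(0.5, 1000)
print(f"Margin of error: +/- {100 * moe:.1f} points")

# The ACA poll from above: 55% disapprove, 40% approve, +/- 4 points.
disapprove = (55 - 4, 55 + 4)
approve = (40 - 4, 40 + 4)
intervals_overlap = approve[1] >= disapprove[0]
print("Intervals overlap:", intervals_overlap)
```

Even at the end points, the lowest plausible disapproval (51%) stays above the highest plausible approval (44%), which is the whole point of reporting the interval.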
I’ve been neurotically interested in psychology research reform for the past few years. Recently, I’ve become very curious about some of the proposals about reporting statistics. Cruising through my Twitter feed I stumbled upon Geoff Cumming’s article The New Statistics: Why and How. The article is a great summary of the guidelines proposed to help better our science.
I don’t plan to detail the article point by point, but I do want to talk about one of his graphs and then explain how his illustration was useful for my own research questions.
Figure 1. Simulation graph from Cumming (2013) that I don’t have permission to reproduce.
In the section that includes this figure, Dr. Cumming quickly explains why Null Hypothesis Significance Testing (NHST) should not be used to confirm research findings. To illustrate this, he created a simulation consisting of 25 repetitions of a simple experiment with 2 independent samples. The difference between the populations is represented by the line, the circles represent different samples, and the p-values are on the left of the figure. As you can see, the circles and the confidence intervals dance around, and p-values are rarely significant. For my purposes, I want you to eyeball how often a 95% confidence interval captures the mean difference of the experiment directly below it (its replication), and then compare that count to how often a significant p-value is followed by another significant p-value.
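Since I can’t show the figure, here is a rough sketch of a simulation in the same spirit. The settings are my assumptions, not necessarily the ones in the article: 32 people per group, a true difference of 10 points and a population SD of 20. Each simulated experiment gets a 95% confidence interval on the mean difference, and I count how often that interval captures the next experiment’s mean difference, and how often a significant result is followed by another significant one.

```python
import math
import random
import statistics

# Assumed settings (not taken from the article).
N, DELTA, SIGMA = 32, 10.0, 20.0
T_CRIT = 1.999  # two-sided 95% critical value of t for df = 2 * N - 2 = 62

def one_experiment(rng):
    """Return (mean difference, CI half-width, significant at .05?)."""
    a = [rng.gauss(DELTA, SIGMA) for _ in range(N)]
    b = [rng.gauss(0.0, SIGMA) for _ in range(N)]
    diff = statistics.fmean(a) - statistics.fmean(b)
    # Equal group sizes, so the pooled variance is the average of the two.
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    half = T_CRIT * math.sqrt(pooled_var * 2 / N)
    # Significant exactly when the interval excludes zero.
    return diff, half, abs(diff) > half

def simulate(reps=5000, seed=42):
    rng = random.Random(seed)
    runs = [one_experiment(rng) for _ in range(reps)]
    pairs = list(zip(runs, runs[1:]))       # each experiment and its replication
    captured = sum(abs(nxt[0] - cur[0]) <= cur[1] for cur, nxt in pairs)
    after_sig = [nxt for cur, nxt in pairs if cur[2]]
    sig_again = sum(nxt[2] for nxt in after_sig)
    return captured / len(pairs), sig_again / len(after_sig)

capture_rate, sig_repeat_rate = simulate()
print(f"CI captures the next mean difference: {capture_rate:.0%}")
print(f"Significant result followed by another: {sig_repeat_rate:.0%}")
```

Under these assumed settings the intervals do a much better job of predicting the replication than the p-values do, which is the comparison I’m asking you to eyeball.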
To introduce my problem, I will only describe the design of a thrice replicated experiment because the research is not mine to share. In each experiment, participants received 1 of 2 treatments, or no treatment, and then they responded to a 1-item scale from 1-100 that would serve as a manipulation check for the treatments. Though the treatments were different, each was predicted to produce lower responses on the scale compared to those responses following the neutral condition. An Analysis of Variance revealed a significant difference in responses between conditions in the first experiment, but not in the second experiment. This is the problem that I want to discuss.
When I inspected the means of all 3 conditions in each experiment, I could see that they were different and why (I think) the results did not “reach significance.” My first question was, “Well, are the means really, significantly different from their corresponding means from the first experiment, even if they aren’t significantly different from each other in the second experiment?” In other words, is the mean response for the neutral condition in experiment 2 significantly different from the mean response for the neutral condition in experiment 1? What about the treatment condition means between experiments?
To test this, I took each of the means, standard deviations and sample sizes from the 3 conditions in both experiments and ran t-tests. To make extra damn sure, I made the sample sizes equal by randomly sampling the smallest group’s sample size from each of the larger samples. All the p-values were too big. Have I answered my question? Are the means statistically different? By convention, I failed to reject the null hypothesis: there is no difference between the means. Yes, I have an answer.
By this point, I felt a little better that the means weren’t statistically different, but this led me to ask, “So does this imply that these means are expected from these populations?” In other words, if we had the grant money to run this experiment over and over, to the tune of, say, thousands of times, would we expect the means from experiments 1 & 2 to show up in one of our replications?
To visualize this, I created 95% confidence intervals around the means of all 3 conditions from not only the 2 experiments that I ran, but a third experiment run by another researcher.
Figure 2. Control, Treatment 1 and Treatment 2 means from 3 experiments (mine are the 2nd and 3rd of each set)
You can see that confidence intervals from previous experiments capture means in the following ones. All is right with the world because most of the time (a little over 80% according to Cumming) a confidence interval from the previous experiment will capture the mean in the next. If I run a version of this experiment again, I’m sure that my statistical fishing nets you see here will be wide enough to snag the new means. What I hope I’ve demonstrated here is that if experiment 2 (the third mean from each condition on the graph above) was my first experiment, and if I relied on significance to make a conclusion about the differences between these conditions, then I might not run another experiment. I might conclude that these conditions do not produce meaningfully different responses on my scale, when maybe they do.
Social science has made great leaps and bounds over the past century, but it isn’t perfect. As a freshly minted undergraduate, I can assure you that students will be taught the dogma of NHST for years to come. This should change. If, as scientists, we pride ourselves on our methods for discovering truth, we should take opportunities sooner rather than later to refine those methods. Otherwise, we’re just sitting in armchairs I guess.