*[This is a guest post by Alex Kogan. Last week, Ed Yong at Not Exactly Rocket Science covered a paper positing an association between a genetic variant and an aspect of social behavior called prosociality. On Twitter, Daniel and Joe dismissed this study out of hand due to its small sample size (n = 23), leading Ed to update his post. Daniel and Joe were then contacted by Alex Kogan, the first author of the study in question. He kindly shared his data with us, and agreed to an exchange here on Genomes Unzipped. Our comments on the study are here; this is Alex’s reply.]*

It’s a truism that resonates across science: size matters when conducting a study and when interpreting its statistical (and practical) meaning. But the size of what? Well, it’s quite a few things, all of which are very important in understanding what a study is ultimately telling us. One of the first numbers researchers focus on is the p-value. The p-value relies on a bit of counterintuitive logic: it represents the percentage of times you would get an effect as big as the one you got (or bigger) if there were really no effect in the general population. So we first assume that there is really no difference in some outcome between two groups across the general population (we call this the null hypothesis), and then we ask what the chances are of finding the difference we found (or a bigger one) given this assumption. If this percentage is low (many fields adopt a p = .05 standard, i.e. a 5% chance that we’d get the effect we got or a bigger one if there is really no effect in the general population), then we can reject the initial idea that there is no difference in the general population. So what have we learned if the p-value is .05 or lower? That there is likely a difference in the general population—how big this difference is remains a mystery, however; the p-value never answers that question.
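A quick simulation illustrates what that 5% means in practice. The data here are entirely made up: both groups are drawn from the same population, so the true difference is exactly zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group = 10_000, 30

false_positives = 0
for _ in range(n_studies):
    # Two groups drawn from the SAME population: the null hypothesis is true.
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(false_positives / n_studies)  # close to 0.05
```

Over many repeated studies of a true null, roughly 5% reach p < .05 purely by chance, which is exactly what the .05 threshold promises.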

So what can affect a p-value? One big factor is the number of participants in your study: the more people you have, the smaller the p-value becomes, and the more confident you can be that your effect reflects a real difference in the population (though again, the p-value doesn’t tell us how big this difference is). Another big factor is the strength of the effect: the bigger the effect (i.e. the bigger the difference between two groups), the smaller the p-value is going to be. Finally, we can think about the chance of finding any difference in the study: what are the odds of finding a difference between the two groups on any of the different outcomes we are looking at? Here, the more outcomes we look at, the larger the true study-wide p-value becomes, because we are increasing the odds of finding a false positive (i.e. getting an effect by chance).

Let’s focus on this last point a bit more deeply. If we have one outcome we are interested in, and then do a study to see whether two groups differ in this outcome, the p-value we get is not biased. But imagine that you have 100 different outcomes in a study. Now if you check for group differences on any of these 100, odds are you will find some by chance alone; remember that a p-value of .05 still means that 5% of the time, you will get a significant difference in the study even if the original population does not have any difference between the two groups. In genome-wide association studies, this point is especially important since we are looking at many, many potential outcomes. So we adjust for the inflated chance of a spurious hit by applying a correction for 50,000 comparisons. This drops the accepted p-value to .000001, but to get anything to be significant at this level, a very large sample is necessary (and this is one of the big reasons why, for genome-wide studies, thousands of participants are necessary). This larger sample is necessitated by the fact that when doing so many comparisons, finding significant differences is not surprising; some will certainly occur by chance. So for any effect to be trusted, it must cross this much higher threshold. But when looking at just one outcome (as occurs in candidate gene studies), this problem of over inflated chance of finding false positives isn’t an issue since only 1 comparison is being done—and thus much smaller number of participants is needed to make a reasonable claim.
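The inflation from testing many outcomes, and the effect of a Bonferroni-style correction, can be sketched with simulated data. The 100 outcomes and group sizes below are hypothetical, chosen only to mirror the example above; none of the outcomes truly differ between the groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_outcomes, n_per_group = 100, 30

# One study, 100 unrelated outcomes, no true group difference on any of them.
a = rng.normal(0, 1, (n_outcomes, n_per_group))
b = rng.normal(0, 1, (n_outcomes, n_per_group))
_, pvals = stats.ttest_ind(a, b, axis=1)

# Raw threshold: a handful of outcomes typically dip below .05 by chance.
print((pvals < 0.05).sum())
# Bonferroni-corrected threshold (.05 / 100): usually nothing survives.
print((pvals < 0.05 / n_outcomes).sum())
```

With no true differences anywhere, around five of the 100 raw tests will typically come out "significant", while almost none survive the corrected threshold; this is the same logic that drives the genome-wide correction described above.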

When we evaluate whether a study is reporting a real difference, we must consider all these different factors. For starters, what is the p-value? If the p-value is .001, that means there is only a 1 in 1000 chance that the study could have gotten the effect it did (or bigger) if there were really no effect in the general population. So unless the authors were incredibly unlucky (or lucky!) or are biasing their results through other less than ethical practices, we can say that there is likely a real difference between the two groups in the general population. But here we must be extremely careful! What population are we talking about? It’s not quite everyone on the planet; the study is really only valid for the population from which the participants were drawn. So if the study was done amongst undergraduates at Harvard, then the p-value is really telling us something only about undergraduates from Harvard; we need to do studies with other populations to see whether this effect generalizes. When it fails to do so, what is likely going on is not that there is no real effect amongst the Harvard undergrad population, but instead that there are other factors that differentiate the Harvard undergrads from the other groups being studied, and these other factors are serving as moderators. All of this is extremely important when we look at replication studies. If a study is attempting a true replication, then it must conduct the replication in the same population. When the population changes, the study is introducing a plethora of new variables that can moderate the particular effect under investigation—and this is a huge issue in evaluating the medical genetics replication studies that were mentioned in the original post.

The lesson here is that dismissing studies that fail to replicate in different populations is inappropriate; a replication is only a true replication when the same population is being evaluated. When a different population is being evaluated, the study is introducing numerous confounds—and simply having a bigger number of participants in the replication than in the original study does not in any way make up for this problem.

Additionally, what is dismissed today can be revised tomorrow. For instance, the most recent meta-analysis of the serotonin transporter gene (the sadly mislabelled “depression gene”) concluded that there is indeed an effect of the gene on depression, an effect that prior meta-analyses (which used far fewer studies) had concluded does not exist. The world of research is dynamic and ever-changing, so it is generally good practice to avoid making overly strong statements about the existence (or non-existence) of any given relationship. We must be very careful about dismissing any body of work, especially in a field as young and changing as the study of genetic contributions to human behavior.

All that being said, we should not take a low p-value to mean that the effect is real and would replicate even in the same population. Researchers can do many things to make their p-values appear better than they actually are. They can screen for apparent outliers; they can collect data in waves and check whether the effect has a low enough p-value at each wave, stopping once they get their effect; or they can go on a “fishing expedition”, looking at many different outcomes and reporting only the ones that were significant (i.e. had a low p-value) without correcting for the many outcomes examined. This last issue is an especially big one because it is sadly not an uncommon practice across academic fields, and there is no way to know whether the authors did this unless they report it. So replication is a necessary component of feeling confident about the results.
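The "collect data in waves and stop once significant" strategy can be simulated directly. The design below is hypothetical (up to five waves of ten participants per group, with no true effect at all):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def study_with_peeking(waves=5, wave_size=10):
    """Collect data in waves; stop and declare success the first time p < .05."""
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(waves):
        # Both groups come from the same population: the true effect is zero.
        a = np.concatenate([a, rng.normal(0, 1, wave_size)])
        b = np.concatenate([b, rng.normal(0, 1, wave_size)])
        if stats.ttest_ind(a, b).pvalue < 0.05:
            return True  # "significant" -- a false positive by construction
    return False

n_studies = 2000
rate = sum(study_with_peeking() for _ in range(n_studies)) / n_studies
print(round(rate, 3))
```

Peeking after every wave pushes the false-positive rate well above the nominal 5%, even though each individual test uses the standard .05 cutoff.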

In many ways, my colleagues and I strongly agree with the spirit of the criticism that Joe and Daniel made. We must be extremely careful about putting too much stock in any one study, because there are so many human factors that can make a p-value look better than it should. So a result should be taken in the context of the broader literature. Our study benefits from over a dozen studies that have reported findings very consistent with ours, using much larger samples from the same general population: Caucasians in the United States. But our study is also the first to attempt to evaluate whether other people’s perceptions are predicted by genotypes. It was in fact our hope to have a much bigger sample of targets, but we sadly only had the ability to conduct our study on the sample in hand. We are now attempting much larger replication studies. Our effect must be replicated before the study is anything more than a preliminary finding; it is a start, rather than an end. And we hope it motivates future researchers to also study this particular gene.

I have focused on discussing the broad statistical issues in this post rather than the specifics of our study because Joe and Daniel’s criticism applies all too well to the majority of genetic studies looking at complex behavioral traits: most of these studies have participant numbers in the hundreds, and most look at candidate genes rather than genome-wide associations. I certainly agree that genome-wide association studies have the potential to provide far more information than candidate gene studies, but genome-wide studies are extremely restrictive because of the greatly inflated false-positive issue and the resulting p-value correction, which necessitates a very large sample (in the thousands) to detect almost anything. Sadly, data collection from such a large number of participants remains financially difficult for many labs, and pragmatically unrealistic for any truly complex designs. It is my great hope that as our fields develop, new solutions will emerge that allow truly genome-wide association studies to take place on a large enough scale to make them viable in the study of complex human traits. In the meantime, I believe there is utility to smaller-scale candidate gene studies, and I would advocate care in evaluating these studies and their replications because of the statistical issues I have discussed.

Alex,

Thanks again for your willingness to engage us here. One comment:

“when looking at just one outcome (as occurs in candidate gene studies), this problem of over inflated chance of finding false positives isn’t an issue since only 1 comparison is being done—and thus much smaller number of participants is needed to make a reasonable claim.”

This is a common fallacy, but a fallacy nonetheless. This is perhaps best illustrated by noting that the logical outcome of this line of reasoning is that, if you had genome-wide data but were only interested in a single gene, you should throw away the rest of the data before looking at it!

A very nice discussion of these issues is here:

http://www.nature.com/nature/journal/v447/n7145/box/nature05911_BX1.html

One of their conclusions is perhaps relevant here:

“when comparing two studies for a particular disease, with a hit with the same MAF and P value for association, the likelihood that this is a true positive will in general be greater for the study that is better powered, typically the larger study. In practice, smaller studies often employ less stringent P-value thresholds, which is precisely the opposite of what should occur.”

This is the fundamental issue:

Large sample sizes require more money. In the startup space, you can put out a flawed minimum viable product, get some cash, and do it right the second time. Kogan’s study is sort of the academic equivalent of this.

Alex, thanks for this; one thing that I’m not clear on (and I haven’t seen your raw data, of course) is that in the paper you show the scores and the standard deviations for the groups, but these SDs suggest the differences are non-significant. The bar graph showing the differences between the groups has much smaller error bars than the SDs imply, so on what basis can you say that the genotypes in fact differ? I would suggest that one interesting plot might be a dot-plot rather than a bar chart. It would also be interesting to see how the different observers view the different “subjects”.
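For what it’s worth, here is a minimal sketch of the dot-plot suggestion using matplotlib. The scores and group sizes below are invented for illustration, not the study’s data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical trustworthiness ratings for two genotype groups.
groups = {"GG": rng.normal(5.2, 1.0, 10), "AG/AA": rng.normal(4.6, 1.0, 13)}

fig, ax = plt.subplots()
for i, (label, scores) in enumerate(groups.items()):
    # Jitter x slightly so overlapping points stay visible.
    x = i + rng.uniform(-0.05, 0.05, len(scores))
    ax.plot(x, scores, "o", alpha=0.6)
    ax.hlines(scores.mean(), i - 0.2, i + 0.2)  # mark the group mean

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups.keys())
ax.set_ylabel("Mean trustworthiness rating")
fig.savefig("dotplot.png")
```

Unlike a bar chart, every observation is visible, so readers can judge the overlap between groups for themselves.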

Cheers,

-Shane @shanemuk

“So what have we learned if the p-value is .05 or lower? That there is likely a difference in the general population—”

“So unless the authors were incredibly unlucky (or lucky!) or are biasing their results through other less than ethical practices, we can say that there is likely a real difference between the two groups in the general population.”

These statements are incorrect. They are examples of the p-value fallacy:

http://www.graphpad.com/faq/viewfaq.cfm?faq=1317

http://www.annals.org/content/130/12/995.abstract

http://en.wikipedia.org/wiki/P-value#Misunderstandings

“Calculation of a P value is predicated on the assumption that the null hypothesis is correct. P values cannot tell you whether this assumption is correct.”

Bayesians say that p-values are meaningless and should not be used to prove or disprove anything.

Raw data not being available, we can still do reasonable inference from this: “… people with two G-copies came across better than their peers, regardless of gender. Of the ten most trusted listeners, six were double G-carriers, while nine of the ten least trusted listeners had at least one A-copy.”

The quantity we should estimate is the distribution of the probability that a person having two copies of G is trusted (T), given the data above. Since we don’t know the number of G-people, we must average our estimate of this probability over all prior assumptions that the total number of G-people is from 6 to 14 (9 people are A-type). It turns out that the distribution of P(T|G) is pretty wide. The mode of the distribution is around 0.5, but the mass is concentrated at values greater than 0.5. In fact, the probability that P(T|G) is greater than 0.5 is 0.69. That’s all we know. Literally. I might have missed or goofed a few numbers, but I doubt we can extract more theory and more confidence from the data above.

We can build other theories, more or less detailed, but what we should test is the theory, not the null-theory. “The Cult of Statistical Significance” is a good start.

Another reference:

http://www.tqmp.org/Content/vol03-2/p043/p043.pdf

“For instance, Cohen and Cohen (1975) demonstrate that with a single predictor that in the population correlates with the DV at .30, 124 participants are needed to maintain 80% power. With five predictors and a population correlation of .30, 187 participants would be needed to achieve 80% power.”

Also:

Wilkinson, L., & Task Force on Statistical Inference, APA Board of Scientific Affairs. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Incidentally, criticisms of statistical significance have been noted as far back as 1938:

Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526-536.

Hi Joe Pickrell,

“This is a common fallacy, but a fallacy nonetheless. This is perhaps best illustrated by noting that the logical outcome of this line of reasoning is that, if you had genome-wide data but were only interested in a single gene, you should throw away the rest of the data before looking at it!”

Can I extrapolate the logic in another direction, so that in one genome-wide study (e.g. on height), to correct the p-value you would need to take into account all previously published GWAS (e.g. on height)?

Hi Calvin,

I don’t really like to think in terms of “correcting” a p-value. Instead, let’s think about the probability that a variant is truly associated with a disease. This probability depends on the level of evidence from the study, as well as on the prior probability of the association. All these sorts of “correction factors” go into the prior. Should the number of tests you do influence your prior? I would say no. Should previous studies influence your prior? Probably. On the other hand, you might want to be conservative and not use the previous studies.
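One way to make this concrete is the standard Bayes-rule calculation relating power, the significance threshold, and the prior. The prior and power values below are made-up illustrations, not estimates from any real study:

```python
def prob_true_positive(prior, power, alpha=0.05):
    """P(association is real | p < alpha), by Bayes' rule.

    prior: prior probability that the variant is truly associated
    power: probability of reaching p < alpha when the association is real
    alpha: probability of reaching p < alpha when it is not
    """
    return (power * prior) / (power * prior + alpha * (1 - prior))

# Same alpha and same prior: the better-powered study yields more believable hits.
print(prob_true_positive(prior=0.01, power=0.8))  # well-powered study
print(prob_true_positive(prior=0.01, power=0.1))  # underpowered study
```

With a 1% prior, a significant hit from the well-powered study carries roughly a 14% posterior probability of being real, versus about 2% for the underpowered one: the same p-value threshold, but very different believability, which is the point of the Nature box quoted above.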

Two points:

1) The discussion on populations and replications fails to address sampling error within a population, what makes a population distinct, and, most importantly, why any finding from a sample within a population is of importance to individuals outside that sample (i.e. why science uses inferential statistics in the first place).

2) The Karg study is a poor example of using meta-analytic techniques to test how the serotonin transporter gene moderates depression, as many of the ‘newly included’ studies came from selected samples (e.g. the entire sample had heart disease) in which no moderation was even tested. Furthermore, the example was presented in this post as a meta-analysis of a main effect, whereas the meta-analyses were considering an interaction.
