Psychology Should Aim For 100% Reproducibility

Last week, the Open Science Collaboration reported that only 36% of a sample of 100 claims from published psychology studies were successfully replicated: Estimating the reproducibility of psychological science.

A reproducibility rate of 36% seems bad. But what would be a good value? Is it realistic to expect all studies to replicate? If not, where should we set the bar?

In this post I’ll argue that it should be 100%.


First off, however, I’ll note that no single replication attempt will ever have a 100% chance of success. A real effect might always, just by chance, fail to reach statistical significance, although with enough statistical power (i.e. by collecting enough data) this chance can be made very low.

Therefore, when I say we should aim for “100% reproducibility”, I don’t mean that 100% of replications should succeed, but rather that the rate of successful replications should be 100% of the statistical power.

In the Open Science Collaboration’s study, for example, the average power of the 100 replication studies was 0.92. So 100% reproducibility would mean 92 positive results.
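To make the arithmetic concrete, here is a quick sketch in Python (my own illustration, not anything from the paper), assuming for simplicity that every replication has the same power: treat each of the 100 attempts as an independent trial whose success probability equals its power, and "100% reproducibility" at an average power of 0.92 means roughly 92 positive results, give or take binomial noise.

```python
# A quick illustration (not from the paper): each replication is treated as
# an independent trial whose success probability equals its statistical power.
import numpy as np

rng = np.random.default_rng(0)

n_studies = 100
avg_power = 0.92  # average power reported for the OSC replication studies

expected = n_studies * avg_power                      # 92 expected successes
simulated = rng.binomial(n_studies, avg_power, size=10_000)
low, high = np.percentile(simulated, [2.5, 97.5])

print(f"Expected successful replications: {expected:.0f}")
print(f"95% of simulated runs fall between {low:.0f} and {high:.0f}")
```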

Is this a realistic goal?

Some people argue that if psychologists were only studying highly replicable effects, they would be studying trivial ones, because interesting psychological phenomena are more subtle. As one commenter put it,

Alan Kraut, executive director of the Association for Psychological Science and a board member of the Center for Open Science, noted that even statistically significant “real findings” would “not be expected to replicate over and over again… The only finding that will replicate 100 per cent of the time is likely to be trite, boring, and probably already known.”

I don’t buy this. It may be true that, in psychology, most of the large effects are trivial, but this doesn’t mean that the small, interesting effects are not replicable. 100% reproducibility, limited only by statistical power, is a valid goal even for small effects.

Another view is that interesting effects in psychology are variable or context-dependent. As Lisa Feldman Barrett put it, if two seemingly-identical experiments report different results, one confirming a phenomenon and the other not,

Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon… is true only under certain conditions.

Now, my problem with this view is that it makes scientific claims essentially unfalsifiable. Faced with a null result, we could always find some contextual variable, however trivial, to ‘explain’ the lack of an effect post hoc.

It’s certainly true that many (perhaps all!) interesting phenomena in psychology are context-dependent. But this doesn’t imply that they’re not reproducible. Reproducibility and generalizability are two different things.

I would like to see a world in which psychologists (and all scientists) don’t just report the existence of effects, but also characterise the context or contexts in which they are reliably seen.

It shouldn’t be enough to say “Phenomenon X happens sometimes, but don’t be surprised if it doesn’t happen in any given case.” Defining when an effect is seen should be part and parcel of researching and reporting it. Under those defined conditions, we should expect effects to be reproducible.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349 (6251). PMID: 26315443

  • I agree. The challenge, of course, is that to achieve a reproducibility rate defined primarily by the statistical power of the original study we need to ensure that the original study is as unbiased as possible. And as we all know, this is far from the case in psychology — publication bias and QRPs are the norm, leading to the proliferation of Type I and Type M errors. As long as these biases persist the reproducibility rate will always be some (probably quite low) fraction of the original power.

    What I would say is that when the original study is a registered report, which is about as unbiased a design as we can envisage, the reproducibility rate should hopefully get pretty close to the original power. Once a critical mass of registered reports has been published, this hypothesis could be put to the test: e.g. comparing the reproducibility of 100 RRs vs 100 non-RRs. (For the uninitiated, more on registered reports here: https://osf.io/8mpji/wiki/home/)

  • I have to disagree. There are cases where it is reasonable to publish findings which are not definitive. For instance, you may study a population that is very hard to access, or you may not have the resources to conduct more studies on a given topic. Your options in these cases are either to publish preliminary findings or to put your data in the file drawer. I think the former is the better option, and in that case you get less than 100% reproducibility. You should of course describe your findings as preliminary, avoid drawing far-fetched conclusions from them, and so on. But expecting 100% reproducibility would go against scientific progress.

    • Unfortunately, the kind of research you describe pretty much sums up the majority of psychology and cognitive neuroscience: underpowered, preliminary studies, the results of which are oversold in the interests of storytelling. I’d rather see a third option added to your two: slow down and, when testing hypotheses, invest in well-powered, genuinely prospective experiments, pooling resources as necessary to provide more reliable outcomes.

      • I agree with your suggestions. It is better to design studies so that they are definitive. Nevertheless, that is not always possible. And since it is not always possible, you will in some cases face the decision of whether to publish preliminary results or put the data in the file drawer. If you choose the former, you cannot expect 100% reproducibility. I agree that there is a problem with reproducibility, but that doesn’t necessarily mean we should aim for 100% reproducibility.

  • I doubt the average power was 92%. Not sure how they did their power calculations, but standard ones will overestimate power on average (because they treat the effect as fixed, and because small studies produce underestimates of the population SD more often than overestimates).

  • Regarding the proper reporting of context that you suggest: many people have noted that it is not possible to report all contextual factors in an experiment. But there is a simple rule (that I suggested in http://centerforopenscience.github.io/osc/2014/05/28/train-wreck-prevention/ ) regarding the reporting of context in empirical claims. That rule is: anything that isn’t claimed explicitly as contextual is assumed to be claimed as context-independent. So if you do an experiment with psychology undergrads, and your claim is about psychology undergrads and not about *Dutch* psychology undergrads, you are thereby claiming that it is a culture-independent effect. If you don’t want that, you should state in your claim that it is about Dutch students. The same goes if it only holds for males, for people younger than 28, during a full moon, during winter, or whatever.

    • Agreed.

  • In the Open Science Collaboration’s study, for example, the average power of the 100 replication studies was 0.92.

    This estimate of power is based on the reported effect sizes in the original studies and it is easy to demonstrate that these effect sizes are inflated. Thus, the true power of replication studies was not 92%. If it had been 92%, we could interpret the success rate of 36% as evidence that the replication studies were not exact replication studies (moderators, etc.). However, it is also possible that the true power was much lower than 92% and that the low success rate of 36% reflects the true power of the original studies.

    To make statements about power in the original and replication studies we need to take publication bias into account. Here is a link to a post that does it for social psychology studies in OSF.

    https://replicationindex.wordpress.com/2015/09/03/comparison-of-php-curve-predictions-and-outcomes-in-the-osf-reproducibility-project-social-psychology-part-1/

    and here is one for cognitive psychology

    https://replicationindex.wordpress.com/2015/09/05/comparison-of-php-curve-predictions-and-outcomes-in-the-osf-reproducibility-project-part-2-cognitive-psychology/

    It is clear that the replication studies in OSF did not have 92% power. Actual power estimates are 35% for social and 75% for cognitive psychology.

    • True, although if the original effect size was an overestimate (which, due to QRPs/publication bias, it probably was), the effect may always fail to replicate in one sense: the original claimed effect size will not be reproduced in the replication studies, even if the effect is significant and in the same direction.
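To illustrate the inflation described in the last few comments, here is a rough simulation sketch. The true effect size (d = 0.3), the group sizes, and the simple significance filter are all assumptions chosen for illustration, not values taken from the OSC dataset: small original studies are simulated, only the significant positive ones are "published", and a standard power calculation for a replication is then based on the published (inflated) effect sizes.

```python
# A rough sketch of effect-size inflation under a significance filter.
# All numbers here (true d, sample sizes) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_d = 0.3    # assumed true standardized effect
n_orig = 20     # per-group n in the "original" studies
n_rep = 50      # per-group n in the replication

def power_two_sample(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test for standardized effect d."""
    df = 2 * n_per_group - 2
    nc = d * np.sqrt(n_per_group / 2)      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)

# Simulate many original studies, but "publish" only the significant,
# positive ones, as a crude stand-in for publication bias.
published_d = []
for _ in range(10_000):
    a = rng.normal(true_d, 1.0, n_orig)
    b = rng.normal(0.0, 1.0, n_orig)
    t, p = stats.ttest_ind(a, b)
    if p < 0.05 and t > 0:
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published_d.append((a.mean() - b.mean()) / pooled_sd)

d_pub = np.mean(published_d)
print(f"true d = {true_d}, mean published d = {d_pub:.2f}")
print(f"nominal replication power (from published d): {power_two_sample(d_pub, n_rep):.2f}")
print(f"true replication power (from true d):         {power_two_sample(true_d, n_rep):.2f}")
```

On a typical run the mean published effect comes out well above the assumed true d, and the nominal power estimate based on it is far higher than the replication's true power, which is the gap between the 92% figure and the observed replication rate that the comments above are pointing at.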

