Reproducing results: how big is the problem?

“Modern scientists are doing too much trusting and not enough verifying – to the detriment of the whole of science, and of humanity. Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis.” This was the conclusion of The Economist’s leader writers in 2013, after the magazine published a story on what is often referred to as science’s “reproducibility crisis”.

Worries about irreproducibility – the inability of researchers to obtain the same results when an experiment is rerun under the same conditions – came to the fore again last week, when a landmark effort to reproduce the findings of 100 recent papers in psychology failed in more than half the cases (“More than half of psychology papers are not reproducible”, 27 August). But the concerns are not new. Dorothy Bishop, professor of developmental neuropsychology at the University of Oxford, who chaired an Academy of Medical Sciences conference on the issue in April, recently pointed out on her blog that reproducibility was a significant worry for the 17th-century scientist Robert Boyle. He lamented that “you will find…many of the Experiments publish’d by Authors, or related to you by the persons you converse with, false or unsuccessful”.

According to Brian Nosek, a professor of psychology at the University of Virginia and co-founder and executive director of the Center for Open Science, which ran the psychology reproducibility project, methodology texts in the 1960s mention many of the same problems and discuss some of the same solutions that have been highlighted recently. Two decades ago, an editorial in the British Medical Journal decried “the scandal of poor medical research”, carried out by “researchers who use the wrong techniques (either wilfully or in ignorance), use the right techniques wrongly, misinterpret their results, report their results selectively, cite the literature selectively, and draw unjustified conclusions”. And John Ioannidis’ landmark 2005 paper on “Why most published research findings are false” has been viewed nearly 1.4 million times.

But the issue of reproducibility really began to reach mainstream scientific and public consciousness after the 2011 publication of a paper in Nature by researchers from Bayer HealthCare, a German pharmaceutical company. The paper, “Believe it or not: how much can we rely on published data on potential drug targets?”, reported that the company had been able to replicate only between 20 and 25 per cent of 67 published preclinical studies, mostly in cancer.

The alarm was reinforced in 2012 by another Nature paper, “Drug development: raise standards for preclinical cancer research”, which reported that the Californian pharmaceutical company Amgen had been able to reproduce just six of 53 “landmark” cancer studies it tested. It described that 11 per cent success rate as “shocking”: “Clearly there are fundamental problems in both academia and industry in the way such research is conducted and reported,” the paper concluded.

According to Nosek, the lack of detail in the Bayer and Amgen papers about what they actually did prompted some academics to dismiss them entirely, on the grounds that “we have no idea if they did anything competently”. And he concedes that although there is much circumstantial and theoretical evidence of problems, such as that published by Ioannidis, “direct evidence” is still lacking.

But, for Mark Winey, a professor of molecular, cellular and developmental biology at the University of Colorado Boulder, who recently chaired a “task force” on irreproducibility for the American Society for Cell Biology, the Bayer and Amgen papers were “a real wake-up call”. “There were concerns about cell line contamination going back to the 1960s…but those papers raised broader issues about other types of reagents and the lack of detail in published protocols,” he says.

Chris Chambers, head of brain stimulation at Cardiff University, says that another part of the reason for irreproducibility’s rise to prominence is the attention generated in recent years by a string of major research fraud cases, perhaps most famously that of Diederik Stapel, the eminent Dutch social psychologist who turned out to be a serial fabricator of data. Chambers shares the common view that even if fraud is more common than is typically acknowledged, it is unlikely to be the major reason for such high levels of irreproducibility. However, “in the process of trying to understand how fraud cases could have happened, you identify all these other problems that aren’t fraud but are on the spectrum”, he explains.

But why do people find themselves adopting practices that are on the fraud spectrum in the first place? One reason frequently cited is the overvaluation by funders and institutions of publications in high-impact journals. The claim is that while researchers are busy cutting corners and torturing data in order to secure that career-defining publication, top journals’ concern with maximising their impact factors and their prominence in the mainstream press leads them to, in Winey’s words, “push papers through with insufficient review or addressing of concerns”.

“The incentives that motivate individual scientists are completely out of step with what is best for science as a whole,” Chambers says. “If we built aircraft the way we do basic biomedical research, nobody would fly because it wouldn’t be safe. But in biomedicine risk-taking is rewarded.”

Nosek agrees: “It is not necessarily in my interest to learn a new statistics technique or show you all the false starts we had. You would probably get different answers from scientists to the questions: ‘Do you want your paper to be reproducible?’; ‘Do you hope that it is?’; and ‘Do you think that it actually is?’”

He says a colleague once warned junior colleagues never to try to carry out direct replication of their own work lest they be “confronted with the effect going away. That is crazy in terms of how science is supposed to operate.”

Journals’ supposed reluctance to publish negative findings is also blamed for the fact that any number of labs may waste time attempting to pursue research avenues or build on results that others have already found to be flawed.

Journals’ desire for neat stories is, in turn, part of the reason for the widespread perception that the methods sections of papers do not supply enough information about what was done to permit replication.

According to Elizabeth Iorns, founder and chief executive of contract research company Science Exchange, the reality of science is “messy”, so “people exclude things that don’t fit perfectly with the story, which means you aren’t seeing the whole picture”.

Another bugbear is the length restrictions print journals typically impose on methods sections. However, even in online journals with unlimited space, methodological detail is often lacking. As Nosek says: “I don’t want to have to show all these things as an author, and I don’t care to ask for them as a reviewer. We are our own worst enemies.”

In a 2014 Nature article setting out the concerns of the US National Institutes of Health about irreproducibility, Francis Collins, the agency’s director, and Lawrence Tabak, its principal deputy director, added that “some scientists reputedly use a ‘secret sauce’ to make their experiments work – and withhold details from publication or describe them only vaguely to retain a competitive edge”.

According to Iorns, other bars to the reproduction of published findings include the difficulty of contacting the original experimenters, who have often moved on and left their lab books behind, and the difficulty of obtaining the materials that they used, such as genetically modified animals.

Concerns also abound about the purity of commercially produced reagents and cell lines. “You have to test that what you have got is what you think it is,” according to Chambers.

The use of statistics is another major worry. According to Chambers, the pressure on researchers to “crank out” papers means that they are more likely to carry out a succession of small studies rather than one larger one. But this runs the risk that the studies are “statistically underpowered”, lacking enough data points to draw reliable conclusions. This means that the experimenter is “more likely to miss a true discovery, but also more likely to find something that isn’t real”. A particular concern that animals are essentially wasted in statistically underpowered experiments led the UK research councils earlier this year to begin requiring grant applicants to demonstrate that their experiments will give “robust results”.
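Chambers’ point about underpowered studies can be illustrated with a rough calculation. The sketch below is not drawn from any of the studies mentioned here; it assumes a simple two-group comparison analysed with a two-sided z-test, a moderate true effect of half a standard deviation, and, purely for illustration, that one in 10 of the hypotheses being tested is actually true. The function names and all of these numbers are hypothetical.

```python
from statistics import NormalDist


def power_two_sample_z(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z-test with unit variance in each group.

    effect_size is the true difference in means, in standard deviations.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


def positive_predictive_value(power, prior, alpha=0.05):
    """Share of 'significant' results that reflect a real effect.

    prior is the assumed fraction of tested hypotheses that are true.
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative assumptions only: effect of 0.5 SD, 1 in 10 hypotheses true.
for n in (10, 100):
    power = power_two_sample_z(0.5, n)
    ppv = positive_predictive_value(power, prior=0.1)
    print(f"n per group = {n:3d}: power = {power:.2f}, "
          f"chance a significant result is real = {ppv:.2f}")
```

Under these assumptions, the study with 10 subjects per group detects the real effect only about one time in five, and a significant result from it is considerably less likely to be genuine than one from the study with 100 per group. That is both halves of Chambers’ warning: more missed discoveries and more findings that are not real.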

Statistics are often crucial to the claim that there is a causal link between two observed phenomena. Typically, the hypothesised link is deemed to be effectively proved when the probability of obtaining a result at least as striking by chance alone, if no real link existed, is less than 5 per cent – or, in technical terms, where the “p-value” is less than 0.05. Critics assert that the concept of proof in this probabilistic context is misguided and, worse, that many unscrupulous or statistically illiterate scientists routinely engage in “p-hacking”. This involves measuring multiple variables and trawling through the results until a relationship with a p-value of less than 0.05 is uncovered. The culprits then write their paper as if that were the result they had hypothesised all along.

“It is a bit like the Texas sharpshooter fallacy, where you spray the wall with a machine gun and then draw the target around where you happened to hit,” as Chambers puts it. “In psychology,” he adds, “a very high proportion of people admit to having done this.”

The problem with p-hacking is that, according to Bishop, the statistics have a “different meaning” depending on whether the observation was genuinely hypothesised or not, because, when multiple relationships are examined, the odds of finding one that is statistically significant are relatively high.

“We have enormous statistics [programs] that do very complex things at the touch of a button, and a lot of people don’t understand quite what they are doing,” she says.
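Bishop’s point about examining multiple relationships is easy to check numerically. The sketch below is a simplified illustration, not a description of any particular study: it assumes that every relationship tested is independent and that none of them is real, so each p-value is just a random number between 0 and 1. The function names are hypothetical.

```python
import random


def familywise_error_rate(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** num_tests


def simulate_p_hacking(num_tests, num_studies=10_000, alpha=0.05, seed=0):
    """Fraction of simulated studies that find at least one 'significant' result.

    No effect is real here: under the null hypothesis a p-value is uniformly
    distributed on [0, 1], so each test is modelled as one random draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_studies):
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / num_studies


for k in (1, 5, 20):
    print(f"{k:2d} relationships examined: "
          f"expected {familywise_error_rate(k):.2f}, "
          f"simulated {simulate_p_hacking(k):.2f}")
```

With a single pre-specified test, the chance of a spurious “significant” result is the advertised 5 per cent. With 20 relationships examined, the formula gives about 64 per cent, which is why a p-value found by trawling does not carry the meaning it would have had if that relationship had been hypothesised in advance.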

A 2012 paper in the journal Psychological Science, “Measuring the prevalence of questionable research practices with incentives for truth telling”, based on a survey of 2,000 psychologists, found that various “questionable practices may constitute the prevailing research norm”. The journal Basic and Applied Social Psychology, meanwhile, recently banned all mention of p-values.
