Is there really a replication crisis in science? And is there really a problem with the reliability and validity of the published literature?

A tiny post here (30 words) linking to another blog on problems with the published scientific literature attracted lots of hits – and left me wondering about the vexed issue of replication and reproducibility (these are not the same things) in science, and what these things mean for the working scientist. I should clarify that there is certainly a problem, of uncertain size and scale, with sloppy science and with outright scientific fraud – a reading of Retraction Watch is salutary (and obligatory), and the cases of Diederik Stapel and Jan Hendrik Schön are jaw-dropping. OTOH, I am inclined to think the scale of the problem is smaller than the headlines might suggest, and scientists in my experience generally bend over backwards to be honest in reporting their experiments and data.

The “Many Labs” Replication Project has attracted lots of deserved (positive and negative) attention (see this for a nice summary). However, is the key problem in the literature one of non-replication and non-reproducibility? I’m not so sure it is, and here I advocate a case-based analysis to moderate thinking about this problem.

Here are some provisional thoughts, in no particular order:

1. Some central findings in science can’t or won’t or shouldn’t be replicated. There are lots of examples. Here are some:

The postulated asteroid strike at the Yucatán peninsula leading to the probable extinction of the dinosaurs won’t be replicated (that is not to say an asteroid strike won’t happen again, of course). The event is long since past, and the strike was not directly observed by humans – instead it is inferred to have occurred from the substantial pattern and body of available evidence. It is certainly the case that lots of research in astrophysics, natural history, geology and similar disciplines (to name just a few) will be exceedingly difficult if not impossible to replicate – but this is not an insuperable problem in and of itself.

Stylised diagram of the hippocampus (Photo credit: Wikipedia)

The neurosurgery conducted on patient H.M. – the bilateral removal of his hippocampal formation for therapeutic purposes – which left him suffering a profound, broad and non-resolving anterograde amnesia, will never be conducted on a patient again. Despite this non-replication, there is no doubt in the literature whatsoever that damage to the hippocampal formation and connected structures leads to a profound and non-resolving anterograde amnesia. Why is this? It is because there is a more general claim being made which supports a general pattern of converging predictions. The claim is approximately the following: explicit memory requires an intact extended hippocampal formation; damage to this structure will lead to a grave and enduring anterograde amnesia. Thousands of papers later, no-one seriously doubts this conclusion – that the extended hippocampal formation supports explicit memory formation in the human brain. And this is despite H.M.’s surgery never being conducted again.

2. The nature of the experiment generalises to a theoretical claim – and it is testing this claim that is important (and not replicating the precise original experiment). What I mean is that sometimes precise replication is not necessary, or indeed has not occurred, for certain breakthrough results that illustrate a more general case or claim. Three examples:

Examples of rat hippocampal EEG and CA1 neural activity in the theta (awake/behaving) and LIA (slow-wave sleep) modes. Each plot shows 20 seconds of data, with a hippocampal EEG trace at the top, spike rasters from 40 simultaneously recorded CA1 pyramidal cells in the middle, and a plot of running speed at the bottom. (Photo credit: Wikipedia)

John O’Keefe published his astounding paper (‘The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely-moving rat’) on the discovery of hippocampal place cells in 1971, using behavioural neurophysiological techniques that were difficult to establish and rare at the time (indeed a near-simultaneously published paper by Jim Ranck, using similar techniques, does not describe place cells – although it does describe other phenomena of great interest). So far as I know, not even O’Keefe published a further replication of his original paper. Instead, he and many other labs (including my own) have described place cells in the dorsal hippocampus across a wide variety of conditions and experimental set-ups. IOW, it is O’Keefe’s generalisable theoretical claim – simply stated, that the hippocampus is a cognitive map – that is tested, not a simple replication of his original work. And indeed place cells are easy to record, and a vast literature on them is now available.

Two other examples: I don’t believe anyone has replicated precisely and exactly Tim Bliss’s original experiments showing long-lasting potentiation (subsequently long-term potentiation or LTP) in the hippocampus of anaesthetised rabbits. Again, this is an example of a paper which makes a general claim: that synapses are plastic. If this claim is true, then it should be easily, reliably and robustly demonstrable. And it is: this search gives a count of some 12,000+ papers on LTP. And why? Donald Hebb predicted such plasticity as a near-ubiquitous property of certain central synapses as a mechanism for memory storage. Similarly, the original paper by Boyden and colleagues (2005) describing optogenetic control of neurons has probably never been precisely and exactly replicated. Why? It doesn’t need to be. The more general claim – that light (in association with appropriately expressed channelrhodopsins) can be used to control the activity of neurons – has been reliably and robustly replicated.

Long term potentiation: second stage. More receptors are found on the dendrite. (Photo credit: Wikipedia)

3. Our understanding of a stunning result should quickly move along a continuum – from observation to correlation to causal relationships to underlying mechanisms.

In the case of psychology and neuroscience, we should expect, as further experiments are conducted, to move from an interesting behavioural observation that is robustly and reliably replicated, to psychological mechanisms, and then to the underlying brain systems that support the effect (and perhaps even to the social systems that embed the effect too). The observation that people primed with age-related cues subsequently walk more slowly is genuinely astounding (and has been cited (Google Scholar) more than 2500 times). However, efforts at replication (especially this, but see this for dissent and wiki for some further quick references) have led to lots of controversy about even the existence of the effect (see this on Daniel Kahneman‘s open letter regarding the ‘train wreck looming’ for priming work). We are still some distance from having even a robust set of underlying psychological mechanisms that describe the limiting cases: does this priming effect occur in the elderly? the anxious? the depressed? teenagers? frontal-lobe patients? What are the limits of the effect in terms of modification of gait-control mechanisms? How many primes? What is the temporal duration? What mix of primes? How do differing priming mechanisms interact with each other? What experiment can I do next? And so on.

Elucidating the mechanisms is the issue here – and providing a theoretical account that interfaces with related domains and that, because it is fruitful, can in turn be tested. ‘The Hippocampus as a Cognitive Map‘ is a super example of such a theory; Hebb’s ‘Organization of Behavior‘ is another.

4. A related issue: statistical reporting and data-warehousing. There are too many points to make about poor statistical treatment and reporting, so I’ll pick on a few personal bugbears. One that drives me insensate as a referee is authors including a sentence of weasel words about ‘a trend toward significance’. I always ask for such sentences to be removed. p<0.05 is not some gold medal to be attained, and p<0.06 is not really a silver medal. Statistical outcomes like this minimally suggest that the study is statistically underpowered, that there is poor experimental design or control, or (get over it!) that there is no effect present. Treat all such intrusive thoughts as an invitation to redo the study with extra numbers and extra thought. Another bugbear is that journals don’t routinely require supplementary data-warehousing: data storage is, to a first approximation, free. Requiring authors to provide a pdf or tab-delimited file of their statistical outputs (even if there is an embargo on its release because the dataset can be further reworked, and even if it only goes to the referees initially) would be a good first step. Standards for reporting will evolve – but minimally providing the data outputs is a sound place to start.
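The underpowering point can be made concrete with a quick simulation – a sketch only, with an assumed effect size (Cohen's d = 0.4) and illustrative group sizes, none of which come from any study discussed above:

```python
# Sketch: why a p-value hovering near 0.06 often signals an underpowered
# study rather than a "trend toward significance". The effect size (d = 0.4)
# and group sizes below are illustrative assumptions, not values from the post.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(n, effect=0.4, trials=2000, alpha=0.05):
    """Estimate the power of a two-sample t-test by simulation:
    the fraction of runs, with a true effect present, that reach p < alpha."""
    hits = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)      # control group, no effect
        b = rng.normal(effect, 1.0, n)   # treated group, true effect present
        hits += stats.ttest_ind(a, b).pvalue < alpha
    return hits / trials

for n in (15, 30, 100):
    print(f"n = {n:3d} per group: estimated power ~ {power(n):.2f}")
```

With 15 per group, a test like this detects the (real) effect only a small minority of the time, so p-values drifting through the 0.05–0.10 zone are entirely expected; with 100 per group, the same effect is detected reliably. "Redo the study with extra numbers" is, in effect, a power calculation.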

I haven’t really answered my title questions, have I? I guess my final point is that we shouldn’t think about these issues in blunt, categorical either/or terms. A case-study approach can lead to very different conclusions.


Author: Shane O'Mara