Wednesday, September 29, 2010

Where does bad fMRI science writing come from?

A perennial favorite topic on science blogs is the examination of badly designed or badly interpreted fMRI studies. However, little time is spent on why there is so much of this material to blog about! Here, I’m listing a few reasons why mistakes in experimental design and study interpretation are so common.

Problems in experimental design and analysis
These are problems in the design, execution and analysis of fMRI studies.

Reason 1: Statistical reasoning is not intuitive
Already, I have mentioned the non-independence error in the context of clinical trial design. In fMRI, researchers often ask questions about activity in certain brain areas. Sometimes these areas are anatomically defined (such as early visual areas), but more often they are functionally defined, meaning they cannot be distinguished from surrounding areas by the physical appearance of the structure, but are instead defined by responding more to one type of stimulation than another. One of the most famous functionally defined areas is the Fusiform Face Area (FFA), which responds more to faces than objects. The non-independence error often comes from these functionally defined areas. It is completely kosher to run a localizer scan containing examples of stimuli known to drive your area and stimuli known to contrast with it (faces and objects, in the case of the FFA), and then run a separate experimental block containing whatever experimental stimuli you want. Then, when you analyze your data, you test your experimental hypothesis using the voxels (“volumetric pixels”) defined by your localizer scan. What is not acceptable is to run one long block and use the same data both to define your region of interest and to test your experimental hypothesis.
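To see why the separation matters, here is a minimal sketch of the non-independence error as a toy simulation. Everything in it is an assumption made for illustration: the voxel count, the trial count, the "top 5%" selection rule and the function name `simulate_selection_bias` are all hypothetical, and every simulated voxel is pure noise with a true effect of zero.

```python
import random
import statistics


def simulate_selection_bias(n_voxels=2000, n_trials=20, top_fraction=0.05, seed=0):
    """Compare a circular ROI analysis with an independent one on pure noise.

    Every voxel has a true effect of zero, so any apparent effect in the
    selected region is produced by the selection step itself.
    """
    rng = random.Random(seed)

    def noisy_run():
        # Mean response of each voxel over n_trials of Gaussian noise.
        return [
            statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_trials))
            for _ in range(n_voxels)
        ]

    selection_run = noisy_run()  # plays the role of the localizer scan
    held_out_run = noisy_run()   # plays the role of the separate experimental block

    # Define the "region of interest" as the voxels that responded most
    # strongly in the selection run.
    n_selected = max(1, int(n_voxels * top_fraction))
    ranked = sorted(range(n_voxels), key=lambda v: selection_run[v], reverse=True)
    roi = ranked[:n_selected]

    # Circular: measure the effect in the same data used to pick the voxels.
    circular = statistics.fmean(selection_run[v] for v in roi)
    # Independent: measure the effect in data the selection never saw.
    independent = statistics.fmean(held_out_run[v] for v in roi)
    return circular, independent


circular, independent = simulate_selection_bias()
print(f"effect measured on the selection data: {circular:.3f}")
print(f"effect measured on held-out data:      {independent:.3f}")
```

Under these assumptions, the circular estimate comes out clearly positive even though nothing is there, while the held-out estimate hovers near zero. Real fMRI data have spatial and temporal structure that this sketch ignores.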

A separate but frequent error in fMRI data analysis is the failure to correct for multiple comparisons. There are hundreds of thousands of voxels in the brain, so when each one is tested separately, it is very likely that some voxels will show high activation through random chance alone. Craig Bennett and colleagues made this point in a memorable way when they found a 3-voxel area in the brain of a dead salmon that responded to photographs of emotional situations. Of course, the deceased fish was not thinking about highly complex human emotions; the area was due to chance.
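The arithmetic behind this can be sketched in a few lines. The numbers below (100,000 voxels, a per-voxel threshold of 0.001) are illustrative assumptions, not figures from any particular study, and the calculation treats voxels as independent tests, which real voxels are not. The Bonferroni correction is shown as one simple, conservative remedy; it is not necessarily what any given analysis package applies.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests that pass the threshold when no effect exists."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha):
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


n_voxels = 100_000  # illustrative assumption
per_voxel_alpha = 0.001  # illustrative uncorrected threshold

print("expected false positives:", expected_false_positives(n_voxels, per_voxel_alpha))
print("chance of at least one:  ", familywise_error_rate(n_voxels, per_voxel_alpha))
print("Bonferroni threshold for a familywise rate of 0.05:",
      bonferroni_threshold(n_voxels, 0.05))
```

With these assumed numbers, roughly a hundred voxels would be expected to pass the uncorrected threshold by chance, so a few "active" voxels in a dead fish are not surprising.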

Now, it is all too easy to read about these problems and feel very smug about the retrospectively obvious. But it’s not that these researchers are deliberately misleading or dumb. Stated another way, the non-independence problem is “we found voxels in the brain that responded to X, and then correlated this activation with Y”, which sounds perfectly reasonable. Part of the controversy over “voodoo correlations” stems from the fact that, intuitively, there doesn’t seem to be much difference between correct and incorrect data analysis. Another important factor in the persistence of incorrect analysis is that statistical reasoning is not intuitive, and that our intuitions have systematic biases.

Reason 2: It is both too easy and too hard to analyze fMRI data
There are many steps to fMRI data analysis, and there are several software packages available to do this, both free and non-free. Data fresh out of the scanner need to be pre-processed before any data analysis takes place. This pre-processing corrects for small movements made by subjects, smooths the data to reduce noise, and often warps each individual’s brain to a standard brain. For brevity, I will refer the reader to this excellent synopsis of fMRI data analysis. The problem is that it is altogether too easy to “go through the motions” of data analysis without understanding how decisions made about various parameters affect the result. And although there is wide consensus about the statistical parameters used by analysis packages, this paper shows that differences in statistical decisions made by software developers have big effects on the overall results of the study. It is, in other words, too easy to go through analysis motions that are too hard to understand.
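As a hedged illustration of how a single pre-processing parameter can change a result, here is a toy one-dimensional sketch of smoothing. It is not the algorithm of any real fMRI package: the signal, the kernel widths, the threshold of 0.5 and the helper names `gaussian_kernel`, `smooth` and `count_above` are all assumptions made for this example.

```python
import math


def gaussian_kernel(sigma):
    """Normalized Gaussian weights, truncated at three standard deviations."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve a 1-D signal with a Gaussian kernel, treating the outside as zero."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    smoothed = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        smoothed.append(acc)
    return smoothed


def count_above(signal, threshold):
    """Number of samples that exceed the threshold."""
    return sum(1 for value in signal if value > threshold)


# A noiseless toy "activation": three adjacent samples of height 1.
signal = [0.0] * 50
signal[24:27] = [1.0, 1.0, 1.0]

for sigma in (1.0, 4.0):
    smoothed = smooth(signal, sigma)
    print(f"sigma={sigma}: peak={max(smoothed):.3f}, "
          f"samples above 0.5: {count_above(smoothed, 0.5)}")
```

In this sketch the same small activation survives a fixed threshold with a narrow kernel and disappears with a wide one, which is the kind of parameter dependence that is easy to miss when simply clicking through an analysis.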

Problems in stretching the conclusions of studies
In contrast, these are problems in the translation of a scientific study to the general public.

Reason 3: Academics are under pressure to publish sexy work
As I discussed earlier, academia is very competitive, and it is widely believed that fMRI publications have higher impact in hiring and tenure decisions than do behavioral studies. (Note, I have not found evidence of this, but it seems like someone should have computed it). Sexy fMRI work makes great sound-bites for university donors. (See “neuro-babble and brain-porn”). Here, slight exaggerations of the conclusions may creep in (and noisy peer review does not catch them).

Reason 4: Journals compete with one another for the sexiest papers
Science and Nature each have manuscript acceptance rates below 10%. If we assume that more than 10% of all papers submitted to these journals have sufficient quality to be accepted, then it is likely that some other selection criterion is being applied during the editorial process, such as novelty. It is also of note that these journals have word limits of fewer than 2000 words, making it impossible to fully describe experimental techniques. Collectively, these situations make it possible for charismatically expressed papers with dubious methods to be accepted.

Reason 5: Pressure on the press to over-state scientific findings
Even for well-designed, well-analyzed and well-written papers, things can get tricky in the translation to the press. Part of the problem is that many scientists are not completely effective communicators. But equally problematic is the pressure placed on journalists to present every study as a revolution or breakthrough. The truth is that almost no published papers will turn out to be revolutionary in the fullness of time; science works very slowly. Journalists perceive that their audience would rather hear about the newly discovered “neural hate circuit” or the “100% accurate Alzheimer’s disease test” than the moderate-strength statistical association found between a particular brain area and a behavioral measure.

Reason 6: Brain porn and Neuro-babble
(I would briefly like to thank Chris Chabris and Dan Simons for putting these terms into the lexicon). “Brain porn” refers to the colored-blob-on-a-brain style images that are nearly ubiquitous in popular science writing. “Neuro-babble” is often a consequence of brain porn: when viewing such a science-y picture, one’s threshold for accepting crap explanations is dramatically lowered. There have been two laboratory demonstrations of this general effect. In one study, a bad scientific explanation was presented to participants either by itself or with one of two graphics: a bar graph or a blobby brain scan image. Participants who viewed the brain image, but not those who saw the bar graph or no image, were more likely to say that the explanation made sense. In the other, a bad scientific explanation was given to subjects, either alone or with the preface “brain scans show that….”. Non-scientist participants as well as neuroscience graduate students rated the prefaced bad explanations as better, even though the logic was equally un-compelling in both cases. These should serve as cautionary tales for the thinking public.

Reason 7: Truthiness
When we are presented with what we want to believe, it is much harder to look at the world through the same skeptical glasses.


  1. Do you happen to know any better papers which compare fMRI analysis packages than the Fusar-Poli (2010) paper to which you link? That paper does not compare analysis packages so much as two very different statistical approaches: GLM and Gaussian Random Field theory for SPM, nonparametric bootstrapping for XBAM. I don't find it surprising that these different statistical algorithms would yield different results given the deliberate and different design philosophies underlying the algorithms (e.g., XBAM makes no normality assumptions, unlike SPM). (For this reason, I also think the Fusar-Poli paper's title is misleading.) However, there probably are important differences between software implementations of the same (or similar) preprocessing and stats algorithms. Hence my original question.

  2. Offhand, no, I'm not aware of others, although I agree that such literature is needed.