Harris cites numerous examples of Type I errors (also known as false positives) in efficacy studies throughout the book—i.e., drugs that worked in disease models but then failed in human trials. Chapter Three, “A Bucket of Cold Water,” is largely devoted to such failures in amyotrophic lateral sclerosis (ALS, or Lou Gehrig’s disease). ALS is a particularly telling example: the animal models are of questionable relevance at best, at least in part because the pathophysiology of the disease is only partially understood.
In 1993, it was discovered that some patients with the familial form of the disease carry a mutation in the SOD1 gene, which codes for superoxide dismutase, an enzyme whose normal function is the detoxification of superoxide radicals. The role of SOD1 in ALS pathogenesis is still not fully understood, but transgenic mice expressing mutant SOD1 did exhibit neurodegenerative signs, and scientists seized on this as, at last, a means of screening for compounds that might be clinically effective in ALS. Unfortunately, this turned out to be a little like the drunk searching for his keys under the streetlight[1]: none of the active leads turned out to be effective in ALS patients. Harris is highly critical of rushing such compounds into clinical testing, but let’s face it—if the drunk finds a set of keys under the streetlight, shouldn’t he check whether they are his?
Some of the animal studies may have been poorly designed, as Harris charges, but we suggest that the more fundamental problem is the complexity of the disease and the still-preliminary state of our understanding of its pathogenesis. Many other mutations have now been identified in familial ALS patients, so there is cause for optimism that improved understanding may help in unwinding this intricate problem. It is sobering to realize that the familial form of the disease accounts for only about 10% of all cases, and only part of that 10% is associated with the known mutations. One thing is certain: animal research will play a key role in the solution.
Harris’s main criticism of the SOD1 studies is that they were underpowered (i.e., used too few animals). When the ALS Therapy Development Institute re-tested eight drugs that had seemed active in mice but failed in humans (using improved protocols, with greater numbers of animals in particular), none of them were active. The risk that an underpowered experiment will produce false negatives is well known, but the capacity of underpowered studies to produce false positive results by stochastic processes is not always appreciated. Statistical clustering can occur in just about any population. In a large sample, random clusters tend to “average out,” and any remaining cluster may represent a real association. However, if a small sample happens to be drawn from a random cluster, the association may appear to be “real” if its P-value is small enough. As John Bohannon of the Chocolate Diet Hoax[2] puts it, “Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a ‘statistically significant’ result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. … That study design is a recipe for false positives.”[3] This does not necessarily mean that low-powered studies should be avoided—but it does mean that positive results from low-powered studies should be confirmed.
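Bohannon’s point is easy to demonstrate numerically. The sketch below is our own illustration (not from Harris or Bohannon): it treats each of the 18 measurements as an independent test with a 5% false-positive rate and estimates how often at least one comes up “significant” when no real effect exists anywhere.

```python
import random

def familywise_rate(k, alpha=0.05, trials=100_000, seed=1):
    """Estimate how often at least one of k independent tests is
    'significant' at level alpha when the null is true everywhere."""
    random.seed(seed)
    hits = sum(
        1 for _ in range(trials)
        if any(random.random() < alpha for _ in range(k))
    )
    return hits / trials

# 18 measurements, as in the chocolate study: roughly a 60% chance
# of at least one false positive (analytically, 1 - 0.95**18 ≈ 0.603).
print(familywise_rate(18))
```

With a single measurement the rate is the nominal 5%, but it climbs rapidly with the number of endpoints; this is the multiple-comparisons problem in miniature, and it is why a lone “significant” hit from a small, many-endpoint study deserves confirmation rather than a press release.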
It is worth noting that measuring a large number of things about a small number of animals is precisely what we do in most toxicology studies. We measure body weights, organ weights (absolute and normalized), and 40 or so clinical pathology parameters, among other things. It is quite likely that one or more of these will yield a difference that is “statistically significant” at P≤0.05, even if there is no underlying cause. Therefore, we look for correlations with other parameters in order to interpret the result. For example, if ALT is elevated, are other liver markers elevated? If so, is there histopathologic evidence of liver toxicity? If not, and unless the increase is extreme, it is likely to be a statistical fluctuation.
In our next post, we will discuss some of the consequences of the choice of P=0.05 as the threshold of statistical significance.
[1] A passerby observed an obviously inebriated man crawling around on his hands and knees under a streetlight. “What are you looking for?” the passerby inquired.
“My keys,” said the drunk.
Trying to be helpful, the man asked, “Where did you last have them?”
“Over there on the other side of the street.”
“Well, if you lost them over there, why are you looking over here?”
“I’ll never find them over there—it’s too dark. The light is better over here,” replied the drunk.
[2] Bohannon and his collaborators divided their subjects into three groups, one of which specifically included bitter chocolate in their diet. The subjects weighed themselves each day for 21 days and finished with a round of questionnaires and blood tests. As it turned out, the chocolate group lost weight faster than the other groups, and the difference was statistically significant. Bohannon et al. were able to get their results published, which set off a wave of purported weight-loss diets based on chocolate consumption, probably not the only example of a fad diet based on faulty science.
[3] See http://www.npr.org/sections/thesalt/2015/05/28/410313446/why-a-journalist-scammed-the-media-into-spreading-bad-chocolate-science