The Unreasonable Ineffectiveness of Fisherian “Tests” in Biology, and Especially in Medicine
Deirdre N. McCloskey
Stephen T. Ziliak

Abstract
Biometrics has done damage with levels of R or p or Student’s t. The damage widened with Ronald A. Fisher’s victory in the 1920s and 1930s in devising mechanical methods of “testing,” against methods of common sense and scientific impact, “oomph.” The scale along which one would measure oomph is particularly clear in biomedical sciences: life or death. Cardiovascular epidemiology, to take one example, combines with gusto the “fallacy of the transposed conditional” and what we call the “sizeless stare” of statistical significance. Some medical editors have battled against the 5% philosophy, as did, for example, Kenneth Rothman, the founder of Epidemiology. And decades ago a sensible few in education, ecology, and sociology initiated a “significance test controversy.” But grantors, journal referees, and tenure committees in the statistical sciences had faith that probability spaces can substitute for scientific judgment. A finding of p < .05 is deemed to be “better” for variable X than p < .11 for variable Y. It is not. It depends on the oomph of X and Y—the effect size, size judged in the light of how much it matters for scientific or clinical purposes. In 1995 a Cancer Trialists’ Collaborative Group, for example, came to a rare consensus on effect size: 10 different studies had agreed that a certain drug for treating prostate cancer can increase patient survival by 12%. An 11th study published in the New England Journal in 1998 dismissed the drug. The dismissal was based on a t-test, not on what William Gosset (the “Student” of Student’s t) had called, against Ronald A. Fisher’s machinery, “real” error.1
Keywords: Bayesian analysis in medicine, biometrics, Fisher, Gosset, Jeffreys, level of p, levels of t, Rothman, statistical power in medical research, statistical significance, tests of significance
October 2, 2008; accepted September 6, 2009
Biological Theory 4(1) 2009, 44–53. © 2009 Konrad Lorenz Institute for Evolution and Cognition Research
One wishes to know the probability that a biological or medical hypothesis, H, is true in view of the sadly incomplete facts of the world. It is a problem of inference, inferring the likelihood of a result from the data. If the symptoms of cholera start in the digestive system, then ingestion of something, perhaps foul water, is a probable cause. If cases of cholera in London in 1854 cluster around particular public wells, then bad water [is a probable cause].

But the statistical tests used in many sciences (though not much in chemistry or physics) do nothing to aid such judgments. The tests that were regularized or invented in the 1920s by the great statistician and geneticist Ronald A. Fisher (1890–1962) measure the probability that the facts you are examining will occur assuming that the hypothesis is true. Our point is that by itself, unless in a decision-theoretic context in which the other relevant probabilities and their substantive importance are calculated, such a test is mistaken. The mistake here is known in statistical logic as “the fallacy of the transposed conditional.” If cholera is caused not by polluted drinking water but by bad air, then economically poor areas with rotting garbage and open sewers will have large amounts of cholera. They do. So, cholera is caused by bad air. If cholera is caused by person-to-person contagion, then cholera cases will often be neighbors. They are. So, cholera is caused by person-to-person [contagion].

If the rebel Chinese general Li Zicheng was in the summer of 1645 attacked by angry peasants from whom he was stealing food, he will be dead. He is dead. Therefore, says the usual procedure of significance testing, he was attacked by peasants. If the biological hypothesis, H, is true, then observations O will be observed with high statistical significance. O is observed. Therefore, H is true. But, of course, being dead is very weak evidence that Li Zicheng was attacked by peasants, considering that by some accounts he committed suicide—and after all there are many ways to die. Statistically speaking, the power of the test of the hypothesis that Li was so attacked is undefined. To be sure, being dead is “consistent with” the hypothesis that Li was attacked by peasants, as the neo-positivist rhetoric of the Fisherian argument has it. But so what? A myriad of other hypotheses, very different from the alleged cause of the general’s death, such as committing suicide or catching pneumonia or breaking his neck in a fall from his horse, or dying from heartbreak after losing his campaign against the Manchus, are omitted from Fisherian procedures in the statistics-using sciences, though “consistent with” the fact of his being dead. The Fisherian procedure, at any rate when it proceeds (as it almost always does) without a loss function and a full discussion of Type-II error, neither falsifies nor confirms.

The psychologist and statistician, the late Jacob Cohen, made our point, a very old one, in his aptly entitled article, “The Earth is Round (p < .05).” “If a person is an American,” Cohen writes, in a parody of the Fisherian logic, “then he is probably not a member of Congress. This person is a member of Congress. Therefore, he is probably not an American” (Cohen 1994: 998). Cohen is pointing out that the illogic of being probably-not-an-American is formally exactly the same as the Fisherian test of significance. And it is mistaken. The structure of the logic is: hypothesize that P(O | H0) is low; observe O in the data; conclude therefore that P(H0 | O)—the transposed conditional of the original hypothesis—is low. The argument appears at least implicitly in article after article in scientific journals, and explicitly in most statistics textbooks.

Cohen applied the logic to an important topic in psychiatry, the misdiagnosis of schizophrenia. In the United States, schizophrenia incidence in adults is about 2%. Like a general attacked by peasants in 1645, it is rare. Let H0 = the person is normal; H1 = the person is schizophrenic, and O = the test result on the person in question is positive for schizophrenia. A proposed screening test is estimated to have at least 95% accuracy in making the positive diagnosis (discovering schizophrenia) and about 97% accuracy in declaring a truly normal case “normal.” Formally stated, P(normal | H0) is approximately 0.97, and P(schizophrenic | H1) > 0.95.

With a positive test for schizophrenia at hand, given the more than 95% assumed accuracy of the test, P(schizophrenic | H0) is less than 5%—statistically significant, that is, at p = 0.05. In the face of such evidence, a person in the Fisherian mode would reject the hypothesis of “normal” and conclude that the person is schizophrenic. Then he might proceed to do all sorts of good and bad things to the “patient.”

But the probability of the hypothesis, given the data, is not what has been tested. The probability that the person is normal, given a positive test for schizophrenia, is in truth quite strong—about 60%—not, as Fisherians believe, less than 3%, [since]

$$P(H_0 \mid O) = \frac{P(H_0)\,P(\text{test wrong} \mid H_0)}{P(H_0)\,P(\text{test wrong} \mid H_0) + P(H_1)\,P(\text{test right} \mid H_1)} = \frac{(.98)(.03)}{(.98)(.03) + (.02)(.95)} = .607,$$

a humanly important difference from p = .03. The conditional probability of a case being “normal” though testing positively as schizophrenic is, Cohen points out, “not small—of the 50 cases testing as schizophrenic [out of an imagined population of 1000 people tested], 30 are false positives, actually normal, [...]”

The example shows how confused—and humanly and socially damaging—a conclusion from a Fisherian 5% science can be. One of us has a good friend who as a child in the psychiatry-spooked 1950s was diagnosed as schizophrenic. The friend has shown since then no symptom of the disease.
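
The arithmetic of Cohen's schizophrenia example can be checked directly. The following is a minimal Python sketch, assuming the figures quoted in the text (a 2% base rate, a 3% false-positive rate, and a 95% true-positive rate). The function name `posterior_normal_given_positive` and its parameter names are illustrative choices, not anything taken from Cohen's article.

```python
def posterior_normal_given_positive(base_rate, false_positive_rate, true_positive_rate):
    """Return P(H0 | O): probability a person is normal given a positive test.

    base_rate           -- P(H1), share of the population with the condition
    false_positive_rate -- P(test wrong | H0), positive result for a normal person
    true_positive_rate  -- P(test right | H1), positive result for a true case
    """
    prior_normal = 1.0 - base_rate  # P(H0)
    false_positives = prior_normal * false_positive_rate
    true_positives = base_rate * true_positive_rate
    # Bayes' Theorem: false positives as a share of all positive results.
    return false_positives / (false_positives + true_positives)


# Figures quoted in the text (assumed here): 2% incidence, 3% and 95% rates.
posterior = posterior_normal_given_positive(0.02, 0.03, 0.95)
print(f"P(normal | positive test) = {posterior:.3f}")
# P(normal | positive test) = 0.607
```

Per 1,000 people tested, these rates give about 29 false positives against 19 true positives, which Cohen rounds to 30 of 50. Lowering the base rate pushes the posterior for “normal” higher still, which is why a small p cannot stand in for the probability of the hypothesis.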
But the erroneous diagnosis—an automatic result of the fallacy of the transposed conditional—has kept him in a state of dull terror ever since. Imagine in other arenas, with similarly realistically low priors, the damage done by the transposed conditional—in scientific work or diet pills or social welfare policy or commercial advertising or the market in foreign exchange. Once one considers the concrete implications of such a large diagnostic error, such as believing that 3% of adults tested for schizophrenia are not-schizophrenic when the truth is that 60% of them are not-schizophrenic, and realizes that, after all, this magnitude of diagnostic error is governing NASA and the departments of cardiovascular disease and breast cancer and HIV health policy, one should perhaps worry.

Part of the problem historically was another campaign of Fisher’s, following the elder Pearson, Karl: an attempt to kill off Bayes’ Theorem. By contrast, the inventor in 1908 of the t-test for small samples, the Guinness brewer and theoretical statistician William Sealy Gosset, was a lifelong Bayesian. He defended Bayesian methods against all comers—Karl Pearson, Fisher, Karl’s son Egon Pearson, Jerzy Neyman (e.g., Gosset 1915, 1922 cited in Pearson 1990: 26–27). Gosset in fact used Bayes’ Theorem in his revolutionary papers of 1908, and crucially so in “The Probable Error of a Correlation Coefficient.” In 1915 he wrote to the elder Pearson: “If I didn’t fear to waste your time I’d fight you on the a priori probability and give you choice of weapons! But I don’t think the move is with me; I put my case on paper last time I wrote and doubt I’ve much to add to it” (September 1). Gosset was courageous, but in all his fights mild and self-deprecating, including for Bayes’ methods. In the warrior culture of hardboiled-dom in the 1910s and 1920s (the Great War mattered) he was not forceful enough.

Fisher was a great deal more forceful, and wholly intolerant of “inverse probability” (Fisher 1922, 1926, 1956; cf. Zabell 1989). In Fisher’s campaigns for maximum likelihood and his own notion of “fiducial probability” (one of the few campaigns of Fisher’s that failed), he tried to kill off prior and posterior probability, and—at least with the mass of research workers as against the few highbrows—he succeeded. Egon Pearson and Jerzy Neyman were at first persuaded by Fisher to turn from Bayes’ Theorem (Pearson 1966: 9, in David 1966). But Pearson later in life, after Fisher died, reverted to his original position: “Today in many circles,” he said, “the current vogue is a neo-Bayesian one, which is of value because it calls attention to the fact that, in decision making, prior information must not be neglected” (Pearson 1990: 110). Of course.

In 1963, the geophysicist, astronomer, and mathematical statistician Harold Jeffreys wrote the following:

Whether statisticians like it or not, their results are used to decide between hypotheses, and it is elementary that if p entails q, q does not necessarily entail p. We cannot get from “the data are unlikely given the hypothesis” to “the hypothesis is unlikely given the data” without some additional rule of thought. Those that reject inverse probability have to replace it by some circumlocution, which leaves it to the student to spot where the change of data has been slipped in[, in] the hope that it will not be noticed. (Jeffreys 1963: 409)

The Five Percenter longs to find a body of data “significant and consistent with” some hypothesis. The motive is by itself blameless. But Jeffreys noted that the sequence of the Five Percenter’s search procedure is backwards and paradoxical (Jeffreys 1963: 409). The Five Percenter is looking at the [...]

In the 1994 volume of the American Journal of Epidemiology, David A. Savitz, Kristi-Anne Tolo, and Charles Poole examined 246 articles published in the Journal around the years 1970, 1980, and 1990. The articles were divided into three categories: infectious disease epidemiology, cancer epidemiology, and cardiovascular disease epidemiology. Each category contained for each date a minimum of 25 articles. The main findings are presented in Figure 4, “Percent of articles published in the American Journal of Epidemiology classified as partially or completely reliant on statistical significance testing for the interpretation of the study results, by topic and time period” (Savitz et al. 1994: 1050). The findings are not surprising. The study shows that in 1990 some 60% to 70% of all cardiovascular and infectious disease epidemiologists relied exclusively on statistical significance as a criterion of epidemiological importance, as though fit were the same thing as importance. A larger share rely on the fallacy of the transposed conditional. The abuse was worse in 1990 than [...]

The cancer researchers were less enchanted with statistical significance than cardiological and infectious disease researchers were, but did not reach standards of common sense. Savitz, Tolo, and Poole found that after a 60% reliance on mere statistical significance in the early 1970s, the abuse of p-values by cancer researchers actually fell. We don’t know why. Maybe too many people had died. Still, 40% of all the cancer research articles in 1990 relied exclusively on Fisher’s [...]

In epidemiology, then, the “sizeless stare,” as we call it, of statistical significance is relatively recent, cancer research being an exception. In 1970 only about 20% of all articles on infectious disease epidemiology relied exclusively on tests of statistical significance. Confidence intervals and power calculations were of course absent. But epidemiology was not then an entirely statistical science. Only about 40% of all empirical articles in infectious disease epidemiology employed some kind of statistical test. But significance took hold, and by 1980 some 40% relied exclusively on the tests (compare our “Question 16” in economics, where in the 1980s it was about 70%). And by 1990, most subfields of epidemiology had like economics and psychology become predominately Fisherian.
Statistical significance came to mean “epidemiological significance.” Statistical insignificance came to mean “ignore the [...]”

Douglas G. Altman, a statistician and cancer researcher at the Medical Statistics Laboratory in London, has been watching the use of medical statistics, and especially the deployment of significance testing, for 20 years. In 1991 Altman published an article called “Statistics in Medical Journals: Developments in the 1980s.” The article appeared in Statistics in Medicine. Altman’s experience had been similar to ours in economics. At conferences and seminars and the like Altman’s colleagues were convinced that the abuse of t-testing had by the 1980s abated, and was practiced only by the less competent medical scientists. Any thoughtful reader of the journals knew that such claims were false. To bias the results in favor of the defenders of the status quo Altman examined the first 100 “original articles” published in the 1980s in the New England Journal of Medicine. These were new and full-length research articles based on never-before released or published data from clinical studies or other methods of observation. Altman’s sample design was meant to replicate for comparative purposes an earlier study by Emerson and Colditz (1983), who studied the matter [...]

The Findings

It is my impression that the trends noted by Felson et al. have continued throughout the 1980s. . . . The obsession with significant p values [...]
(1) Reporting of [statistically] significant results rather than those of most importance (especially in abstracts).
(2) The use of hypothesis tests when none is appropriate (such as for comparing two methods of measurements or two observers).
(3) The automatic equating of statistically significant with clinically important, and non-significant with non-existent.
(4) The designation of studies that do or do not “achieve” significance as “positive” or “negative” respectively, and the common associated phrase “failed to reach statistical significance”. . . .
A review [by other investigators <who>] of 142 articles in three general medical journals found that in almost all cases (1076/1092) researchers’ interpretations of the “quantitative” (that is, clinical) significance of their results agreed with statistical significance. Thus across all medical areas and sample size p rules, and p < 0.05 rules most. It is not surprising if some editors share these attitudes, as most will have passed through the same research phase of their careers and some are still active [...]

Altman was not surprised when he found in medicine, as we were not surprised in economics, that his colleagues were deluding themselves. “I noted in the first issue of Statistics in Medicine that most journals gave much more attention to the format of references in submitted articles than they gave to the statistical content,” Altman wrote. “This remains true” (Altman 1991: 1900). Editors are much exercised, he observed with gentle sarcasm, over whether to use “P, p, P, or p values” (1991: 1902)—but pay no heed to oomph. “It is impossibly idealistic,” Altman believed, “to hope that we can stop the misuse of statistics, but we can apply a tourniquet . . . by continuing to press journals to improve their ways” (1991: 1908).

Steven Goodman, in a meaty piece on the “p-value fallacy” published in the Annals of Internal Medicine, observed ruefully, “biological understanding and previous research play little formal role in the interpretation of quantitative results.” That is, Bayes’ Theorem is set aside, as is the total quality management of medical science, the seeing of results in their context of biological common sense. “This [narrowly Fisherian] statistical approach,” Goodman writes, “the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements” (Goodman 1992, [...]).

Altman, Savitz, Goodman, and company are not singletons. According to Altman, between 1966 and 1986 fully 150 articles were published criticizing the use of statistics in medical research (Altman 1991: 1897). The studies agreed that R. A. Fisher significance in medical science had become the nearly exclusive technique for making a quantitative decision and that statistical significance had become in the minds of medical writers equated increasingly, and erroneously, with [...]

As early as 1978 the situation was sufficiently dire that two contributors to the New England Journal of Medicine, Drummond Rennie and Kenneth J. Rothman, published op-ed pieces in the journal pages about the matter (Rennie 1978; Rothman 1978). Rennie, the deputy editor of the journal—and in 2006 the deputy editor of the Journal of the American Medical Association—was not critical of his colleagues’ practice. But Rothman, who was a young associate professor at Harvard, and the youngest member of the editorial board, blasted away. In “A Show of Confidence,” he made a crushing case for measuring clinical significance, not statistical significance. Citing the Freiman et al. (1978) article on “71 Negative Clinical Trials,” Rothman argued that the measurement and interpretation of size of effects, confidence intervals, and examination of power functions with respect to effect size (à la Freiman et al. by graphical demonstration) was the better way forward. Rothman—an epidemiologist and biostatistician with a life-long interest in the rhetoric of his fields—wanted secretly to ban the t-test altogether. Rennie and the other editors decided on a different solution. Original articles would be subjected to a pre-publication screening by a professional statistician. Rothman was at first hopeful, thinking statistical review would repair the Journal. The director of statistical reviews was well chosen—the late Frederick Mosteller
(1916–2006), the founder of Harvard’s Statistics Department and a giant of 20th-century data analysis. But Mosteller was only the director, not the worker. Rothman tells us that he as the inside critic and Mosteller as the outside director had not been able to do anything together to raise the standards (Mosteller to Ziliak and McCloskey, University of Chicago, 21 May 2005; KJ Rothman to Ziliak, 30 January 2006). The problem with pre-publication statistical review, of course, is that the articles go not to the Rothmans and Mostellers and Kruskals but out to Promising Young Jones in the outer office dazzled by his recently mastered 5% textbooks. An example nowadays is the “Statistical Analysis Plan” or, aptly acronymized, “SAP,” which lays down the minimum statistical criteria considered acceptable by the Food and Drug Administration.

Like Gosset, Jeffreys, and Zellner, Rothman doubted the philosophical grounding of p values (Rothman 1990: 334). As [...]

If P is small, that means that there have been unexpectedly large departures from prediction [under the null hypothesis]. But why should these be stated in terms of P? The latter gives the probability of departures, measured in a particular way, equal to or greater than the observed set, and the contribution from the actual value [of the test statistic] is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure. On the face of it the fact that such results have not occurred might more reasonably be taken as evidence for the law [or null hypothesis], not against it. The same applies to all the current significance tests based on P integrals. (Jeffreys 1961, quoted by Zellner 1984: 288; emphasis in original; editorial insertions [...])

Rothman complained in his editorial in the New England Journal that Fisherian “testing . . . is equivalent to funneling all interest into the precise location of one boundary of a confidence interval” (Rothman 1978: 1363). In 1986 the situation was the same: “Declarations of ‘significance’ or its absence can supplant the need for any real interpretation of data; the declarations can serve as a mechanical substitute for thought, promulgated by the inertia of training and common practice” [...]

Rothman then became assistant editor of the American Journal of Public Health. The chief editor of the American Journal of Public Health “seemed to be sympathetic” with Rothman’s views—Rothman recalls one time when the editor backed him up in a little feud with a well-placed statistician. Still, Rothman’s views hardly set journal policy, and it shows in the journal. Rothman finally found his chance when in 1990, after 15 years of quiet struggle, he started his own journal, Epidemiology. His editorial letter to potential authors was [...]

When writing for Epidemiology, you can . . . enhance your prospects if you omit tests of statistical significance. . . . In Epidemiology, we do not publish them at all. . . . We discourage the use of this type of thinking in the data analysis, such as in the use of stepwise regression. We also would like to see the interpretation of a study based not on statistical significance, or lack of it, for one or more study variables, but rather on careful quantitative consideration of the data in light of competing explanations for the findings. For example, we prefer a researcher to consider whether the magnitude of an estimated effect could be readily explained by uncontrolled confounding or selection biases, rather than simply to offer the uninspired interpretation that the estimated effect is “significant.”. . . Misleading signals occur when a trivial effect is found to be “significant,” as often happens in large studies, or when a strong relation is found “nonsignificant,” as often happens in small studies. (Rothman 1990: 334)

Rothman concluded the letter by offering advice on how to publish quantitatively, epidemiologically significant figures, such as odds ratios on specific medical risks, bounded by confidence intervals.

Now with his own journal, Rothman was going to get it right. In January 1990 he and the associate editors Janet Lang and Cristina Cann published another luminous editorial, “That Confounded P-Value” (Lang et al. 1998). They “reluctantly” (p. 8) agreed to publish p-values when “no other” alternative was at hand. But they strongly suggested that authors of submitted manuscripts illustrate “size of effect” (p. 7) in “figures”—in plots of effect size lines against well-measured [...]

Rothman and his associates were and are not alone, even in epidemiology. The statistician James O. Berger (2003) has recently shown how epidemiologists and other sizeless scientists go wrong with p-values. Use of Berger’s applet, a public-access program, shows Rothman’s skepticism to be empirically sound (http://www.stat.duke.edu/~berger). The program simulates a series of tests, recording how often a null hypothesis is “true” in a range of different p-values. Berger cites a 2001 study by the epidemiologists Sterne and Davey Smith, which found that “roughly 90% of the null hypotheses in the epidemiology literature are initially true.” Berger reports that even when p “is near 0.05, at least 72%—and typically over 90%” of the null hypotheses will be true (Sterne and Davey Smith 2001; Berger 2003: 4). Berger agrees with Rothman and the authors here that on the contrary “true” is a matter of judgment—a judgment of epidemiological, not mere statistical, significance. It is about the quality of the water from the [...]

Rothman’s letter itself elicited no response. This is our experience, too: Many of the Fisherians, to put it bluntly, seem to be less than courageous in defending their views. Hardly ever have we seen or heard an attempt to provide a coherent—or indeed any—response to the case against null-hypothesis testing for “significance.” The only published response that
Rothman can recollect in epidemiology came from J. L. Fleiss, a prominent biostatistician, in the American Journal of Public Health published in 1986. But Fleiss merely complained that “an insidious message is being sent to researchers in epidemiology that tests of significance are invalid and have no place in their research” (Fleiss 1986: 559). He gave no actual arguments for giving Fisherian practices a place in research. This is similar to our experience. Kevin Hoover and Mark Siegler offered in 2005 (published 2008, with our detailed reply) the only written response to our complaints in economics that we have seen. Courageous though it was for them to venture out in defense of the Fisherian conventions, a sterling exception to the spinelessness of their colleagues, they could offer no actual arguments (though they did catch us in a most embarrassing failure to take all the data from the American Economic Review in the 1990s). Hoover and Siegler merely wax wroth for many [...]

Even the rare courageous Fisherians, in other words, do not deign to make a case for their procedures. They merely complain that the procedures are being criticized. “Other defenses of [null hypothesis significance testing],” Fidler et al. observed, “are hard to find” (Fidler et al. 2004: 121). The Fisherians, being comfortably in control, appear inclined to leave things as they are, sans argument. One can understand. If you don’t have any arguments for an intellectual habit of a lifetime, [...]

Rothman’s campaign did not succeed. Fidler et al. (2004) found, as we and others have found in economics and psychology and in other fields of medicine, that epidemiology is getting worse, despite Rothman’s letter. Over 88% of more than 700 articles they reviewed in Epidemiology (between 1990 and 2000) and the American Journal of Public Health (between 1982 and 2000) failed, they find, to distinguish and interpret substantive significance. In the American Journal of Public Health, some 90% confused a statistically significant result with an epidemiologically significant result, and equated statistical insignificance with substantive unimportance. Epidemiology journals, in other words, performed worse than the New England Journal of Medicine, Rothman’s training-ground [...]

Fidler and her coauthors (2004) observe that for decades “advocates of statistical reform in psychology have recommended confidence intervals as an alternative (or at least a supplement) to p values.” The American Psychological Association Publication Manual called them in 2001 “the best reporting strategy,” though few seem to be paying attention (APA Manual 2001: 22 in Fidler et al. 2004: 119; Fidler 2002). Since the mid-1980s, confidence intervals have been widely reported in medical journals. Unhappily, requiring the calculation of confidence intervals does not guarantee that effect sizes will be interpreted more carefully, or indeed at all. Savitz et al. find that even though 70% of articles in the American Journal of Epidemiology report confidence intervals “inferences are made regarding statistical significance tests, often based on the location of the null value with[out] respect to the bounds of the confidence interval” (1994: 1051). In other words, say Fidler and her coauthors, confidence intervals “were simply used to do [the null hypothesis testing ritual]” (Fidler et al. [...]).

Fidler and her coauthors (2004) attempted as we have to assemble outside allies. They “sought lessons for psychology from medicine’s experience with statistical reform by investigating two attempts by Kenneth Rothman to change statistical practices.” They examined 594 American Journal of Public Health articles published between 1982 and 2000 and 110 Epidemiology articles published in 1990 and 2000:

Rothman’s editorial instruction to report confidence intervals and not p values was largely effective: In AJPH, sole reliance on p values dropped from 63% to 5%, and confidence interval reporting rose from 10% to 54%; Epidemiology showed even stronger compliance. However, compliance was superficial: Very few authors referred to confidence intervals when discussing results. The results of our survey support what other research has indicated: Editorial policy alone is not a sufficient mechanism for statistical reform. (Fidler et al. 2004: [...])

Rothman himself has said of his attempt to reduce p-value reporting in his Epidemiology that “my revise-and-resubmit letters . . . were not a covert attempt to engineer a new policy, but simply my attempt to do my job as I understood it. Just as I corrected grammatical errors, I corrected what I saw as conceptual errors in describing data” (quoted in Fidler et al. [...]).

Fidler’s team studied the American Journal of Public Health and Epidemiology before, during, and after Rothman’s editorial stints; before and after the International Committee of Medical Journal Editors’ creation of statistical regulations encouraging the analysis of effect size; and before and after the changes to the AJPH’s “Instructions to Authors” encouraging the use of confidence intervals. Rothman as an assistant editor, of course, did not make policy at the journal. He made his own preferences known to authors, but ultimately he “carried out the editor’s policy,” which only occasionally overlapped with Rothman’s ideal (Rothman to Ziliak, email communication, [...]).

Fidler et al. counted a statistical practice “present,” such as what we call “asterisk biometrics,” the ranking of coefficients according to the size of the p-value, if an article contained at least one instance of it. Their full questionnaire is similar to ours in economics (Ziliak and McCloskey 2008: 62–92), focusing on substantive as against statistical significance testing. Did “significant” mean “epidemiologically important” or “statistically significant”? Practice was recorded as ambiguous if the author or authors did not preface “significant”
with “statistically,” follow the statement of significance directly with a p-value or test statistic, or otherwise differentiate between statistical and substantive interpretations. “Explicit power” in their checklist means “did a power calculation.” “Implicit power” means some mention of a relationship between sample size, effect size, and statistical significance was made—for example, a reference to small sample size as perhaps explaining failure to find statistical significance. The results, alas, “Of the 594 AJPH articles, 273 (46%) reported NHST. In almost two thirds of the cases ‘significant’ was used ambiguously. Only 3% calculated power and 15% reported ‘implied power.’ . . . An overwhelming 82% of NHST articles had neither an explicit nor implicit reference to statistical power, even though all reported at least one non-significant
Fifty-four percent of American Journal of Public Health articles reported confidence intervals; 86% did in Epidemiology. But “Table 2 shows that fewer than 12% of AJPH articles with confidence intervals interpreted them and that, despite fully 86% of articles in Epidemiology reporting confidence intervals, interpretation was just as rare in that journal” (Fidler et al. 2004: 122). The situation, they find, did not improve with the years. The authors usually did not refer in their texts to the width of their confidence intervals, and did not discuss what is epidemiologically or biologically or socially, or clinically significant in the size of the effect. In other words, during the past two decades more than 600 of some 700 articles published in the leading journals of public health and epidemiology showed no concern with epidemiological significance. Thus too economics, sociology, population biology, and other
When in 2000 Rothman left his post as editor of Epidemiology, confidence-interval reporting remained high—it had become common in medical journals. But in the American Journal of Public Health reporting of unqualified p “again became common.” Rothman’s success at Epidemiology appears to have been longer lasting. Still, interpretation in other journals of epidemiology is rare. “In both journals [Fidler et al. should add ‘but not in Epidemiology’] . . . when confidence intervals were reported, they were rarely used to interpret results or comment on [substantive] precision. This rather ominous finding holds even for the most recent years we surveyed” (Fidler et al. 2004: 123). Fidler and her team confirm in thousands of tests what Savitz et al. (1994) found in the American Journal of Epidemiology in tens of thousands of tests and what Rossi found in 39,863 tests in psychology and speech and education and sociology, and management (Rossi 1990: 648).
The historian of medicine Richard Shyrock argued in an early paper that instruments such as the stethoscope and the X-ray machine saved some parts of medicine from the Fisherian pitfall. If one can see or hear the problem, one does not need to rely on correlations (Shyrock 1961: 228). Since 1961, though, doctors have lost many of their skills of physical assessment, even with the stethoscope (and certainly with their hands), and have come to rely on a medical literature deeply infected with Fisherianism. Shyrock’s piece appeared in a special issue of Isis on the history of quantification in the sciences, mostly celebrating the statistical side of it. Puzzlingly, none of the contributors to the symposium mentioned the Gosset-Fisher-Neyman-Pearson-Jeffreys-Deming-Savage complex. Fisher-significance, the omission suggests, was not to be put on trial. The inference machines remained broken.
By 1988 the International Committee of Medical Journal Editors had been sufficiently pressured by the Rothmans and the Altmans to revise their “uniform requirements for manuscripts submitted to biomedical journals.” “When possible,” the Committee wrote, “quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid sole reliance on statistical hypothesis testing, such as the use of p values, which fail to convey important quantitative information” (ICMJE 1988: 260). The formulation is not ideal. The “error” in question is tacitly understood to be sampling error alone, when after all a good deal of error does not arise from the smallness of samples. “Avoid sole reliance” on the significance error should be “don’t commit” the significance error. The “important quantitative information” is effect size, which should have been mentioned explicitly. Still, it was a good first step, and in 1988 among the sizeless sciences was amazing.
The Requirements—on which at a formative stage Rothman among others had contributed an opinion—were widely published. They appeared for instance in the Annals of Internal Medicine—where later the Vioxx study was published—and in the British Medical Journal. More than 300 medical and biomedical journals, including the American Journal of Public Health, notified the International Committee of their willingness to comply with the manuscript guidelines (Fidler et al. 2004: 120). But the Requirements have not helped.
The essence of the problem of reform—and the proof that we need to change academic and institutional incentives, including criteria for winning grants—is well illustrated in a study of “temptation to use drugs” published in the Journal of Drug Issues. The study was financed by the Centers for Disease Control. It was authored by two professors of public health at Emory University (one of them was an Associate Dean for Research), and a third professor, a medical sociologist at Georgia State University. The study was conducted in Atlanta between August 1997 and August 2000. Its subjects were African-American women—mothers and their daughters—living in low-income neighborhoods of Atlanta (Klein et al. 2003: 167). The dependent variable was “frequency-of-[drug] use and times-per-day” multiplied for each drug type and summed by month. In the 125 women studied the value of the dependent variable ranged from zero
to 910, that is, from zero to an appalling 30 drug doses a day.
Statistical Significance Decides Everything
Initially, each of the temptations-to-use drugs variables was entered into simple regression equations, to determine if they were statistically significant predictors of the outcome measure. Next, those found to be related to amount of drug use reported were entered simultaneously into a stepwise multiple regression equation. . . . Next, the bivariate relationships between the other predictor variables listed earlier were examined one by one, using Student’s t tests whenever the independent variable was dichotomous. . . . Items that were found to be marginally—or statistically—significant predictors in these bivariate analyses were selected for entry into the multivariate equation.
The authors do at least report mean values of the temptations to use drugs—a first step in determining substantive significance. For example, they report that women were “least tempted to use drugs when they were: talking and relaxing (74.0%), experiencing withdrawal symptoms (73.3%), [and] waking up and facing a difficult day (70.7%).” And they would be tempted “quite a bit” or “a lot” when they were “with a partner or close friend who was using drugs (38.5%)” or when “seeing another person using and enjoying drugs (36.1%)” (Klein et al. 2003: 170). Here is how they presented their
When examined in bivariate analyses, 15 of the 16 temptations-to-use drugs items were found to be associated [that is, the authors assert, statistically significantly related with; not substantively significantly related] with actual drug use. These were: while with friends at a party (p < .001), while talking and relaxing (p < .001), while with a partner or close friend who is using drugs (p < .001), while hanging around the neighborhood (p < .001), when happy and celebrating (p < .001), when seeing someone using and enjoying drugs (p < .05), when waking up and facing a tough day (p < .001), when extremely anxious and stressed (p < .001), when bored (p < .001), when frustrated because things are not going one’s way (p < .001), when there are arguments in one’s family (p < .05), when in a place where everyone is using drugs (p < .001), when one lets down concerns about one’s health (p < .05), when really missing the drug habit and everything that goes with it (p < .010), and while experiencing withdrawal symptoms (p < .01). (Klein et al. 2003: 171–172)
“The only item that was not associated with the amount of drugs women used,” the article concluded, “was ‘when one realized that stopping drugs was extremely difficult’ ” (Klein et al. 2003: 172). This is surely a joke, some will think, perhaps a belated retaliation for the 1990s Social Text scandal, in which a scientist posed as a postmodern theorist in order to expose its intellectual pretense. It’s not. It’s normal science in biology, medicine, psychiatry, economics, psychology, sociology, education, and many other fields. But what is the scientific or policy oomph of such a temptations-to-use-drugs study?
In September 1978 Jennie A. Freiman, Thomas C. Chalmers, Harry Smith, Jr., and Roy R. Kuebler, doctors and statistical researchers at Mt. Sinai in New York, published in the New England Journal of Medicine a study entitled “The Importance of Beta, the Type II Error and Sample Size in the Design and Interpretation of the Randomized Control Trial.”
Seventy-one “negative” randomized control trials were re-examined to determine if the investigators had studied large enough samples to give a high probability (>0.90) of detecting a 25 per cent and 50 per cent therapeutic improvement in the response. Sixty-seven of the trials had a greater than 10 per cent risk of missing a true 25 per cent therapeutic improvement, and with the same risk, 50 of the trials could have missed a 50 per cent improvement. Estimates of 90 per cent confidence intervals for the true improvement in each trial showed that in 57 of these “negative” trials, a potential 25 per cent improvement was possible, and 34 of the trials showed a potential 50 per cent improvement. Many of the therapies labeled as “no different from control” in trials using inadequate samples have not received a fair test. Concern for the probability of missing an important therapeutic improvement because of small sample sizes deserves more attention in the planning of clinical trials. (Freiman et al. 1978: 690; italics supplied)
Freiman, who is a specialist in obstetrics and gynecology, and her colleagues, in other words, had reanalyzed 71 articles in medical journals. Heart and cancer-related treatments dominated the clinical trials under review. Each of the 71 articles concluded that the “treatment”—for example, “chemotherapy” or “an aspirin pill”—performed no better in a clinical sense than did the “control” of nontreatment or a placebo. That is,
Freiman et al. (1978) found that if the authors of the original studies had considered the power of their tests—the probability of rejecting the null hypothesis “[treatment] no different from control” as the treatment effect moves in the direction of “vast improvement”—and in conjunction with effect size, the experiments would not have ended “negatively.” That is, the clinicians conducting the original studies would have found that indeed the treatment therapy was capable of producing
Specifically, Freiman et al. (1978) found that if fully 50 of the 71 trials had paid attention to power and effect size and not merely to a one-sided, qualitative, yes/no interpretation of “significance,” they would have reversed their conclusions. Astonishingly, they would have found up to “50 per cent improvement” in “therapeutic effect.” The Fisherian tests of significance, the only tests employed by the original authors of the 71 studies, literally could not see the beneficial effects of the therapies under study, though staring at them.
The precise standard of improvement—the minimum standard of oomph the authors set—is a “reduction in mortality
from the control [group] mortality rate,” a baseline rate of 29.7 per cent (Freiman et al. 1978: 691). They realize, it is not a very strict standard of medical oomph. They are bending over backwards not to find their colleagues mistaken. Like Gosset, they want to give their Fisherian colleagues the benefit of the doubt.
Yet, they found that 70% of the alleged “negative” trials were stopped, missing an opportunity to reduce the mortality of their patients by up to 50%. Of the patients who were prescribed sugar pills or otherwise dismissed, in other words, about 30% died unnecessarily. In one typical article the authors in fact missed at α = 0.05 a 25% reduction in mortality with probability of about 0.77 and, at the same level of Type-I error, a 50% reduction with probability about 0.42 (Freiman et al.
Each of the 71 experiments was shut down on the belief that a 30% death rate was equally likely with the sugar pill (or whatever the control was) and with the treatment therapy, spurning opportunities to save lives. The article shows that in the original experiments as few as 15% of the patients receiving the treatment therapy would have died had the experiment continued—half as many as actually died.
We agree with Rothman that the article seems in the end to lose contact with the effect size, at times advising that power be treated “dichotomously” and rigidly irrespective of effect size (Rothman and Ziliak, personal interview, 30 January 2006). “Important information can be found on the edges,” as Rothman put it. But overall, Rothman and we agree that it’s a crushing piece. The oomph-laden content of their work is exemplary. Freiman and her colleagues note that the experiments and 71 oomph-less, premature truncations were conducted by leading medical scientists. Such premature results were published in Lancet, the British Medical Journal, the New England Journal of Medicine, the Journal of the American Medical Association, and other elite journals. Effective treatments for cardiovascular, cancer, and gastrointestinal patients were abandoned because they did not attain statistical significance at the 5% or
In 1995 the authors of 10 independent and randomized clinical trials involving thousands of patients in treatment and control groups had come to an agreement on an effect size. Consensus on a mere direction of effect—up or down, positive or negative—is rare enough in science. After four centuries of public assistance for the poor in the United States and Western Europe, for example, economists do not speak with one voice on the direction of effect on labor supply exerted by tax-financed income subsidies (Ziliak and Hannon 2006). Medicine is no different. Disagreement on the direction of effect—let alone the size of effect—is more rule than exception.
So the Prostate Cancer Trialists’ Collaborative Group was understandably eager to publicize the agreement. Each of the 10 studies showed that a certain drug “flutamide”—for the treatment of prostate cancer—could increase the likelihood of patient survival by an average of 12% (the 95% confidence interval in the pooled data put an upper bound on flutamide-enhanced survival at about 20% [Rothman et al. 1999]). Odds of 5 in 100 are not the best news to deliver to a prostate patient. But if castration followed by death is the next best alternative, a noninvasive 12% to 20% increase in survival sounds
But in 1998 the results of still another, eleventh trial were published in the New England Journal of Medicine (Eisenberger et al. 1998: 1036–1042). The authors of the new study found a similar size effect. But when the two-sided p-value for their odds ratio came in at .14, they dismissed the efficacious drug, concluding “no clinically meaningful improvement” (pp. 1036, 1039). Kenneth Rothman, Eric Johnson, and David Sugano (1999) examined the individual and pooled results of the 11 separate studies, including the study conducted by Eisenberger
One might suspect that [Eisenberger et al.’s] findings were at odds with the results from the previous ten trials, but that is not so. From 697 patients randomized to flutamide and 685 randomized to placebo, Eisenberger and colleagues found an OR of 0·87 (95% CI 0·70–1·10), a value nearly identical to that from the ten previous studies. Eisenberger’s interpretation that flutamide is ineffective was based on absence of statistical significance. (Rothman et al. 1999: 1184)
Rothman and his coauthors depict the flutamide effect graphically in a manner consistent with a Gosset-Jeffreys-Deming approach. That is, they pool the data of the separate studies and plot the flutamide effect (measured by an odds ratio, or the negative of the survival probability in a hazard function) against a p-value function. With the graphical approach, Rothman and his coauthors are able to show pictorially how the p-values vary with increasingly positive and increasingly negative large effects of flutamide on patient survival. And what they show is substantively significant:
Eisenberger’s new data only reinforce the findings from the earlier studies that flutamide provides a small clinical benefit. Adding the latest data makes the p value function narrower, which is to say that the overall estimate is now more precise, and points even more clearly to a benefit of about 12% in the odds of surviving for patients
Rothman et al. (1999) conclude, “the real lesson” from the latest study is “that one should eschew statistical significance testing and focus on the quantitative measurement of effects.”
That sounds right. Statistical significance is spoiling biological science, is undermining medical treatment, and is killing people. It is leaving a great deal, shall we say,
1. This paper is a revision of chapters 14–16 in Ziliak and McCloskey 2008.
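
The p-value function that Rothman and his coauthors plot for flutamide can be roughly sketched from the figures quoted in the text alone. What follows is a minimal illustration, not their method: it assumes a normal approximation on the log-odds scale and backs the standard error out of the rounded interval reported for Eisenberger's trial (OR 0.87, 95% CI 0.70 to 1.10). The function names are hypothetical, and the resulting numbers need not reproduce any published p-value.

```python
import math

# Figures quoted in the text from Rothman et al. (1999) for the eleventh trial.
OR_HAT, CI_LOW, CI_HIGH = 0.87, 0.70, 1.10


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def se_from_ci(lower, upper, z_crit=1.96):
    """Assumption: the 95% interval is symmetric on the log-odds scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z_crit)


def two_sided_p(or_hat, se_log_or, or_null):
    """Two-sided p-value for the hypothesis that the true odds ratio is or_null."""
    z = (math.log(or_hat) - math.log(or_null)) / se_log_or
    return 2.0 * (1.0 - normal_cdf(abs(z)))


SE_LOG_OR = se_from_ci(CI_LOW, CI_HIGH)


def p_value_function(hypothesized_ors):
    """Evaluate the p-value at each hypothesized odds ratio, not just at 1.00."""
    return [(h, two_sided_p(OR_HAT, SE_LOG_OR, h)) for h in hypothesized_ors]


if __name__ == "__main__":
    for h, p in p_value_function([0.60, 0.70, 0.80, 0.87, 1.00, 1.10, 1.20]):
        print(f"hypothesized OR = {h:.2f}   two-sided p = {p:.3f}")
```

Read the output across the whole range of hypothesized odds ratios rather than at the single null of 1.00. The curve peaks at the estimate, 0.87, and the clinically relevant question is which sizes of benefit remain compatible with the data.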
References
Altman DG (1991) Statistics in medical journals: Developments in the 1980s. Statistics in Medicine 10: 1897–1913.
American Psychological Association (APA) 1952 to 2001 [revisions] Publication Manual of the American Psychological Association. Washington,
Berger JO (2003) Could Fisher, Jeffreys, and Neyman have agreed on testing?
Cohen J (1994) The earth is round (p < 0.05). American Psychologist 49:
David FN, ed (1966) Research Papers in Statistics: Festschrift for J. Neyman.
Eisenberger MA, Blumenstein BA, Crawford ED, Miller G, McLeod DG, Loehrer PJ, Wilding G, Sears K, Culkin DJ, Thompson IM, Bueschen AJ, Lowe BA (1998) Bilateral orchiectomy with or without flutamide for metastatic prostate cancer. New England Journal of Medicine 339:
Fidler F (2002) The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psycho-
Fidler F, Thomason N, Cumming G, Finch S, Leeman J (2004) Editors can lead researchers to confidence intervals but they can’t make them think: Statistical reform lessons from medicine. Psychological Science 15: 119–
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A 222: 309–368.
Fisher RA (1926) Bayes’ Theorem. Eugenics Review 18: 32–33.
Fisher RA ([1956] 1959) Statistical Methods and Scientific Inference, 2nd ed.
Fleiss JL (1986) Significance tests do have a role in epidemiological research: Reaction to AA Walker. American Journal of Public Health 76: 559–
Freiman JA, Chalmers T, Smith H, Kuebler RR (1978) The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: Survey of 71 negative trials. New England
Goodman S (1999a) Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine 130: 995–1004.
Hoover K, Siegler M (2008) Sound and fury: McCloskey and significance testing in economics. Journal of Economic Methodology 15: 1–
International Committee of Medical Journal Editors (ICMJE) (1988) Uniform requirements for . . . statisticians and biomedical journal editors. Statistics
Jeffreys H (1963) Review of L. J. Savage, et al., The Foundations of Statistical Inference (Methuen, London and Wiley, New York, 1962). Technometrics
Klein H, Elifson KW, Sterk CE (2003) Perceived temptation to use drugs and actual drug use among women. Journal of Drug Issues 33: 161–192.
Lang JM, Rothman KJ, Cann CI (1998) That confounded p-value. Epidemi-
Pearson ES (1990) [posthumously published by Plackett RL, Barnard GA, eds] ‘Student’: A Statistical Biography of William Sealy Gosset. Oxford:
Rennie D (1978) Vive la Difference (p < 0.05). New England Journal of
Rossi J (1990) Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology 58:
Rothman KJ (1978) A show of confidence. New England Journal of Medicine
Rothman KJ (1986) Modern Epidemiology. New York: Little, Brown.
Rothman KJ (1990) Writing for epidemiology. Epidemiology 9: 333–337.
Rothman KJ, Johnson ES, Sugano DS (1999) Is flutamide effective in patients with bilateral orchiectomy? Lancet 353: 1184.
Savitz DA, Tolo K, Poole C (1994) Statistical significance testing in the American Journal of Epidemiology, 1970–1990. American Journal of Epidemi-
Shyrock RH (1961) The history of quantification in medical science. Isis 52:
Sterne JAC, Davey Smith G (2001) Sifting the evidence—What’s wrong with significance tests? British Medical Journal 322: 226–231.
Zabell S (1989) R. A. Fisher on the history of inverse probability. Statistical
Zellner A (1984) Basic Issues in Econometrics. Chicago: University of
Ziliak ST, Hannon J (2006) Public assistance: Colonial times to the 1920s. In Historical Statistics of the United States. (Carter SB, Gartner SS, Haines MR, Olmstead AL, Sutch R, Wright G, eds). New York: Cambridge Uni-
Ziliak ST, McCloskey DN (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: