The Unreasonable Ineffectiveness of Fisherian “Tests” in Biology, and Especially in Medicine

Deirdre N. McCloskey and Stephen T. Ziliak

Abstract
Biometrics has done damage with levels of R or p or Student’s t. The damage widened with Ronald A. Fisher’s victory in the 1920s and 1930s in devising mechanical methods of “testing,” against methods of common sense and scientific impact, “oomph.” The scale along which one would measure oomph is particularly clear in biomedical sciences: life or death. Cardiovascular epidemiology, to take one example, combines with gusto the “fallacy of the transposed conditional” and what we call the “sizeless stare” of statistical significance. Some medical editors have battled against the 5% philosophy, as did, for example, Kenneth Rothman, the founder of Epidemiology. And decades ago a sensible few in education, ecology, and sociology initiated a “significance test controversy.” But grantors, journal referees, and tenure committees in the statistical sciences had faith that probability spaces can substitute for scientific judgment. A finding of p < .05 is deemed to be “better” for variable X than p < .11 for variable Y. It is not. It depends on the oomph of X and Y—the effect size, size judged in the light of how much it matters for scientific or clinical purposes. In 1995 a Cancer Trialists’ Collaborative Group, for example, came to a rare consensus on effect size: 10 different studies had agreed that a certain drug for treating prostate cancer can increase patient survival by 12%. An 11th study published in the New England Journal in 1998 dismissed the drug. The dismissal was based on a t-test, not on what William Gosset (the “Student” of Student’s t) had called, against Ronald A. Fisher’s machinery, “real” error.¹

Keywords
Bayesian analysis in medicine, biometrics, Fisher, Gosset, Jeffreys, level of p, levels of t, Rothman, statistical power in medical research, statistical significance, tests of significance
October 2, 2008; accepted September 6, 2009
Biological Theory 4(1) 2009, 44–53. © 2009 Konrad Lorenz Institute for Evolution and Cognition Research

One wishes to know the probability that a biological or medical hypothesis, H, is true in view of the sadly incomplete facts of the world. It is a problem of inference, inferring the likelihood of a result from the data. If the symptoms of cholera start in the digestive system, then ingestion of something, perhaps foul water, is a probable cause. If cases of cholera in London in 1854 cluster around particular public wells, then bad water is a probable cause.

But, the statistical tests used in many sciences (though not much in chemistry or physics) do nothing to aid such judgments. The tests that were regularized or invented in the 1920s by the great statistician and geneticist Ronald A. Fisher (1890–1962) measure the probability that the facts you are examining will occur assuming that the hypothesis is true. Our point is that by itself, unless in a decision-theoretic context in which the other relevant probabilities and their substantive importance are calculated, such a test is mistaken. The mistake here is known in statistical logic as “the fallacy of the transposed conditional.” If cholera is caused not by polluted drinking water but by bad air, then economically poor areas with rotting garbage and open sewers will have large amounts of cholera. They do. So, cholera is caused by bad air. If cholera is caused by person-to-person contagion, then cholera cases will often be neighbors. They are. So, cholera is caused by person-to-person contagion.

If the rebel Chinese general Li Zicheng was in the summer of 1645 attacked by angry peasants from whom he was stealing food, he will be dead. He is dead. Therefore, says the usual procedure of significance testing, he was attacked by peasants. If the biological hypothesis, H, is true, then observations O will be observed with high statistical significance. O is observed. Therefore, H is true. But, of course, being dead is very weak evidence that Li Zicheng was attacked by peasants, considering that by some accounts he committed suicide—and after all there are many ways to die. Statistically speaking, the power of the test of the hypothesis that Li was so attacked is undefined. To be sure, being dead is “consistent with” the hypothesis that Li was attacked by peasants, as the neo-positivist rhetoric of the Fisherian argument has it. But so what? A myriad of other hypotheses, very different from the alleged cause of the general’s death, such as committing suicide or catching pneumonia or breaking his neck in a fall from his horse, or dying from heartbreak after losing his campaign against the Manchus, are omitted from Fisherian procedures in the statistics-using sciences, though “consistent with” the fact of his being dead. The Fisherian procedure, at any rate when it proceeds (as it almost always does) without a loss function and a full discussion of Type-II error, neither falsifies nor confirms.

The psychologist and statistician, the late Jacob Cohen, made our point, a very old one, in his aptly entitled article, “The Earth is Round (p < .05).” “If a person is an American,” Cohen writes, in a parody of the Fisherian logic, “then he is probably not a member of Congress. This person is a member of Congress. Therefore, he is probably not an American” (Cohen 1994: 998). Cohen is pointing out that the illogic of being probably-not-an-American is formally exactly the same as the Fisherian test of significance. And it is mistaken. The structure of the logic is: hypothesize that P(O | H0) is low; observe O in the data; conclude therefore that P(H0 | O)—the transposed conditional of the original hypothesis—is low. The argument appears at least implicitly in article after article in scientific journals, and explicitly in most statistics textbooks.

Cohen applied the logic to an important topic in psychiatry, the misdiagnosis of schizophrenia. In the United States, schizophrenia incidence in adults is about 2%. Like a general attacked by peasants in 1645, it is rare. Let H0 = the person is normal; H1 = the person is schizophrenic, and O = the test result on the person in question is positive for schizophrenia. A proposed screening test is estimated to have at least 95% accuracy in making the positive diagnosis (discovering schizophrenia) and about 97% accuracy in declaring a truly normal case “normal.” Formally stated, P(normal | H0) is approximately 0.97, and P(schizophrenic | H1) > 0.95.

With a positive test for schizophrenia at hand, given the more than 95% assumed accuracy of the test, P(schizophrenic | H0) is less than 5%—statistically significant, that is, at p = 0.05. In the face of such evidence, a person in the Fisherian mode would reject the hypothesis of “normal” and conclude that the person is schizophrenic. Then he might proceed to do all sorts of good and bad things to the “patient.”

But the probability of the hypothesis, given the data, is not what has been tested. The probability that the person is normal, given a positive test for schizophrenia, is in truth quite strong—about 60%—not, as Fisherians believe, less than 3%,

$$P(H_0 \mid O) = \frac{P(H_0)\,P(\text{test wrong} \mid H_0)}{P(H_0)\,P(\text{test wrong} \mid H_0) + P(H_1)\,P(\text{test right} \mid H_1)} = \frac{(.98)(.03)}{(.98)(.03) + (.02)(.95)} = .607,$$

a humanly important difference from p = .03. The conditional probability of a case being “normal” though testing positively as schizophrenic is, Cohen points out, “not small—of the 50 cases testing as schizophrenic [out of an imagined population of 1000 people tested], 30 are false positives, actually normal, [...]”

The example shows how confused—and humanly and socially damaging—a conclusion from a Fisherian 5% science can be. One of us has a good friend who as a child in the psychiatry-spooked 1950s was diagnosed as schizophrenic.
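Cohen’s arithmetic can be checked with a few lines of code. What follows is only a minimal sketch of the Bayes’ Theorem calculation, using the figures given in the text (2% incidence, 95% accuracy in detecting schizophrenia, 97% accuracy in declaring a normal case normal). The function name and parameter names are ours, chosen for illustration, and are not Cohen’s.

```python
def posterior_normal_given_positive(prior_ill, sensitivity, specificity):
    """P(H0 | O): probability the person is normal given a positive test.

    prior_ill    -- P(H1), the incidence of the condition
    sensitivity  -- P(test right | H1), positive result for a true case
    specificity  -- P(normal result | H0), so 1 - specificity is
                    P(test wrong | H0), the false-positive rate
    """
    prior_normal = 1.0 - prior_ill
    false_positive = prior_normal * (1.0 - specificity)  # P(H0) * P(test wrong | H0)
    true_positive = prior_ill * sensitivity              # P(H1) * P(test right | H1)
    return false_positive / (false_positive + true_positive)


# Figures from the text: incidence 2%, 95% and 97% accuracy.
p = posterior_normal_given_positive(prior_ill=0.02, sensitivity=0.95, specificity=0.97)
print(round(p, 3))  # 0.607
```

The same function shows how strongly the answer depends on the prior: raising the assumed incidence lowers the share of positives who are in fact normal, which is exactly the information the transposed conditional throws away.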
The friend has shown since then no symptom of the disease. But the erroneous diagnosis—an automatic result of the fallacy of the transposed conditional—has kept him in a state of dull terror ever since. Imagine in other arenas, with similarly realistically low priors, the damage done by the transposed conditional—in scientific work or diet pills or social welfare policy or commercial advertising or the market in foreign exchange. Once one considers the concrete implications of such a large diagnostic error, such as believing that 3% of adults tested for schizophrenia are not-schizophrenic when the truth is that 60% of them are not-schizophrenic, and realizes that, after all, this magnitude of diagnostic error is governing NASA and the departments of cardiovascular disease and breast cancer and HIV health policy, one should perhaps worry.

Part of the problem historically was another campaign of Fisher’s, following the elder Pearson, Karl: an attempt to kill off Bayes’ Theorem. By contrast, the inventor in 1908 of the t-test for small samples, the Guinness brewer and theoretical statistician William Sealy Gosset, was a lifelong Bayesian. He defended Bayesian methods against all comers—Karl Pearson, Fisher, Karl’s son Egon Pearson, Jerzy Neyman (e.g., Gosset 1915, 1922 cited in Pearson 1990: 26–27). Gosset in fact used Bayes’ Theorem in his revolutionary papers of 1908, and crucially so in “The Probable Error of a Correlation Coefficient.” In 1915 he wrote to the elder Pearson: “If I didn’t fear to waste your time I’d fight you on the a priori probability and give you choice of weapons! But I don’t think the move is with me; I put my case on paper last time I wrote and doubt I’ve much to add to it” (September 1). Gosset was courageous, but in all his fights mild and self-deprecating, including for Bayes’ methods. In the warrior culture of hardboiled-dom in the 1910s and 1920s (the Great War mattered) he was not forceful enough.

Fisher was a great deal more forceful, and wholly intolerant of “inverse probability” (Fisher 1922, 1926, 1956; cf. Zabell 1989). In Fisher’s campaigns for maximum likelihood and his own notion of “fiducial probability” (one of the few campaigns of Fisher’s that failed), he tried to kill off prior and posterior probability, and—at least with the mass of research workers as against the few high brows—he succeeded. Egon Pearson and Jerzy Neyman were at first persuaded by Fisher to turn from Bayes’ Theorem (Pearson 1966: 9, in David 1966). But Pearson later in life, after Fisher died, reverted to his original position: “Today in many circles,” he said, “the current vogue is a neo-Bayesian one, which is of value because it calls attention to the fact that, in decision making, prior information must not be neglected” (Pearson 1990: 110). Of course.

In 1963, the geophysicist, astronomer, and mathematical statistician Harold Jeffreys wrote the following:

Whether statisticians like it or not, their results are used to decide between hypotheses, and it is elementary that if p entails q, q does not necessarily entail p. We cannot get from “the data are unlikely given the hypothesis” to “the hypothesis is unlikely given the data” without some additional rule of thought. Those that reject inverse probability have to replace it by some circumlocution, which leaves it to the student to spot where the change of data has been slipped in[, in] the hope that it will not be noticed. (Jeffreys 1963: 409)

The Five Percenter longs to find a body of data “significant and consistent with” some hypothesis. The motive is by itself blameless. But Jeffreys noted that the sequence of the Five Percenter’s search procedure is backwards and paradoxical (Jeffreys 1963: 409). The Five Percenter is looking at the [...]

In the 1994 volume of the American Journal of Epidemiology, David A. Savitz, Kristi-Anne Tolo, and Charles Poole examined 246 articles published in the Journal around the years 1970, 1980, and 1990. The articles were divided into three categories: infectious disease epidemiology, cancer epidemiology, and cardiovascular disease epidemiology. Each category contained for each date a minimum of 25 articles. The main findings are presented in Figure 4, “Percent of articles published in the American Journal of Epidemiology classified as partially or completely reliant on statistical significance testing for the interpretation of the study results, by topic and time period” (Savitz et al. 1994: 1050). The findings are not surprising. The study shows that in 1990 some 60% to 70% of all cardiovascular and infectious disease epidemiologists relied exclusively on statistical significance as a criterion of epidemiological importance, as though fit were the same thing as importance. A larger share rely on the fallacy of the transposed conditional. The abuse was worse in 1990 than [...]

The cancer researchers were less enchanted with statistical significance than cardiological and infectious disease researchers were, but did not reach standards of common sense. Savitz, Tolo, and Poole found that after a 60% reliance on mere statistical significance in the early 1970s, the abuse of p-values by cancer researchers actually fell. We don’t know why. Maybe too many people had died. Still, 40% of all the cancer research articles in 1990 relied exclusively on Fisher’s [...]

In epidemiology, then, the “sizeless stare,” as we call it, of statistical significance is relatively recent, cancer research being an exception. In 1970 only about 20% of all articles on infectious disease epidemiology relied exclusively on tests of statistical significance. Confidence intervals and power calculations were of course absent. But epidemiology was not then an entirely statistical science. Only about 40% of all empirical articles in infectious disease epidemiology employed some kind of statistical test. But significance took hold, and by 1980 some 40% relied exclusively on the tests (compare our “Question 16” in economics, where in the 1980s it was about 70%). And by 1990, most subfields of epidemiology had like economics and psychology become predominately Fisherian.
Statistical significance came to mean “epidemiological significance.” Statistical insignificance came to mean “ignore the [...]

Douglas G. Altman, a statistician and cancer researcher at the Medical Statistics Laboratory in London, has been watching the use of medical statistics, and especially the deployment of significance testing, for 20 years. In 1991 Altman published an article called “Statistics in Medical Journals: Developments in the 1980s.” The article appeared in Statistics in Medicine. Altman’s experience had been similar to ours in economics. At conferences and seminars and the like Altman’s colleagues were convinced that the abuse of t-testing had by the 1980s abated, and was practiced only by the less competent medical scientists. Any thoughtful reader of the journals knew that such claims were false. To bias the results in favor of the defenders of the status quo Altman examined the first 100 “original articles” published in the 1980s in the New England Journal of Medicine. These were new and full-length research articles based on never-before released or published data from clinical studies or other methods of observation. Altman’s sample design was meant to replicate for comparative purposes an earlier study by Emerson and Colditz 1983, who studied the matter [...]

The Findings

It is my impression that the trends noted by Felson et al. have continued throughout the 1980s. . . . The obsession with significant p values [...]
(1) Reporting of [statistically] significant results rather than those of most importance (especially in abstracts).
(2) The use of hypothesis tests when none is appropriate (such as for comparing two methods of measurements or two observers).
(3) The automatic equating of statistically significant with clinically important, and non-significant with non-existent.
(4) The designation of studies that do or do not “achieve” significance as “positive” or “negative” respectively, and the common associated phrase “failed to reach statistical significance”. . . . A review [by other investigators] of 142 articles in three general medical journals found that in almost all cases (1076/1092) researchers’ interpretations of the “quantitative” (that is, clinical) significance of their results agreed with statistical significance. Thus across all medical areas and sample size p rules, and p < 0.05 rules most. It is not surprising if some editors share these attitudes, as most will have passed through the same research phase of their careers and some are still active [...]

Altman was not surprised when he found in medicine, as we were not surprised in economics, that his colleagues were deluding themselves. “I noted in the first issue of Statistics in Medicine that most journals gave much more attention to the format of references in submitted articles than they gave to the statistical content,” Altman wrote. “This remains true” (Altman 1991: 1900). Editors are much exercised, he observed with gentle sarcasm, over whether to use “P, p, P, or p values” (1991: 1902)—but pay no heed to oomph. “It is impossibly idealistic,” Altman believed, “to hope that we can stop the misuse of statistics, but we can apply a tourniquet . . . by continuing to press journals to improve their ways” (1991: 1908).

Steven Goodman, in a meaty piece on the “p-value fallacy” published in the Annals of Internal Medicine, observed ruefully, “biological understanding and previous research play little formal role in the interpretation of quantitative results.” That is, Bayes’ Theorem is set aside, as is the total quality management of medical science, the seeing of results in their context of biological common sense. “This [narrowly Fisherian] statistical approach,” Goodman writes, “the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements” (Goodman 1992, [...]

Altman, Savitz, Goodman, and company are not singletons. According to Altman, between 1966 and 1986 fully 150 articles were published criticizing the use of statistics in medical research (Altman 1991: 1897). The studies agreed that R. A. Fisher significance in medical science had become the nearly exclusive technique for making a quantitative decision and that statistical significance had become in the minds of medical writers equated increasingly, and erroneously, with [...]

As early as 1978 the situation was sufficiently dire that two contributors to the New England Journal of Medicine, Drummond Rennie and Kenneth J. Rothman, published op-ed pieces in the journal pages about the matter (Rennie 1978; Rothman 1978). Rennie, the deputy editor of the journal—and in 2006 the deputy editor of the Journal of the American Medical Association—was not critical of his colleagues’ practice. But Rothman, who was a young associate professor at Harvard, and the youngest member of the editorial board, blasted away. In “A Show of Confidence,” he made a crushing case for measuring clinical significance, not statistical significance. Citing the Freiman et al. (1978) article on “71 Negative Clinical Trials,” Rothman argued that the measurement and interpretation of size of effects, confidence intervals, and examination of power functions with respect to effect size (à la Freiman et al. by graphical demonstration) was the better way forward. Rothman—an epidemiologist and biostatistician with a life-long interest in the rhetoric of his fields—wanted secretly to ban the t-test altogether. Rennie and the other editors decided on a different solution. Original articles would be subjected to a pre-publication screening by a professional statistician. Rothman was at first hopeful, thinking statistical review would repair the Journal. The director of statistical reviews was well chosen—the late Frederick Mosteller (1916–2006), the founder of Harvard’s Statistics Department and a giant of 20th-century data analysis. But Mosteller was only the director, not the worker. Rothman tells us that he as the inside critic and Mosteller as the outside director had not been able to do anything together to raise the standards (Mosteller to Ziliak and McCloskey, University of Chicago, 21 May 2005; KJ Rothman to Ziliak, 30 January 2006). The problem with pre-publication statistical review, of course, is that the articles go not to the Rothmans and Mostellers and Kruskals but out to Promising Young Jones in the outer office dazzled by his recently mastered 5% textbooks. An example nowadays is the “Statistical Analysis Plan” or, aptly acronymized, “SAP,” which lays down the minimum statistical criteria considered acceptable by the Food and Drug Administration.

Like Gosset, Jeffreys, and Zellner, Rothman doubted the philosophical grounding of p values (Rothman 1990: 334). As [...]

If P is small, that means that there have been unexpectedly large departures from prediction [under the null hypothesis]. But why should these be stated in terms of P? The latter gives the probability of departures, measured in a particular way, equal to or greater than the observed set, and the contribution from the actual value [of the test statistic] is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure. On the face of it the fact that such results have not occurred might more reasonably be taken as evidence for the law [or null hypothesis], not against it. The same applies to all the current significance tests based on P integrals. (Jeffreys 1961, quoted by Zellner 1984: 288; emphasis in original; editorial insertions [...]

Rothman complained in his editorial in the New England Journal that Fisherian “testing . . . is equivalent to funneling all interest into the precise location of one boundary of a confidence interval” (Rothman 1978: 1363). In 1986 the situation was the same: “Declarations of ‘significance’ or its absence can supplant the need for any real interpretation of data; the declarations can serve as a mechanical substitute for thought, promulgated by the inertia of training and common practice” [...]

Rothman then became assistant editor of the American Journal of Public Health. The chief editor of the American Journal of Public Health “seemed to be sympathetic” with Rothman’s views—Rothman recalls one time when the editor backed him up in a little feud with a well-placed statistician. Still, Rothman’s views hardly set journal policy, and it shows in the journal. Rothman finally found his chance when in 1990, after 15 years of quiet struggle, he started his own journal, Epidemiology. His editorial letter to potential authors was [...]

When writing for Epidemiology, you can . . . enhance your prospects if you omit tests of statistical significance. . . . In Epidemiology, we do not publish them at all. . . . We discourage the use of this type of thinking in the data analysis, such as in the use of stepwise regression. We also would like to see the interpretation of a study based not on statistical significance, or lack of it, for one or more study variables, but rather on careful quantitative consideration of the data in light of competing explanations for the findings. For example, we prefer a researcher to consider whether the magnitude of an estimated effect could be readily explained by uncontrolled confounding or selection biases, rather than simply to offer the uninspired interpretation that the estimated effect is “significant.”. . . Misleading signals occur when a trivial effect is found to be “significant,” as often happens in large studies, or when a strong relation is found “nonsignificant,” as often happens in small studies. (Rothman 1990: 334)

Rothman concluded the letter by offering advice on how to publish quantitatively, epidemiologically significant figures, such as odds ratios on specific medical risks, bounded by confidence intervals.

Now with his own journal, Rothman was going to get it right. In January 1998 he and the associate editors Janet Lang and Cristina Cann published another luminous editorial, “That Confounded P-Value” (Lang et al. 1998). They “reluctantly” (p. 8) agreed to publish p-values when “no other” alternative was at hand. But they strongly suggested that authors of submitted manuscripts illustrate “size of effect” (p. 7) in “figures”—in plots of effect size lines against well-measured [...]

Rothman and his associates were and are not alone, even in epidemiology. The statistician James O. Berger (2003) has recently shown how epidemiologists and other sizeless scientists go wrong with p-values. Use of Berger’s applet, a public-access program, shows Rothman’s skepticism to be empirically sound (http://www.stat.duke.edu/~berger). The program simulates a series of tests, recording how often a null hypothesis is “true” in a range of different p-values. Berger cites a 2001 study by the epidemiologists Sterne and Davey Smith, which found that “roughly 90% of the null hypotheses in the epidemiology literature are initially true.” Berger reports that even when p “is near 0.05, at least 72%—and typically over 90%” of the null hypotheses will be true (Sterne and Davey Smith 2001; Berger 2003: 4). Berger agrees with Rothman and the authors here that on the contrary “true” is a matter of judgment—a judgment of epidemiological, not mere statistical, significance. It is about the quality of the water from the [...]

Rothman’s letter itself elicited no response. This is our experience, too: Many of the Fisherians, to put it bluntly, seem to be less than courageous in defending their views. Hardly ever have we seen or heard an attempt to provide a coherent—or indeed any—response to the case against null-hypothesis testing for “significance.” The only published response that
Rothman can recollect in epidemiology came from J. L. Fleiss, a prominent biostatistician, in the American Journal of Public Health published in 1986. But Fleiss merely complained that “an insidious message is being sent to researchers in epidemiology that tests of significance are invalid and have no place in their research” (Fleiss 1986: 559). He gave no actual arguments for giving Fisherian practices a place in research. This is similar to our experience. Kevin Hoover and Mark Siegler offered in 2005 (published 2008, with our detailed reply) the only written response to our complaints in economics that we have seen. Courageous though it was for them to venture out in defense of the Fisherian conventions, a sterling exception to the spinelessness of their colleagues, they could offer no actual arguments (though they did catch us in a most embarrassing failure to take all the data from the American Economic Review in the 1990s). Hoover and Siegler merely wax wroth for many [...]

Even the rare courageous Fisherians, in other words, do not deign to make a case for their procedures. They merely complain that the procedures are being criticized. “Other defenses of [null hypothesis significance testing],” Fidler et al. observed, “are hard to find” (Fidler et al. 2004: 121). The Fisherians, being comfortably in control, appear inclined to leave things as they are, sans argument. One can understand. If you don’t have any arguments for an intellectual habit of a lifetime, [...]

Rothman’s campaign did not succeed. Fidler et al. (2004) found, as we and others have found in economics and psychology and in other fields of medicine, that epidemiology is getting worse, despite Rothman’s letter. Over 88% of more than 700 articles they reviewed in Epidemiology (between 1990 and 2000) and the American Journal of Public Health (between 1982 and 2000) failed, they find, to distinguish and interpret substantive significance. In the American Journal of Public Health, some 90% confused a statistically significant result with an epidemiologically significant result, and equated statistical insignificance with substantive unimportance. Epidemiology journals, in other words, performed worse than the New England Journal of Medicine, Rothman’s training-ground [...]

Fidler and her coauthors (2004) observe that for decades “advocates of statistical reform in psychology have recommended confidence intervals as an alternative (or at least a supplement) to p values.” The American Psychological Association Publication Manual called them in 2001 “the best reporting strategy,” though few seem to be paying attention (APA Manual 2001: 22 in Fidler et al. 2004: 119; Fidler 2002). Since the mid-1980s, confidence intervals have been widely reported in medical journals. Unhappily, requiring the calculation of confidence intervals does not guarantee that effect sizes will be interpreted more carefully, or indeed at all. Savitz et al. find that even though 70% of articles in the American Journal of Epidemiology report confidence intervals “inferences are made regarding statistical significance tests, often based on the location of the null value with[out] respect to the bounds of the confidence interval” (1994: 1051). In other words, say Fidler and her coauthors, confidence intervals “were simply used to do [the null hypothesis testing ritual]” (Fidler et al. [...]

Fidler and her coauthors (2004) attempted as we have to assemble outside allies. They “sought lessons for psychology from medicine’s experience with statistical reform by investigating two attempts by Kenneth Rothman to change statistical practices.” They examined 594 American Journal of Public Health articles published between 1982 and 2000 and 110 Epidemiology articles published in 1990 and 2000:

Rothman’s editorial instruction to report confidence intervals and not p values was largely effective: In AJPH, sole reliance on p values dropped from 63% to 5%, and confidence interval reporting rose from 10% to 54%; Epidemiology showed even stronger compliance. However, compliance was superficial: Very few authors referred to confidence intervals when discussing results. The results of our survey support what other research has indicated: Editorial policy alone is not a sufficient mechanism for statistical reform. (Fidler et al. 2004: [...]

Rothman himself has said of his attempt to reduce p-value reporting in his Epidemiology that “my revise-and-resubmit letters . . . were not a covert attempt to engineer a new policy, but simply my attempt to do my job as I understood it. Just as I corrected grammatical errors, I corrected what I saw as conceptual errors in describing data” (quoted in Fidler et al. [...]

Fidler’s team studied the American Journal of Public Health and Epidemiology before, during, and after Rothman’s editorial stints; before and after the International Committee of Medical Journal Editors’ creation of statistical regulations encouraging the analysis of effect size; and before and after the changes to the AJPH’s “Instructions to Authors” encouraging the use of confidence intervals. Rothman as an assistant editor, of course, did not make policy at the journal. He made his own preferences known to authors, but ultimately he “carried out the editor’s policy,” which only occasionally overlapped with Rothman’s ideal (Rothman to Ziliak, email communication, [...]

Fidler et al. counted a statistical practice “present,” such as what we call “asterisk biometrics,” the ranking of coefficients according to the size of the p-value, if an article contained at least one instance of it. Their full questionnaire is similar to ours in economics (Ziliak and McCloskey 2008: 62–92), focusing on substantive as against statistical significance testing. Did “significant” mean “epidemiologically important” or “statistically significant”? Practice was recorded as ambiguous if the author or authors did not preface “significant” with “statistically,” follow the statement of significance directly with a p-value or test statistic, or otherwise differentiate between statistical and substantive interpretations. “Explicit power” in their checklist means “did a power calculation.” “Implicit power” means some mention of a relationship between sample size, effect size, and statistical significance was made—for example, a reference to small sample size as perhaps explaining failure to find statistical significance. The results, alas, “Of the 594 AJPH articles, 273 (46%) reported NHST. In almost two thirds of the cases ‘significant’ was used ambiguously. Only 3% calculated power and 15% reported ‘implied power.’ . . . An overwhelming 82% of NHST articles had neither an explicit nor implicit reference to statistical power, even though all reported at least one non-significant [...]

Fifty-four percent of American Journal of Public Health articles reported confidence intervals; 86% did in Epidemiology. But “Table 2 shows that fewer than 12% of AJPH articles with confidence intervals interpreted them and that, despite fully 86% of articles in Epidemiology reporting confidence intervals, interpretation was just as rare in that journal” (Fidler et al. 2004: 122). The situation, they find, did not improve with the years. The authors usually did not refer in their texts to the width of their confidence intervals, and did not discuss what is epidemiologically or biologically or socially, or clinically significant in the size of the effect. In other words, [...]

[...] 1961, though, doctors have lost many of their skills of physical assessment, even with the stethoscope (and certainly with their hands), and have come to rely on a medical literature deeply infected with Fisherianism. Shyrock’s piece appeared in a special issue of Isis on the history of quantification in the sciences, mostly celebrating the statistical side of it. Puzzlingly, none of the contributors to the symposium mentioned the Gosset-Fisher-Neyman-Pearson-Jeffreys-Deming-Savage complex. Fisher-significance, the omission suggests, was not to be put on trial. The inference machines remained broken.

By 1988 the International Committee of Medical Journal Editors had been sufficiently pressured by the Rothmans and the Altmans to revise their “uniform requirements for manuscripts submitted to biomedical journals.” “When possible,” the Committee wrote, “quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid sole reliance on statistical hypothesis testing, such as the use of p values, which fail to convey important quantitative information” (ICMJE 1988: 260). The formulation is not ideal. The “error” in question is tacitly understood to be sampling error alone, when after all a good deal of error does not arise from the smallness of samples. “Avoid sole reliance” on the significance error should be “don’t commit” the significance error. The “important quantitative information” is effect size, which should have been mentioned explicitly.
Still, it was a good first step, and in during the past two decades more than 600 of some 700 articles 1988 among the sizeless sciences was amazing.
published in the leading journals of public health and epidemi- The Requirements—on which at a formative stage ology showed no concern with epidemiological significance.
Rothman among others had contributed an opinion—were Thus too economics, sociology, population biology, and other widely published. They appeared for instance in the An- nals of Internal Medicine—where later the Vioxx study was When in 2000 Rothman left his post as editor of Epi- published—and in the British Medical Journal. More than demiology, confidence-interval reporting remained high—it 300 medical and biomedical journals, including the American had become common in medical journals. But in the American Journal of Public Health, notified the International Committee Journal of Public Health reporting of unqualified p “again be- of their willingness to comply with the manuscript guidelines came common.” Rothman’s success at Epidemiology appears (Fidler et al. 2004: 120). But the Requirements have not helped.
to have been longer lasting. Still, interpretation in other jour- The essence of the problem of reform—and the proof nals of epidemiology is rare. “In both journals [Fidler et al.
that we need to change academic and institutional incen- should add ‘but not in Epidemiology’] . . . when confidence in- tives, including criteria for winning grants—is well illus- tervals were reported, they were rarely used to interpret results trated in a study of “temptation to use drugs” published in or comment on [substantive] precision. This rather ominous the Journal of Drug Issues. The study was financed by the finding holds even for the most recent years we surveyed” Centers for Disease Control. It was authored by two pro- (Fidler et al. 2004: 123). Fidler and her team confirm in thou- fessors of public health at Emory University (one of them sands of tests what Savitz et al. (1994) found in the American was an Associate Dean for Research), and a third professor, Journal of Epidemiology in tens of thousands of tests and what a medical sociologist at Georgia State University. The study Rossi found in 39,863 tests in psychology and speech and ed- was conducted in Atlanta between August 1997 and August ucation and sociology, and management (Rossi 1990: 648).
2000. Its subjects were African-American women—mothers The historian of medicine Richard Shyrock argued in an and their daughters—living in low-income neighborhoods of early paper that instruments such as the stethoscope and the Atlanta (Klein et al. 2003: 167). The dependent variable was X-ray machine saved some parts of medicine from the Fish- “frequency-of-[drug] use and times-per-day” multiplied for erian pitfall. If one can see or hear the problem, one does each drug type and summed by month. In the 125 women not need to rely on correlations (Shyrock 1961: 228). Since studied the value of the dependent variable ranged from zero Deirdre N. McCloskey and Stephen T. Ziliak to 910, that is, from zero to an appalling 30 drug doses a or policy oomph of such a temptations-to-use-drugs study? In September 1978 Jennie A. Freiman, Thomas C.
Statistical Significance Decides Everything
Chalmers, Harry Smith, Jr., and Roy R. Kuebler, doctors andstatistical researchers at Mt. Sinai in New York, published in Initially, each of the temptations-to-use drugs variables was entered the New England Journal of Medicine a study entitled “The into simple regression equations, to determine if they were statis- Importance of Beta, the Type II Error and Sample Size in the tically significant predictors of the outcome measure. Next, those Design and Interpretation of the Randomized Control Trial.” found to be related to amount of drug use reported were entered si- multaneously into a stepwise multiple regression equation. . . . Next,the bivariate relationships between the other predictor variables listed Seventy-one “negative” randomized control trials were re-examined earlier were examined one by one, using Student’s t tests whenever to determine if the investigators had studied large enough samples to the independent variable was dichotomous. . . . Items that were found give a high probability (>0.90) of detecting a 25 per cent and 50 per to be marginally—or statistically—significant predictors in these bi- cent therapeutic improvement in the response. Sixty-seven of the trials variate analyses were selected for entry into the multivariate equation.
had a greater than 10 per cent risk of missing a true 25 per cent ther- apeutic improvement, and with the same risk, 50 of the trials couldhave missed a 50 per cent improvement. Estimates of 90 per cent con- The authors do at least report mean values of the temp- fidence intervals for the true improvement in each trial showed that in tations to use drugs—a first step in determining substantive 57 of these “negative” trials, a potential 25 per cent improvement was significance. For example, they report that women were “least possible, and 34 of the trials showed a potential 50 per cent improve- tempted to use drugs when they were: talking and relaxing ment. Many of the therapies labeled as “no different from control” in (74.0%), experiencing withdrawal symptoms (73.3%), [and] trials using inadequate samples have not received a fair test. Concern waking up and facing a difficult day (70.7%). And they would for the probability of missing an important therapeutic improvement be tempted “quite a bit” or “a lot” when they were “with a because of small sample sizes deserves more attention in the planning partner or close friend who was using drugs (38.5%)” or when of clinical trials. (Freiman et al. 1978: 690; italics supplied) “seeing another person using and enjoying drugs (36.1%)” Freiman, who is a specialist in obstetrics and gynecology, (Klein et al. 2003: 170). Here is how they presented their and her colleagues, in other words, had reanalyzed 71 articles in medical journals. Heart and cancer-related treatments dom-inated the clinical trials under review. Each of the 71 articles When examined in bivariate analyses, 15 of the 16 temptations-to- concluded that the “treatment”—for example, “chemotherapy” use drugs items were found to be associated [that is, the authors or “an aspirin pill”—performed no better in a clinical sense assert, statistically significantly related with; not substantively signif-icantly related] with actual drug use. 
These were: while with friends than did the “control” of nontreatment or a placebo. That is, at a party (p < .001), while talking and relaxing (p < .001), while with a partner or close friend who is using drugs (p < .001), while Freiman et al. (1978) found that if the authors of the origi- hanging around the neighborhood (p < .001), when happy and cele- nal studies had considered the power of their tests—the proba- brating (p < .001), when seeing someone using and enjoying drugs bility of rejecting the null hypothesis “[treatment] no different (p < .05), when waking up and facing a tough day (p < .001), from control” as the treatment effect moves in the direction of when extremely anxious and stressed (p < .001), when bored (p < “vast improvement”—and in conjunction with effect size, the .001), when frustrated because things are not going one’s way (p < experiments would not have ended “negatively.” That is, the .001), when there are arguments in one’s family (p < .05), when in a clinicians conducting the original studies would have found place where everyone is using drugs (p < .001), when one lets down that indeed the treatment therapy was capable of producing concerns about one’s health (p < .05), when really missing the drug habit and everything that goes with it (p < .010), and while experi- Specifically, Freiman et al. (1978) found that if fully 50 encing withdrawal symptoms (p < .01). (Klein et al. 2003: 171–172) of the 71 trials had paid attention to power and effect size and “The only item that was not associated with the amount not merely to a one-sided, qualitative, yes/no interpretation of drugs women used,” the article concluded, “was ‘when one of “significance,” they would have reversed their conclusions.
realized that stopping drugs was extremely difficult’ ” (Klein Astonishingly, they would have found up to “50 per cent im- et al. 2003: 172). This is surely a joke, some will think, perhaps provement” in “therapeutic effect.” The Fisherian tests of sig- a belated retaliation for the 1990s Social Text scandal, in which nificance, the only tests employed by the original authors of a scientist posed as a postmodern theorist in order to expose the 71 studies, literally could not see the beneficial effects of its intellectual pretense. It’s not. It’s normal science in biol- the therapies under study, though staring at them.
ogy, medicine, psychiatry, economics, psychology, sociology, The precise standard of improvement—the minimum education, and many other fields. But what is the scientific standard of oomph the authors set—is a “reduction in mortality The Unreasonable Ineffectiveness of Fisherian “Tests” in Biology, and Especially in Medicine from the control [group] mortality rate,” a baseline rate of 29.7 treatment of prostate cancer—could increase the likelihood of per cent (Freiman et al. 1978: 691). They realize, it is not a patient survival by an average of 12% (the 95% confidence very strict standard of medical oomph. They are bending over interval in the pooled data put an upper bound on flutamide- backwards not to find their colleagues mistaken. Like Gosset, enhanced survival at about 20% [Rothman et al. 1999]). Odds they want to give their Fisherian colleagues the benefit of the of 5 in 100 are not the best news to deliver to a prostate patient.
But if castration followed by death is the next best alterna- Yet, they found that 70% of the alleged “negative” trials tive, a noninvasive 12% to 20% increase in survival sounds were stopped, missing an opportunity to reduce the mortality of their patients by up to 50%. Of the patients who were But in 1998 the results of still another, eleventh trial were prescribed sugar pills or otherwise dismissed, in other words, published in the New England Journal of Medicine (Eisen- about 30% died unnecessarily. In one typical article the authors berger et al. 1998: 1036–1042). The authors of the new study in fact missed at α = 0.05 a 25% reduction in mortality with found a similar size effect. But when the two-sided p-value for probability of about 0.77 and, at the same level of Type-I error, their odds ratio came in at 14, they dismissed the efficacious a 50% reduction with probability about 0.42 (Freiman et al.
drug, concluding “no clinically meaningful improvement” (pp.
1036, 1039). Kenneth Rothman, Eric Johnson, and David Sug- Each of the 71 experiments was shut down on the belief ano (1999) examined the individual and pooled results of the that a 30% death rate was equally likely with the sugar pill 11 separate studies, including the study conducted by Eisen- (or whatever the control was) and with the treatment therapy, spurning opportunities to save lives. The article shows that in One might suspect that [Eisenberger et al.’s] findings were at odds the original experiments as few as 15% of the patients receiving with the results from the previous ten trials, but that is not so. From the treatment therapy would have died had the experiment 697 patients randomized to flutamide and 685 randomized to placebo, continued—half as many as actually died.
Eisenberger and colleagues found an OR of 0·87 (95% CI 0·70– We agree with Rothman that the article seems in the end to 1·10), a value nearly identical to that from the ten previous studies.
lose contact with the effect size, at times advising that power be Eisenberger’s interpretation that flutamide is ineffective was based on treated “dichotomously” and rigidly irrespective of effect size absence of statistical significance. (Rothman et al. 1999: 1184) (Rothman and Ziliak, personal interview, 30 January 2006).
Rothman and his coauthors depict the flutamide effect “Important information can be found on the edges,” as Roth- graphically in a manner consistent with a Gosset-Jeffreys- man put it. But overall, Rothman and we agree that it’s a crush- Deming approach. That is, they pool the data of the separate ing piece. The oomph-laden content of their work is exemplary.
studies and plot the flutamide effect (measured by an odds Freiman and her colleagues note that the experiments and 71 ratio, or the negative of the survival probability in a hazard oomph-less, premature truncations were conducted by leading function) against a p-value function. With the graphical ap- medical scientists. Such premature results were published in proach, Rothman and his coauthors are able to show pictorially Lancet, the British Medical Journal, the New England Journal how the p-values vary with increasingly positive and increas- of Medicine, the Journal of the American Medical Association, ingly negative large effects of flutamide on patient survival.
and other elite journals. Effective treatments for cardiovascu- And what they show is substantively significant: lar and cancer, and gastrointestinal patients were abandonedbecause they did not attain statistical significance at the 5% or Eisenberger’s new data only reinforce the findings from the earlier studies that flutamide provides a small clinical benefit. Adding the In 1995 the authors of 10 independent and randomized latest data makes the p value function narrower, which is to say clinical trials involving thousands of patients in treatment and that the overall estimate is now more precise, and points even more control groups had come to an agreement on an effect size.
clearly to a benefit of about 12% in the odds of surviving for patients Consensus on a mere direction of effect—up or down, positive or negative—is rare enough in science. After four centuries of Rothman et al. (1999) conclude, “the real lesson” from the public assistance for the poor in the United States and Western latest study is “that one should eschew statistical significance Europe for example, economists do not speak with one voice on testing and focus on the quantitative measurement of effects.” the direction of effect on labor supply exerted by tax-financed That sounds right. Statistical significance is spoiling income subsidies (Ziliak and Hannon 2006). Medicine is no biological science, is undermining medical treatment, and different. Disagreement on the direction of effect—let alone is killing people. It is leaving a great deal, shall we say, the size of effect—is more rule than exception.
So the Prostate Cancer Trialists’ Collaborative Group was understandably eager to publicize the agreement. Each of the 10 studies showed that a certain drug “flutamide”—for the 1. This paper is a revision of chapters 14–16 in Ziliak and McCloskey 2008.
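The p-value function that Rothman and his coauthors plot can be approximated from the single trial summary they quote, an odds ratio of 0·87 with a 95% confidence interval of 0·70 to 1·10. The sketch below is not their method or their code. It assumes a normal approximation on the log-odds scale, backs the standard error out of the reported (rounded) interval, and uses the hypothetical function names `se_from_ci` and `p_value_function`. Because the inputs are rounded, the p-value it gives at an odds ratio of 1 is only indicative and need not match the figure published by Eisenberger et al.

```python
import math
from statistics import NormalDist

# Two-sided 95% critical value of the standard normal (about 1.96).
Z95 = NormalDist().inv_cdf(0.975)


def se_from_ci(lower, upper, z=Z95):
    """Assumed: the CI is symmetric on the log scale, so its width is 2*z*SE."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def p_value_function(or_hat, se_log_or, or_null):
    """Two-sided p-value for the hypothesis that the true odds ratio is or_null."""
    z = (math.log(or_hat) - math.log(or_null)) / se_log_or
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Summary figures quoted from Rothman et al. (1999) for the Eisenberger trial.
OR_HAT, CI_LOW, CI_HIGH = 0.87, 0.70, 1.10
SE = se_from_ci(CI_LOW, CI_HIGH)

if __name__ == "__main__":
    # Trace the curve over a few hypothesised odds ratios, not just OR = 1.
    for or_null in (0.70, 0.80, 0.87, 0.88, 1.00, 1.10):
        p = p_value_function(OR_HAT, SE, or_null)
        print(f"hypothesised OR = {or_null:.2f}  two-sided p = {p:.3f}")
```

Read this way, the hypothesis of no effect (an odds ratio of 1) is only one point on a curve. Under these assumptions a benefit of about 12% (an odds ratio near 0.88) is far more compatible with the data than no effect at all, which is the reason for looking at the whole function rather than at a single test.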
References

Altman DG (1991) Statistics in medical journals: Developments in the 1980s. Statistics in Medicine 10: 1897–1913.
American Psychological Association (APA) 1952 to 2001 [revisions] Publication Manual of the American Psychological Association. Washington,
Berger JO (2003) Could Fisher, Jeffreys, and Neyman have agreed on testing?
Cohen J (1994) The earth is round (p < 0.05). American Psychologist 49:
David FN, ed (1966) Research Papers in Statistics: Festschrift for J. Neyman.
Eisenberger MA, Blumenstein BA, Crawford ED, Miller G, McLeod DG, Loehrer PJ, Wilding G, Sears K, Culkin DJ, Thompson IM, Bueschen AJ, Lowe BA (1998) Bilateral orchiectomy with or without flutamide for metastatic prostate cancer. New England Journal of Medicine 339: 1036–1042.
Fidler F (2002) The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psycho-
Fidler F, Thomason N, Cumming G, Finch S, Leeman J (2004) Editors can lead researchers to confidence intervals but they can’t make them think: Statistical reform lessons from medicine. Psychological Science 15: 119–
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A 222: 309–368.
Fisher RA (1926) Bayes’ Theorem. Eugenics Review 18: 32–33.
Fisher RA ([1956] 1959) Statistical Methods and Scientific Inference, 2nd ed.
Fleiss JL (1986) Significance tests do have a role in epidemiological research: Reaction to AA Walker. American Journal of Public Health 76: 559–
Freiman JA, Chalmers T, Smith H, Kuebler RR (1978) The importance of beta, the type II error and sample design in the design and interpretation of the randomized control trial: Survey of 71 negative trials. New England Journal of Medicine
Goodman S (1999a) Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine 130: 995–1004.
Hoover K, Siegler M (2008) Sound and fury: McCloskey and significance testing in economics. Journal of Economic Methodology 15: 1–
International Committee of Medical Journal Editors (ICMJE) (1988) Uniform requirements for . . . statisticians and biomedical journal editors. Statistics
Jeffreys H (1963) Review of L. J. Savage, et al., The Foundations of Statistical Inference (Methuen, London and Wiley, New York, 1962). Technometrics
Klein H, Elifson KW, Sterk CE (2003) Perceived temptation to use drugs and actual drug use among women. Journal of Drug Issues 33: 161–192.
Lang JM, Rothman KJ, Cann CI (1998) That confounded p-value. Epidemiology
Pearson ES (1990) [posthumously published by Plackett RL, Barnard GA, eds] ‘Student’: A Statistical Biography of William Sealy Gosset. Oxford:
Rennie D (1978) Vive la Difference (p < 0.05). New England Journal of Medicine
Rossi J (1990) Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology 58:
Rothman KJ (1978) A show of confidence. New England Journal of Medicine
Rothman KJ (1986) Modern Epidemiology. New York: Little, Brown.
Rothman KJ (1990) Writing for epidemiology. Epidemiology 9: 333–337.
Rothman KJ, Johnson ES, Sugano DS (1999) Is flutamide effective in patients with bilateral orchiectomy? Lancet 353: 1184.
Savitz DA, Tolo K, Poole C (1994) Statistical significance testing in the American Journal of Epidemiology, 1970–1990. American Journal of Epidemiology
Shyrock RH (1961) The history of quantification in medical science. Isis 52:
Sterne JAC, Davey Smith G (2001) Sifting the evidence—What’s wrong with significance tests? British Medical Journal 322: 226–231.
Zabell S (1989) R. A. Fisher on the history of inverse probability. Statistical
Zellner A (1984) Basic Issues in Econometrics. Chicago: University of
Ziliak ST, Hannon J (2006) Public assistance: Colonial times to the 1920s. In Historical Statistics of the United States. (Carter SB, Gartner SS, Haines MR, Olmstead AL, Sutch R, Wright G, eds). New York: Cambridge University Press.
Ziliak ST, McCloskey DN (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI:

Source: http://www.deirdremccloskey.com/docs/fisherian.pdf
