One of the most discussed, debated and controversial issues in medical research is the p-value. I came across Ronald Fisher's take on this matter in his 1925 work, Statistical Methods for Research Workers (the text of which has been made available online by Christopher D. Green of York University):
“The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.”
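Fisher's arithmetic here is easy to check for ourselves. The sketch below (a minimal illustration, not part of Fisher's text) uses the standard normal distribution to confirm that a deviation of 1.96 standard deviations corresponds to P=0.05, and that a deviation of 2 standard deviations is exceeded roughly once in 22 trials:

```python
import math

def two_sided_p(z):
    """Probability that a standard normal deviate exceeds z in absolute value.

    P(|Z| > z) = erfc(z / sqrt(2)), using only the standard library.
    """
    return math.erfc(z / math.sqrt(2))

# 1.96 standard deviations: the familiar 5 per cent (1 in 20) level
p_196 = two_sided_p(1.96)
print(f"P(|Z| > 1.96) = {p_196:.4f}")   # approximately 0.05

# 2 standard deviations: Fisher's "once in 22 trials"
p_2 = two_sided_p(2.0)
print(f"P(|Z| > 2)    = {p_2:.4f}, i.e. about 1 in {1 / p_2:.0f} trials")
```

So the seemingly odd "once in 22 trials" follows directly from rounding 1.96 up to 2: the two-sided tail beyond 2 standard deviations is about 0.0455, or roughly 1/22.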
Further, in his 1926 work, The Arrangement of Field Experiments, published in the Journal of the Ministry of Agriculture of Great Britain, Fisher states:
… it is convenient to draw the line at about the level at which we can say: “Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials.”…
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.
Two things clearly emerge from the writings of Fisher in this period:
b. In the last sentence of the quote above, Fisher states something which has escaped the attention of the current crop of medical researchers: reproducibility of results. In our pursuit of novelty, we have forgotten that it is still essential to be able to test and re-test hypotheses. This is particularly crucial for medical and public health research, where a number of extraneous factors may be at play.
All said and done, the holy grail of statistical significance, the deeply sought-after p-value, was essentially an empirical decision based on the judgement of one person (albeit a great one!). The fact that we have been using it without major changes for almost a hundred years goes a long way, empirically, to show that it was not a bad assumption to begin with. But this should serve as a reminder, especially for clinical and public health researchers, that statistics is a tool, and not the only one in our decision-making toolbox.