I asked Rick Gerkin to write a summary of his recent *eLife* paper commenting on a much-hyped *Science* paper on how many odours we can discriminate.

**On Proving Too Much in Scientific Data Analysis**

by Richard C. Gerkin

First off, thank you to Carson for inviting me to write about this topic.

Last year, *Science* published a paper by a group at Rockefeller University claiming that humans can discriminate at least a trillion smells. This was remarkable and exciting because, as the authors noted, there are far fewer than a trillion mutually discriminable colors or pure tones, and yet olfaction has been commonly believed to be much duller than vision or audition, at least in humans. Could it in fact be much sharper than the other senses?

After the paper came out in *Science*, two rebuttals were published in *eLife*. The first was by Markus Meister, an olfaction and vision researcher and computational neuroscientist at Caltech. My colleague Jason Castro and I had a separate rebuttal. The original authors have also posted a re-rebuttal of our two papers (mostly of Meister's paper), which has not yet been peer reviewed. Here I'll discuss the source of the original claim, and the logical underpinnings of the counterclaims that Meister, Castro, and I have made.

How did the original authors support their claim in the *Science* paper? Proving this claim by brute force would have been impractical, so the authors selected a representative set of 128 odorous molecules and then tested a few hundred random 30-component mixtures of those molecules. Since many mixture stimuli can be constructed in this way but only a small fraction can be practically tested, they tried to extrapolate their experimental results to the larger space of possible mixtures. They relied on a statistical transformation of the data, followed by a theorem from the mathematics of error-correcting codes, to estimate - from the data they collected - a lower bound on the actual number of discriminable olfactory stimuli.

The two rebuttals in *eLife* are mostly distinct from one another but have a common thread: both effectively identify the *Science* paper's analysis framework with the logical fallacy of 'proving too much', which can be thought of as a form of *reductio ad absurdum*. An argument 'proves too much' when it (or an argument of parallel construction) can prove things that are known to be false. For example, the 11th-century theologian St. Anselm's ontological argument [ed note: see previous post] for the existence of God states (in abbreviated form): "God is the greatest possible being. A being that exists is greater than one that doesn't. If God does not exist, we can conceive of an even greater being, that is one that does exist. Therefore God exists". But this proves too much because the same argument can be used to prove the existence of the greatest island, the greatest donut, etc., by making arguments of parallel construction about those hypothetical items, e.g. "The Lost Island is the greatest possible island...", as shown by Anselm's contemporary Gaunilo of Marmoutiers.

One could investigate further to identify more specific errors in logic in Anselm's argument, but this can be tricky and time-consuming. Philosophers have spent centuries doing just this, with varying levels of success. But simply showing that the argument proves too much is sufficient to at least call the conclusion into question. This makes 'proves too much' a rhetorically powerful approach. In the context of a scientific rebuttal, leading with a demonstration that this fallacy has occurred piques enough reader interest to justify a dissection of more specific technical errors. Both *eLife* rebuttals use this approach, first showing that the analysis framework proves too much, and then exploring the source(s) of the error in greater detail.

How does one show that a particular detailed mathematical analysis 'proves too much' about experimental data? Let me reduce the analysis in the *Science* paper to the essentials, and abstract away all the other mathematical details. The most basic claim in that paper is based upon what I will call 'the analysis framework':

$$d = g(\text{data}) \qquad (1)$$

$$z^* = f(d) \qquad (2)$$

$$z > z^* \qquad (3)$$

The authors did three basic things. First, they extracted a critical parameter $d$ from their data set using a statistical procedure I'll call $g$. $d$ represents an average threshold for discriminability, corresponding to the number of components by which two mixtures must differ to be barely discriminable. Second, they fed this derived value, $d$, into a function, $f$, that produces a number of odorous mixtures $z^*$. Finally, they argued that the number so obtained necessarily underestimates the 'true' number of discriminable smells, $z$, owing to the particular form of $f$. Each step and proposition can be investigated:

1) How does the quantity $d$ behave as the data or form of $g$ varies? That is, is $g$ the 'right thing' to do to the data?

2) What implicit assumptions does $f$ make about the sense of smell - are these assumptions reasonable?

3) Is the stated inequality - which says that any number derived using $f$ will always underestimate the true value $z$ - really valid?

What are the rebuttals about? Meister's paper rejects equation 2 on the grounds that the function $f$ is unjustified for the current problem. Castro and I are also critical of $f$, but focus more on equations 1 and 3, criticizing the robustness of the statistical procedure $g$ and demonstrating that the inequality should be reversed (the last of which I will not discuss further here). So together we called everything about the analysis framework into question. However, all parties are enthusiastic about the data itself, as well as its importance, so great care should be taken to distinguish the quality of the data from the validity of the interpretation.

In Meister's paper, he shows that the analysis framework proves too much by using simulations of simple models, using either synthetic data or the actual data from the original paper. These simulations show that the original analysis framework can generate all sorts of values for the estimate $z^*$ which are known to be false by construction. For example, he shows that a synthetic organism constructed to have 3 odor percepts necessarily produces data which, when the analysis framework is applied, yield values of $z^*$ vastly greater than 3. Since we know by construction that the correct answer is 3, the analysis framework must be flawed. This kind of demonstration of 'proving too much' is also known by the more familiar term 'positive control': a control where a specific non-null outcome can be expected in advance if everything is working correctly. When instead of the correct outcome the analysis framework produces an incredible outcome reminiscent of the one reported in the *Science* paper, then that framework proves too much.

Meister then explores the reason the equations are flawed, and identifies the flaw in the function $f$. Imagine making a map of all odors, wherein similar-smelling odors are near each other on the map, and dissimilar-smelling odors are far apart. Let the distance between odors on the map be highly predictive of their perceptual similarity. How many dimensions must this map have to be accurate? We know the answer for a map of color vision: 3. Using only hue (H), saturation (S), and lightness (L) any perceptible color can be constructed, and any two nearby colors in an HSL map are perceptually similar, while any two distant colors in such a map are perceptually dissimilar. The hue and saturation subspace of that map is familiar as the 'color wheel', and has been understood for more than a century. In that map, hue is the angular dimension, saturation is the radial dimension, and lightness (if it were shown) would be perpendicular to the other two.

Meister argues that $f$ must be based upon a corresponding perceptual map. Since no such reliable map exists for olfaction, Meister argues, we cannot even begin to construct an $f$ for the smell problem; in fact, the $f$ actually used in the *Science* paper assumes a map with 128 dimensions, corresponding to the dimensionality of the stimulus, not the (unknown) dimensionality of the perceptual space. By using such a high-dimensional version of $f$, a very large value of $z^*$ is guaranteed, but unwarranted.
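To get a feel for why the assumed number of dimensions matters so much, here is a toy count. It is not Meister's model and not the function used in the *Science* paper: it simply assumes a grid-shaped map with a fixed number of distinguishable levels along every axis. The function name `grid_cells` and the choice of 2 levels per axis are hypothetical, picked only for illustration.

```python
import math


def grid_cells(levels_per_dim: int, n_dims: int) -> int:
    """Count cells on a grid with `levels_per_dim` distinguishable steps per axis."""
    return levels_per_dim ** n_dims


# Toy assumption: only 2 distinguishable levels along every axis of the map.
for n_dims in (3, 128):
    cells = grid_cells(2, n_dims)
    print(f"{n_dims:>3} dimensions -> about 10^{math.log10(cells):.1f} cells")
```

Even with the coarsest possible resolution per axis, moving from 3 to 128 dimensions takes the count from 8 to $2^{128}$, roughly $3.4 \times 10^{38}$. In this toy setting the size of the answer is driven almost entirely by the assumed dimensionality rather than by anything measured.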

In my paper with Castro, we show that the original paper proves too much in a different way. We show that very similar datasets (differing only in the number of subjects, the number of experiments, or the number of molecules) or very similar analytical choices (differing only in the statistical significance criterion or discriminability thresholds used) produce vastly different estimates for $z^*$, differing over tens of orders of magnitude from the reported value. Even trivial differences produce absurd results such as 'all possible odors can be discriminated' or 'at most 1 odor can be discriminated'. The differences were trivial in the sense that equally reasonable experimental designs and analyses could and have proceeded according to these differences. But the resulting conclusions are obviously false, and therefore the analysis framework has proved too much. This kind of demonstration of 'proving too much' differs from that in Meister's paper. Whereas he showed that the analysis framework produces specific values that are known to be incorrect, we showed that it can produce any value at all under equally reasonable assumptions. For many of those assumptions, we don't know if the value it produces is correct or not; after all, the true number of discriminable odors is unknown. But if all values are equally justified, the framework proves too much.

We then showed the technical source of the error, which is a very steep dependence of the threshold $d$ on incidental features of the study design, mediated by the statistical procedure $g$, which is then amplified exponentially by a steep nonlinearity in the function $f$. I'll illustrate with a much more well-known example from gene expression studies. When identifying genes that are thought to be differentially expressed in some disease or phenotype of interest, there is always a statistical significance threshold used for selection. After correcting for multiple comparisons, some number of genes pass the threshold and are identified as candidates for involvement in the phenotype. With a liberal threshold, many candidates will be identified (e.g. 50). With a more moderate threshold, fewer candidates will be identified (e.g. 10). With a stricter threshold, still fewer candidates will be identified (e.g. 2). This sensitivity is well known in gene expression studies. We showed that the function $g$ in the original paper has a similar sensitivity.

Now suppose some researcher went a step further and said, "If there are $N$ candidate genes involved in inflammation, and each has two expression levels, then there are $2^N$ inflammation phenotypes". Then the estimate for the number of inflammation phenotypes might be: $2^{50}$ at the liberal threshold, $2^{10}$ at the moderate threshold, and $2^{2}$ at the strict threshold.

Any particular claim about the number of inflammation phenotypes from this approach would be arbitrary, incredibly sensitive to the significance threshold, and not worth considering seriously. One could obtain nearly any number of inflammation phenotypes one wanted, just by setting the significance threshold accordingly (and all of those thresholds, in different contexts, are considered reasonable in experimental science).
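The arithmetic behind this example is easy to check with a short sketch. The candidate counts (50, 10, 2) are the illustrative ones used above; the labels for the thresholds and the function name `phenotype_count` are hypothetical, introduced only for this illustration.

```python
# Illustrative candidate counts from the example: one per significance threshold.
candidates_by_threshold = {"liberal": 50, "moderate": 10, "strict": 2}


def phenotype_count(n_candidates: int, levels: int = 2) -> int:
    """Number of distinct expression patterns if each gene has `levels` states."""
    return levels ** n_candidates


estimates = {
    name: phenotype_count(n) for name, n in candidates_by_threshold.items()
}

for name, value in estimates.items():
    print(f"{name:>8} threshold: {value:,} phenotypes")
```

Under these assumptions the three estimates are 1,125,899,906,842,624, then 1,024, then 4. A 25-fold change in the number of candidates moves the final answer by more than 14 orders of magnitude.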

But this is essentially what the function $f$ does in the original paper. By analogy, $g$ is the thresholding step, $d$ is the number of candidate genes, and $z^*$ is the number of inflammation phenotypes. And while all of the possible values for $z^*$ in the *Science* paper are arbitrary, a wide range of them would have been unimpressively small, another wide range would have been comically large, and only the 'goldilocks zone' produced the impressive but just plausible value reported in the paper. This is something that I think can and does happen to all scientists. If your first set of analysis decisions gives you a really exciting result, you may forget to check whether other reasonable sets of decisions would give you similar results, or whether instead they would give you any and every result under the sun. This robustness check can prevent you from proving too much - which really means proving nothing at all.
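One simple way to run such a robustness check is to rerun the analysis under every setting you would have considered reasonable and look at how far the answers spread on a log scale. The sketch below is a generic illustration, not the procedure used in either rebuttal; the helper `spread_in_orders_of_magnitude` and the two stand-in analyses `fragile` and `robust` are hypothetical.

```python
import math
from typing import Callable, Iterable


def spread_in_orders_of_magnitude(
    analysis: Callable[[float], float], settings: Iterable[float]
) -> float:
    """Return log10(max/min) of the analysis output across the given settings."""
    results = [analysis(s) for s in settings]
    if min(results) <= 0:
        raise ValueError("outputs must be positive to compare on a log scale")
    return math.log10(max(results)) - math.log10(min(results))


# Hypothetical stand-ins: three equally reasonable settings for one analysis choice.
settings = (2, 10, 50)


def fragile(d: float) -> float:
    return 2.0 ** d  # output depends exponentially on the setting


def robust(d: float) -> float:
    return 100.0 + d  # output barely moves with the setting


fragile_spread = spread_in_orders_of_magnitude(fragile, settings)
robust_spread = spread_in_orders_of_magnitude(robust, settings)
print(f"fragile analysis: {fragile_spread:.1f} orders of magnitude")
print(f"robust analysis:  {robust_spread:.1f} orders of magnitude")
```

A spread well under one order of magnitude suggests the conclusion does not hinge on the particular choice; a spread of many orders of magnitude suggests that the reported number mostly reflects the choice itself.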