Hypothesis testing is a common statistical method for testing an assumption. Did changing the font of a web page cause people to spend more time on it? Does a particular gene contribute to a particular trait? The A/B testing widely used by technology companies — to compare customer responses to two versions of a product or service — is an example of statistical hypothesis testing.
In the online version of hypothesis testing, hypotheses are tested in sequence. Once you’ve finished an A/B test to decide on a font for your web page, for instance, you may want to follow up with an experiment comparing different background colors.
Ideally, when testing background colors, you’d like to control for all the other elements on the page — page content, size and placement of banner ads, desktop versus mobile layouts, and so on. But that would mean testing all possible permutations of all those elements for each background color — and acquiring enough data about each to yield a statistically significant result. Such fine-grained experimental controls are rarely practical.
At this year’s International Conference on Artificial Intelligence and Statistics (AISTATS), we presented an online hypothesis-testing method that factors in side information — such as those additional elements of web page design — without requiring dozens of different experiments for dozens of control groups.
Instead, we use a common idea in machine learning: a contextual vector, which captures information about the context in which the experiment takes place. We show, theoretically, that if the contextual vectors contain real information about the experimental context, their addition increases the statistical power of the hypothesis-testing method, or its ability to identify true phenomena.
At the same time, we prove that, even with the addition of contextual vectors, a variant of alpha-investing rules (the online hypothesis-testing method pioneered by Amazon senior principal scientists Dean Foster and Robert Stine) can still enforce a predetermined limit on the false-discovery rate, or the frequency with which the method accepts a false hypothesis as true.
Alpha-investing rules
In hypothesis testing, each hypothesis has a p-value, which is the probability of observing a result at least as extreme as the one actually measured if random chance alone were at work.
When a number of hypotheses are being tested at once, it’s possible to lose control of the proportion of false positives, even if the experiments are correctly designed and carried out. In this context, we need to adjust the p-values of the hypotheses based on the desired false-discovery rate and compute an adjusted p-value threshold that any hypothesis must meet to be considered a valid discovery. In an offline setting, this is commonly achieved through a classic procedure introduced by Yoav Benjamini and Yosef Hochberg in 1995.
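For concreteness, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in Python; the function name and the target false-discovery rate are illustrative choices, not code from the paper.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr_level=0.05):
    """Classic offline Benjamini-Hochberg (1995) step-up procedure.

    Returns a boolean array marking which hypotheses are rejected
    (declared discoveries) while controlling the false-discovery
    rate at `fdr_level`.
    """
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                      # sort p-values ascending
    thresholds = fdr_level * np.arange(1, m + 1) / m
    below = p[order] <= thresholds             # step-up comparison
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])         # largest rank passing the test
        rejected[order[: k + 1]] = True        # reject that p-value and all smaller ones
    return rejected

# Example: ten hypotheses, target false-discovery rate of 10%
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042,
                          0.060, 0.074, 0.205, 0.212, 0.216], 0.1))
```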
In the online setting, however, the p-values must be adjusted — and the threshold estimated — on the fly. Foster and Stine’s alpha-investing rules, which later researchers developed into generalized alpha-investing (GAI) rules, were designed to control the false-discovery rate in this context.
GAI rules begin with a budget of false discoveries (called “wealth,” in keeping with the investing analogy), which corresponds to the desired maximum false-discovery rate. Testing a new hypothesis incurs a cost, which diminishes the budget. But identifying a valid hypothesis increases the budget, enabling more liberal acceptance of subsequent hypotheses.
Once the budget of false discoveries is exhausted, testing ends. This ensures that the testing procedure never exceeds the maximum false-discovery rate.
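To illustrate the wealth mechanism, the sketch below tests a stream of p-values one at a time, spends part of the remaining wealth on each test, and earns a payout for each discovery. It is a simplified illustration, not the exact Foster-Stine rule or the GAI variant used in the paper; the spend-half-the-wealth rule and the payout value are arbitrary illustrative choices.

```python
def alpha_investing(p_values, initial_wealth=0.05, payout=0.05):
    """Simplified alpha-investing loop (illustrative only)."""
    wealth = initial_wealth
    decisions = []
    for p in p_values:                    # hypotheses arrive in sequence
        if wealth <= 0:                   # budget exhausted: stop accepting anything
            decisions.append(False)
            continue
        alpha_j = wealth / 2              # illustrative spending rule for this test
        reject = p <= alpha_j             # "reject the null" = declare a discovery
        if reject:
            wealth += payout              # a discovery replenishes the budget
        else:
            wealth -= alpha_j / (1 - alpha_j)   # a non-discovery costs wealth
        decisions.append(reject)
    return decisions

# Example: the same stream of p-values tested online
print(alpha_investing([0.001, 0.300, 0.008, 0.450, 0.039, 0.900]))
```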
The power of context
In our procedure, which we call contextual GAI, we use the context vector to adjust the false-discovery budget available for each hypothesis. Sometimes the adjustment is upward, increasing the likelihood of accepting the hypothesis; sometimes it is downward, decreasing that likelihood.
The degree of adjustment is determined by a function with a tunable parameter that depends on the results of previous hypothesis tests; that is, the function learns to make more useful adjustments as testing progresses. In our experiments, we used a neural network to learn the adjustments.
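The sketch below shows one way such a contextual adjustment could look, with a single linear layer and a sigmoid standing in for the neural network used in our experiments. The scaling scheme, the spending rule, and all parameter values are hypothetical illustrations, not the paper's actual learned rule.

```python
import numpy as np

def contextual_alpha(wealth, context, weights, bias):
    """Illustrative contextual adjustment of the per-test level.

    A learned function of the context vector scales the test level
    up or down; here a linear layer plus a sigmoid stands in for the
    neural network, and the weights would be learned from the outcomes
    of earlier hypothesis tests.
    """
    base_alpha = wealth / 2                                      # same illustrative spending rule as above
    scale = 1.0 / (1.0 + np.exp(-(context @ weights + bias)))    # sigmoid output in (0, 1)
    return base_alpha * 2 * scale                                # factor in (0, 2): up- or down-weight

# Hypothetical example: a 3-dimensional context vector
context = np.array([0.4, -1.2, 0.7])
weights = np.array([0.1, 0.3, -0.2])    # would be learned in practice
print(contextual_alpha(wealth=0.05, context=context, weights=weights, bias=0.0))
```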
In the paper, we prove that, so long as the context vector captures real information about the experimental context, this approach will always increase the statistical power of the test, relative to standard GAI. The degree of increase depends, of course, on the choice of contextual features, which is up to the designer of the experiment.
We also show that contextual GAI enforces the same limit on the false-discovery rate that classical GAI does.
Finally, we apply our method to data from public data sets on diabetes prediction and gene expression analysis and show that, in both cases, it increases the statistical power of the analysis while enforcing a cap on the false-discovery rate.
The diabetes prediction data set includes biographical information about each patient and details about medications, lab results, immunizations, allergies, and vital signs. We used only the biographical information for our context vector.
With a limit on the false-discovery rate of 0.2, our approach increased the statistical power of the hypothesis-testing procedure by about 51%.
In the genetic-analysis data set, genetic markers called single-nucleotide polymorphisms (SNPs) have already been associated with traits of interest, and the goal is to find associations between those SNPs and gene product concentrations in cells. So each SNP is tested in conjunction with each nearby gene.
For the contextual vectors, we experimented with three different sources of information about each SNP-gene pair: the distance between the SNP and the gene; the prevalence, across individuals, of the protein associated with the gene; and the evolutionary conservation of the gene (the degree to which it is shared across species), as measured by standard PhastCons scores.
With each of the three information sources, our approach increased the number of associations discovered, by 5.5%, 2.6%, and 2%, respectively.