Title: The Statistical Pitfall of Genetic Diversity: Adjustments in Case-Control Association Tests

Abstract: Genome-wide case-control association studies have been successful in identifying novel common variants involved in the pathogenesis of complex disorders. However, population stratification remains a major limitation of such studies. While methods have been developed (e.g., Genomic Control, STRUCTURE along with STRAT, EIGENSTRAT) to infer population structure and correct for stratification in tests for association, the estimation of the number of underlying subpopulations (K), which is of additional interest from an evolutionary perspective, has not been adequately addressed, except in STRUCTURE. To circumvent the problem of estimating parameters in high-dimensional spaces, STRUCTURE adopts an ad hoc approach based on Bayesian deviance that tends to overestimate K and may lead to reduced power in detecting association. We have developed a Bayesian semi-parametric approach along the lines of Bhattacharya (2008) to estimate population structure under the assumption that K is random. The model is complemented by a summarization of the clusterings generated by the MCMC, based on the “Central Clustering” approach developed by Mukhopadhyay et al. (2011). Our approach has several advantages over STRUCTURE, the most prominent being a substantial reduction in computational time. Based on extensive simulations under a set-up of no admixture and unlinked markers, we found that our method provides more accurate estimates of K than STRUCTURE and is marginally more powerful than STRAT after controlling for the overall false positive rate. We have also analyzed the Human Genome Diversity Panel data using our model and obtained very good clustering of the individuals in the panel.