In biology and animal breeding, estimating variance components such as heritability, shared-environment effects, and maternal effects with standard ANOVA/REML methods typically requires individuals of known relatedness, such as parents and offspring. That information is often unavailable, or the pedigree data are unreliable, so the methods either cannot be applied or require strict laboratory control of all breeding, which threatens the external validity of the estimates. Several authors noted that relatedness could instead be measured directly from genetic markers, and that if the individuals were reasonably closely related, only a small, economical number of markers would be needed for adequate statistical power. This led Kermit Ritland to propose in 1996 that directly measured pairwise relatedness could be compared with pairwise phenotype measurements (Ritland 1996, "A Marker-based Method for Inferences About Quantitative Inheritance in Natural Populations").[2]
As genome sequencing costs dropped steeply over the 2000s, it became possible to acquire enough markers on enough subjects to obtain reliable estimates from very distantly related individuals. An early application of the method to humans came with Visscher et al. 2006[3]/2007,[4] which used SNP markers to estimate the actual relatedness of siblings and to estimate heritability directly from the genetic data. In humans, unlike in the original animal and plant applications, relatedness is usually known with high confidence in the 'wild population', so the benefit of GCTA lies more in avoiding the assumptions of classic behavioral genetics designs, verifying their results, and partitioning heritability by SNP class and chromosome. The first use of GCTA proper in humans was published in 2010, finding that 45% of the variance in human height can be explained by the included SNPs.[5][6] (Large GWASes on height have since confirmed the estimate.[7]) The GCTA algorithm was then described and a software implementation published in 2011.[8] It has since been used to study a wide variety of biological, medical, psychiatric, and psychological traits in humans, and has inspired many variant approaches.
Twin and family studies have long been used to estimate the variance explained by particular categories of genetic and environmental causes. Across a wide variety of human traits, there is typically minimal shared-environment influence, considerable non-shared-environment influence, and a large genetic component (mostly additive), which averages ~50% and is sometimes much higher for traits such as height or intelligence.[9] However, twin and family studies have been criticized for relying on a number of assumptions that are difficult or impossible to verify, such as the equal environments assumption (that the environments of monozygotic and dizygotic twins are equally similar), that there is no misclassification of zygosity (mistaking identical for fraternal twins and vice versa), that twins are representative of the general population, and that there is no assortative mating. Violations of these assumptions can bias the parameter estimates either upwards or downwards.[10] (This debate and criticism have focused particularly on the heritability of IQ.)
Using SNP or whole-genome data from unrelated participants bypasses many of these criticisms of heritability estimates: twins are often entirely uninvolved, there are no questions of equal treatment, relatedness is estimated precisely, and the samples are drawn from a broad variety of subjects. To keep the sample unrelated, participants who are too closely related (typically with estimated relatedness >0.025, roughly the level of fourth cousins) are removed, and several principal components are included in the regression to control for population stratification.
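The relatedness filter described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the GCTA software's own code: the function names `genetic_relationship_matrix` and `prune_related` are hypothetical, the relatedness formula is the commonly used standardized-genotype form, and the greedy removal rule is one simple heuristic for enforcing the 0.025 cut-off.

```python
import numpy as np

def genetic_relationship_matrix(genotypes):
    """Relatedness matrix from an (n_individuals, n_snps) array of 0/1/2 allele counts.

    Assumption: each SNP is standardized by its sample allele frequency p,
    and relatedness is the average product of standardized genotypes.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                      # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)                      # monomorphic SNPs carry no information
    g, p = g[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))    # centre and scale each SNP
    return z @ z.T / z.shape[1]

def prune_related(grm, cutoff=0.025):
    """Return indices of individuals kept so that no pair exceeds `cutoff`.

    Greedy heuristic (an assumption, not necessarily what any package does):
    repeatedly drop the individual involved in the most over-threshold pairs.
    """
    a = np.array(grm, dtype=float)
    np.fill_diagonal(a, -np.inf)                  # ignore self-relatedness
    kept = np.arange(a.shape[0])
    while a.size and a.max() > cutoff:
        n_over = (a > cutoff).sum(axis=1)         # over-threshold pairs per individual
        worst = int(n_over.argmax())
        a = np.delete(np.delete(a, worst, axis=0), worst, axis=1)
        kept = np.delete(kept, worst)
    return kept
```

With a genotype array `G`, `prune_related(genetic_relationship_matrix(G))` would give the indices of a subset in which every remaining pair has estimated relatedness at or below 0.025.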
In addition to being more robust to violations of the twin-study assumptions, SNP data can be easier to collect because it does not require twins, who are rare; heritability can therefore also be estimated for rare traits (with due correction for ascertainment bias).
GCTA estimates can be used to help resolve the missing heritability problem and to design GWASes that will yield genome-wide statistically significant hits. This is done by comparing the GCTA estimate with the results of smaller GWASes. If a GWAS of n=10k using SNP data fails to turn up any hits, but GCTA indicates a high heritability accounted for by SNPs, this implies that a large number of variants are involved (polygenicity), and thus that much larger GWASes will be required to estimate each SNP's effect accurately and to account directly for a fraction of the GCTA heritability.
GCTA provides an unbiased estimate of the total variance in phenotype explained by all variants included in the relatedness matrix (and any variation correlated with those SNPs). This estimate can also be interpreted as the maximum prediction accuracy (R^2) that could be achieved from a linear predictor using all SNPs in the relatedness matrix. The latter interpretation is particularly relevant to the development of Polygenic Risk Scores, as it defines their maximum accuracy. GCTA estimates are sometimes misinterpreted as estimates of total (or narrow-sense, i.e. additive) heritability, but this is not a guarantee of the method. GCTA estimates are likewise sometimes misinterpreted as "lower bounds" on the narrow-sense heritability but this is also incorrect: first because GCTA estimates can be biased (including biased upwards) if the model assumptions are violated, and second because, by definition (and when model assumptions are met), GCTA can provide an unbiased estimate of the narrow-sense heritability if all causal variants are included in the relatedness matrix. The interpretation of the GCTA estimate in relation to the narrow-sense heritability thus depends on the variants used to construct the relatedness matrix.
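To make concrete the idea of relating pairwise genetic similarity to pairwise phenotypic similarity, here is a minimal sketch of Haseman-Elston regression. It is a simple moment-based estimator of the variance explained by the SNPs in a relatedness matrix, and is deliberately swapped in for illustration: it is not the REML procedure that GCTA itself uses, and it is generally less precise. The function name `he_regression_h2` and the simulation settings are hypothetical.

```python
import numpy as np

def he_regression_h2(grm, phenotype):
    """Haseman-Elston estimate of variance explained by the SNPs in `grm`.

    For standardized phenotypes the expected product y_i * y_j of a pair of
    distinct individuals is h2 * A_ij, so regressing the products on the
    off-diagonal relatedness values through the origin gives a slope
    that estimates h2.
    """
    y = np.asarray(phenotype, dtype=float)
    y = (y - y.mean()) / y.std()                  # standardize the phenotype
    upper = np.triu_indices(len(y), k=1)          # each distinct pair once
    x = np.asarray(grm, dtype=float)[upper]       # pairwise relatedness
    products = np.outer(y, y)[upper]              # pairwise phenotype products
    return float(x @ products / (x @ x))          # least-squares slope, no intercept

if __name__ == "__main__":
    # Simulated data only: every SNP has a small additive effect.
    rng = np.random.default_rng(0)
    n, m, true_h2 = 500, 1000, 0.5
    freqs = rng.uniform(0.1, 0.9, size=m)
    g = rng.binomial(2, freqs, size=(n, m)).astype(float)
    z = (g - g.mean(axis=0)) / g.std(axis=0)      # standardize by sample mean and sd
    grm = z @ z.T / m
    beta = rng.normal(0.0, np.sqrt(true_h2 / m), size=m)
    y = z @ beta + rng.normal(0.0, np.sqrt(1 - true_h2), size=n)
    print("HE-regression estimate:", round(he_regression_h2(grm, y), 3))
```

Because the estimate depends only on the variants used to build the relatedness matrix, changing which SNPs enter `grm` changes what the number means, which is the point made in the paragraph above.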
Most frequently, GCTA is run with a single relatedness matrix constructed from common SNPs and will not capture (or not fully capture) the contribution of the following factors:
GCTA makes several model assumptions and may produce biased estimates under the following conditions:
The original "GCTA" software package is the most widely used; its primary functionality covers the GREML estimation of SNP heritability, but includes other functionality:
Other implementations and variant algorithms include:
Figure 3 of Yang et al 2010, or Figure 3 of Ritland & Ritland 1996 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3232052/figure/F3/ ↩
see also Ritland 1996b, "Estimators for pairwise relatedness and individual inbreeding coefficients"; Ritland & Ritland 1996, "Inferences about quantitative inheritance based on natural population structure in the yellow monkeyflower, Mimulus guttatus"; Lynch & Ritland 1999, "Estimation of Pairwise Relatedness With Molecular Markers"; Ritland 2000, "Marker-inferred relatedness as a tool for detecting heritability in nature"; Thomas 2005, "The estimation of genetic relationships using molecular markers and their efficiency in estimating heritability in natural populations" http://genetics.forestry.ubc.ca/ritland/reprints/1996_GenetResearch_r.pdf ↩
Visscher et al 2006, "Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings" http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020041 ↩
Visscher et al 2007, "Genome partitioning of genetic variation for height from 11,214 sibling pairs" http://www.sciencedirect.com/science/article/pii/S0002929707638841 ↩
"Common SNPs explain a large proportion of heritability for human height", Yang et al 2010 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3232052/ ↩
"A Commentary on ‘Common SNPs Explain a Large Proportion of the Heritability for Human Height’ by Yang et al. (2010)", Visscher et al 2010 https://www.ncbi.nlm.nih.gov/pubmed/21142928 ↩
"Defining the role of common variation in the genomic and biological architecture of adult human height", Wood et al 2014 http://neurogenetics.qimrberghofer.edu.au/papers/Wood2014NatGenet.pdf ↩
"GCTA: A Tool for Genome-wide Complex Trait Analysis", Yang et al 2011 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014363/ ↩
"Meta-analysis of the heritability of human traits based on fifty years of twin studies", Polderman et al 2015 http://www.gwern.net/docs/genetics/2015-polderman.pdf ↩
Barnes, J. C.; Wright, John Paul; Boutwell, Brian B.; Schwartz, Joseph A.; Connolly, Eric J.; Nedelec, Joseph L.; Beaver, Kevin M. (2014-11-01). "Demonstrating the Validity of Twin Research in Criminology". Criminology. 52 (4): 588–626. doi:10.1111/1745-9125.12049. ISSN 1745-9125. https://www.researchgate.net/publication/267158254 ↩
"GCTA will eventually provide direct DNA tests of quantitative genetic results based on twin and adoption studies. One problem is that many thousands of individuals are required to provide reliable estimates. Another problem is that more SNPs are needed than even the million SNPs genotyped on current SNP microarrays because there is much DNA variation not captured by these SNPs. As a result, GCTA cannot estimate all heritability, perhaps only about half of the heritability. The first reports of GCTA analyses estimate heritability to be about half the heritability estimates from twin and adoption studies for height (Lee, Wray, Goddard, & Visscher, 2011; Yang et al., 2010; Yang, Manolio, et al., 2011), and intelligence (Davies et al., 2011)." p. 110, Behavioral Genetics, Plomin et al 2012 https://www.dropbox.com/s/1iz7o1hqb8isas2/2012-plomin-behavioralgenetics.pdf ↩
"Meta-analysis of GREML results from multiple cohorts", Yang 2015 http://gcta.freeforums.net/thread/213/analysis-greml-results-multiple-cohorts ↩
Ge, Tian; Chen, Chia-Yen; Neale, Benjamin M.; Sabuncu, Mert R.; Smoller, Jordan W. (2016). "Phenome-wide Heritability Analysis of the UK Biobank". bioRxiv 10.1101/070177. ↩
Pasaniuc & Price 2016, "Dissecting the genetics of complex traits using summary association statistics" https://www.dropbox.com/s/4mgmun29xbund7z/2016-pasaniuc.pdf ↩
Bulik-Sullivan, B. K.; Loh, P. R.; Finucane, H.; Ripke, S.; Yang, J.; Schizophrenia Working Group of the Psychiatric Genomics Consortium; Patterson, N.; Daly, M. J.; Price, A. L.; Neale, B. M. (2015). "LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies". Nature Genetics. 47 (3): 291–295. doi:10.1038/ng.3211. PMC 4495769. PMID 25642630. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4495769 ↩
"LD Hub: a centralized database and web interface to LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis", Zheng et al 2016 http://biorxiv.org/content/biorxiv/early/2016/05/03/051094.full.pdf ↩
"Contrasting the genetic architecture of 30 complex traits from summary association data", Shi et al 2016 http://biorxiv.org/content/early/2016/01/14/035907 ↩
Schweiger, Regev; Kaufman, Shachar; Laaksonen, Reijo; Kleber, Marcus E.; März, Winfried; Eskin, Eleazar; Rosset, Saharon; Halperin, Eran (2 June 2016). "Fast and Accurate Construction of Confidence Intervals for Heritability". The American Journal of Human Genetics. 98 (6): 1181–1192. doi:10.1016/j.ajhg.2016.04.016. PMC 4908190. PMID 27259052. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908190 ↩
"Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection", Gazal et al 2017 https://www.dropbox.com/s/idh2vm1dkar3qho/2017-gazal.pdf ↩
"FaST linear mixed models for genome-wide association studies", Lippert et al 2011 https://www.researchgate.net/profile/David_Heckerman/publication/51618535_FaST_linear_mixed_models_for_genome-wide_association_studies/links/5485d3a70cf268d28f00456a.pdf ↩
"Improved linear mixed models for genome-wide association studies", Listgarten et al 2012 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3597090/ ↩
"Advantages and pitfalls in the application of mixed-model association methods", Yang et al 2014 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3989144/ ↩
"A lasso multi-marker mixed model for association mapping with population structure correction", Rakitsch et al 2012 https://web.archive.org/web/20151204193223/http://bioinformatics.oxfordjournals.org/content/29/2/206.full#aff-1 ↩
"Genome-wide efficient mixed-model analysis for association studies", Zhou & Stephens 2012 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3386377/ ↩
"Variance component model to account for sample structure in genome-wide association studies", Kang et al 2010 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3092069/ ↩
"Advanced Complex Trait Analysis", Gray et al 2012 https://web.archive.org/web/20160522235202/http://bioinformatics.oxfordjournals.org/content/early/2012/09/27/bioinformatics.bts571.full.pdf ↩
"Regional Heritability Advanced Complex Trait Analysis for GPU and Traditional Parallel Architecture", Cebamanos et al 2012 https://www.semanticscholar.org/paper/Regional-heritability-advanced-complex-trait-Cebamanos-Gray/c340835e1baf4b9fcafbfb001841bbd4793f598f/pdf ↩
"Efficient Bayesian mixed model analysis increases association power in large cohorts", Loh et al 2015 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342297/ ↩
"Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis", Loh et al 2015; see also "Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis", Loh et al 2015 http://biorxiv.org/content/biorxiv/early/2015/06/05/016527.full.pdf ↩
"Mixed Models for Meta-Analysis and Sequencing", Bulik-Sullivan 2015 http://biorxiv.org/content/early/2015/05/29/020115 ↩
"Massively expedited genome-wide heritability analysis (MEGHA)", Ge et al 2015 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4345618/ ↩
Speed et al 2016, "Re-evaluation of SNP heritability in complex human traits" http://biorxiv.org/content/early/2017/01/15/074310 ↩
Evans et al 2017, "Narrow-sense heritability estimation of complex traits using identity-by-descent information." http://www.biorxiv.org/content/early/2017/07/17/164848 ↩