Methods in Statistical Genomics
In the Context of Genome-Wide Association Studies
RTI Press
© 2016 RTI International. All rights reserved. This book is protected by copyright. Credit must be provided to the author and source of the book when the content is quoted. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. RTI International is a registered trademark and a trade name of Research Triangle Institute.

The RTI Press mission is to disseminate information about RTI research, analytic tools, and technical expertise to a national and international audience. RTI Press publications are peer-reviewed by at least two independent substantive experts and one or more Press editors. RTI International is an independent, nonprofit research organization dedicated to improving the human condition by turning knowledge into practice. RTI offers innovative research and technical services to governments and businesses worldwide in the areas of health and pharmaceuticals, education and training, surveys and statistics, advanced technology, international development, economic and social policy, energy and the environment, and laboratory testing and chemistry services.
CHAPTER 1
Overview of Chapters
Philip Chester Cooley
Introduction
The objective of this book is to describe procedures for analyzing genome-wide
association studies (GWAS). Some of the material is unpublished and contains
commentary and unpublished research; other material (Chapters 4 through 7)
has been published previously. Each previously published chapter investigates
a different genomics model, but all focus on identifying the strengths and
limitations of various statistical procedures that have been applied to different
GWAS scenarios.
The distinction between genotype and phenotype was initially presented
by the Danish botanist, plant physiologist, and geneticist Wilhelm Johannsen
in a book he published in 1905, The Elements of Heredity. He distinguished
between the genotype of the organism (it is hereditary) and the ways in which
its heredity is demonstrated in phenotypes, or physical characteristics. This
distinction was an outgrowth of Johannsen's experiments concerning heritable
variation in plants.1
Today, it is understood that the process leading from genes to proteins that
ultimately establish phenotypes is complex. Most proteins are the products
of multiple genes. Whether a protein is an enzyme, receptor, or hormone,
it functions in a specific environment that includes external factors like
temperature, rainfall, the amount of sunlight available, and nutrition, as well as
internal factors that can include other hormones, enzymes, and other proteins.
Further, biochemical pathways are not always linear; they can have multiple
positive and negative feedback loops and may involve multiple steps and the
products of hundreds of genes. In summary, the evolutionary forces producing
a phenotype may often involve many genes and can be influenced by a variety
of specific environmental factors.
polymorphism (SNP) arrays. If one type of the variant (one allele, i.e., the wild-
type allele) is more frequent in people with the disease, the SNP is said to be
associated with the disease. The associated SNPs are then considered to mark a
region of the human genome that influences the risk of the phenotype. Also, in
contrast to methods which specifically test one or a few genetic regions, GWAS
investigate the entire genome. The approach is therefore said to be noncandidate-driven,
in contrast to gene-specific candidate-driven studies. GWAS identify tag
SNPs, which are defined as representative SNPs in a region of the genome with
high linkage disequilibrium and other variants in DNA associated with a disease.
Tag SNPs in isolation cannot specify which genes cause the phenotype.
The first successful GWAS investigated age-related macular degeneration
and was published in 2005.3 This study found two SNPs that had significantly
altered allele frequency when compared with healthy controls. As of 2015, the
Catalog of Published Genome-Wide Association Studies contained more than
2,141 catalog entries, 1,856 publications and 12,874 implicated SNPs.4 Prior to
the introduction of GWAS, the major method of investigation was via genetic
linkage studies in families. This approach was useful for identifying single-gene
disorders, many of which appear in the comprehensive compendium of human
genes and genetic phenotypes, the Online Mendelian Inheritance in Man
(OMIM) database.5
However, for both common and complex diseases, the results of genetic
linkage studies have been hard to reproduce.6,7 In contrast, GWAS seek
to identify whether the allele of a genetic variant is found more often than
expected in individuals with the phenotype of interest. The statistical methods
used in GWAS are based on traditional approaches, and early calculations of
statistical power indicated that GWAS could be better than linkage studies at
detecting weak genetic effects.8
In addition to a simple conceptual framework, the proliferation of GWAS
has also been driven by improvements in sequencing methods, reduced
computational costs, and the advent of biobanks, which are repositories of
human genetic material that greatly reduce the cost and difficulty of collecting
sufficient numbers of biological specimens for study.9,10 The development of
rapid genome-level sequencing techniques also permits researchers to assess
methods to mine this information to identify genetic associations with disease
and ultimately determine the biological basis of disease patterns. Knowing the
coding sequences of every nucleotide in an organism has permitted researchers
to study the collective influence of all genes simultaneously and their role in
structuring organism traits, including specific diseases.
(HDL-C), one of the most important risk factors for coronary heart disease,
are significantly influenced by the interplay of multiple genes linked to GWAS
and involved gene-gene interaction effects.14
Third, GWAS are based on common variants (i.e., tag SNPs) that are
frequently in linkage disequilibrium with the actual causative variant, which
in turn may be associated with larger effect sizes than the common variant
included in the GWAS. For instance, fine mapping of loci associated with
low-density lipoprotein cholesterol (LDL-C) identified a rare nonsynonymous
variant gene that explained 5 times more of the contributed variance than the
initial GWAS finding. In this context, whole genome sequence data has the
potential to be a more accurate and powerful tool than SNPs to elucidate the
relationship between genetics and (common) diseases. However, even high-
resolution genetic variation will only explain a fraction of the heritability of
human diseases and traits. Thus, we are still searching for potential uses for
genetics in medical science beyond using simple genetics with gene-gene and
gene-environment interactions and identifying epigenetic effects as important
but complex targets.
Missing Heritability
Heritability is a genetic measure that identifies the observable differences in a
trait due to genetic factors between individuals within a population. Factors
including genetics, environment and random chance can all contribute to the
variation between individuals in their phenotypes.15 Heritability is a dynamic
measurement that identifies the fraction of phenotype variability that can be
attributed to genetic variation. The term missing heritability refers to the
low percentage of information about the overall genetic component and risk
of common diseases gleaned from GWAS. Common variants account for only
a small proportion of genetic components, and the missing heritability lies in
the huge class of rare genetic variants that GWAS do not see. Variants that are
primary drivers of disease are relatively rare in the human population.
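The definition of heritability given above, the fraction of phenotype variability attributable to genetic variation, is often written as a ratio of variances. The symbols below are not from the original text, and the decomposition assumes that genetic and environmental contributions are independent and do not interact:

```latex
H^2 = \frac{\sigma_G^2}{\sigma_P^2}, \qquad \sigma_P^2 = \sigma_G^2 + \sigma_E^2
```

Here \sigma_P^2 is the total phenotypic variance, \sigma_G^2 the variance due to genetic factors, and \sigma_E^2 the variance due to environment and chance. In this notation, missing heritability is the gap between \sigma_G^2 estimated from family studies and the portion of it explained by GWAS-detected variants.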
Variants confer risk of disease, and natural selection acts against variants so
that they do not become too prevalent. Therefore, the issue of so-called missing
heritability becomes moot. We did not interrogate the whole genome; we
interrogated the common variants that pass through the filter of natural selection.
Many diseases, but not all, will involve rare variants not detected by GWAS.16
Autoimmune diseases may be an exception to this thinking. Some variants
that are major risk factors appear more commonly in the general population;
some possible genes linking ALS to genetic causes. By and large, no consistent
markers have been universally accepted as ALS genetic markers.
We obtained the SNP data for the initial ALS study appearing in the
literature21 and attempted to replicate some of the published results in
order to test statistical methods in genomics.21 We identified seven distinct
methods that have been or could be applied in GWAS. We also used
the method that Schymick et al. used to obtain their results. At the time we
performed these analyses, we were unaware of any comprehensive studies that
compared the performance of the different methods in the context of GWAS,
and we wanted to determine whether standards could be developed. Using
a previously conducted study would allow us to assess the performance of
specific statistical methods by using the study results as a yardstick by which to
measure the accuracy of our results.
What we learned from this effort is that many of the algorithms
used in the literature in a GWAS context either assume an additive gene model or
are agnostic with respect to the form of inheritance. We also documented
the inability of the ALS studies to replicate results. Part of the problem with
reproducing results is that ALS studies rely on having ALS patients as a study
population, which means that the original sample sizes were relatively small
due to circumstance. We were able to replicate the Schymick et al. study
results using the method they chose to measure associations: the classic
epidemiology case-control method that uses the Pearson χ² test to test how
likely it is that an observed distribution of data fits with the distribution that is
expected if the variables are independent. However, our assessment indicated
that the reported results depended to an unknown degree on the statistical
method used to make the predictions. This suggested that the algorithms
selected could influence predictive outcome and further demonstrated the
need for developing GWAS standards.
In summary, the absence of both methodological standards and a process
for evaluating the statistical methods used in predicting associations between
genes and phenotypes suggested to us that we could examine these missing
elements more effectively by using simulation methods. Accordingly, we
created simulated data that were linked to known outcomes which therefore
constituted a truth set. The simulated data could be analyzed using different
statistical methods, and we could assess each method's predictive properties,
which could potentially reveal some or all of this missing information.
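The idea of a simulated truth set can be illustrated with a minimal sketch. It is not the simulation used in the studies described here; the function names, the Hardy-Weinberg assumption, and the penetrance model (a baseline risk multiplied by a relative risk that depends on the assumed mode of inheritance) are all hypothetical choices made for illustration.

```python
import random


def genotype_risks(moi, relative_risk):
    """Hypothetical relative risks by minor-allele count (0, 1, 2)."""
    if moi == "dominant":
        return (1.0, relative_risk, relative_risk)
    if moi == "recessive":
        return (1.0, 1.0, relative_risk)
    if moi == "additive":
        # Assumption: heterozygote risk lies midway between the homozygotes.
        return (1.0, (1.0 + relative_risk) / 2.0, relative_risk)
    raise ValueError(f"unknown mode of inheritance: {moi}")


def simulate_locus(maf, moi, relative_risk, baseline_risk,
                   n_cases, n_controls, rng):
    """Return (case_counts, control_counts), each ordered by minor-allele count."""
    # Population genotype frequencies under Hardy-Weinberg equilibrium.
    freqs = ((1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2)
    penetrance = [baseline_risk * r for r in genotype_risks(moi, relative_risk)]
    if max(penetrance) > 1.0:
        raise ValueError("baseline_risk * relative_risk must not exceed 1")
    # Bayes' rule: P(genotype | status) is proportional to
    # P(status | genotype) * P(genotype).
    case_weights = [f * p for f, p in zip(freqs, penetrance)]
    control_weights = [f * (1 - p) for f, p in zip(freqs, penetrance)]

    def draw(weights, n):
        counts = [0, 0, 0]
        for g in rng.choices(range(3), weights=weights, k=n):
            counts[g] += 1
        return tuple(counts)

    return draw(case_weights, n_cases), draw(control_weights, n_controls)


# A causal recessive locus with known properties: any test applied to these
# counts can be scored against the known answer.
rng = random.Random(2016)
cases, controls = simulate_locus(0.3, "recessive", 2.0, 0.05, 1000, 1000, rng)
```

Repeating the draw many times and recording how often a given test rejects the null hypothesis gives an empirical estimate of that test's statistical power for the chosen gene model.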
Our simulation studies confirmed the results of others23 that the gene
model or MOI was a major influence on statistical power. This was no surprise
and it is well known that associations involving recessive MOI SNPs are much
more difficult to detect than other MOI types. Gene traits that influence
prediction accuracy had also been reported in other studies.24,25 They
demonstrated how the phenotype MOI assumption was a major influence on
association prediction accuracy.
We compared the power profiles of GWAS using a number of statistical
methods, including two that combine MOI-specific methods into multiple
test measures. Because most GWAS investigations have not determined the
specific gene model operating, many assume an additive model. Given that
most models cited in OMIM are either dominant or recessive gene models, we
investigated composite methods that did not make an additive assumption;
rather, they used three component tests with three distinct MOI properties
(dominant, recessive, and additive). We then compared the performance
(from a statistical power perspective) of the composite tests as contrasted with
methods that either assume a specific MOI gene model or are agnostic with
respect to MOI. By combining recessive, additive, and dominant individual
tests, we determined that if the MOI is not known, then a composite test
is more likely to make a correct association prediction. In that sense, it
constitutes a more powerful test and could have significant advantages with
respect to single test procedures.
Our findings did not provide a specific answer about which statistical
method is best. The best method depends on the MOI gene model associated
with the phenotype (diagnosis) in question and how common the traits are
that associate with the phenotype. However, our results do indicate that the
common assumption that the MOI of the locus associated with the
diagnosis is additive can have adverse consequences. They indicate that researchers should
consider a multitest procedure that combines the results of individual MOI-based
core tests as a statistical method for conducting the initial screen in a
GWAS. The process for combining the core tests into a single operational test
can occur in a number of ways. We identify two: the Bonferroni procedure
and the MAX procedure, each of which produces very similar statistical power
profiles.26,27 This was a surprising result because the Bonferroni procedure
was based on combining χ² tests and assumed that the tests were mutually
independent when they clearly are not, whereas the MAX procedure used
normal distribution tests and adjusted for the covariance properties between
individual tests.
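A composite of three MOI-specific core tests can be sketched as follows. This is an illustration under stated assumptions, not the procedure of references 26 and 27: it uses the Cochran-Armitage trend test with conventional genotype scores as the core test and a plain Bonferroni correction (the smallest of the three p-values multiplied by three). The MAX procedure, which adjusts for the covariance between the core tests, is not shown. All names are hypothetical.

```python
import math

# Genotype scores by minor-allele count (0, 1, 2) for each assumed MOI.
MOI_SCORES = {
    "additive": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
}


def trend_test(cases, controls, scores):
    """Cochran-Armitage trend test; returns (z, two-sided p-value)."""
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * t for x, t in zip(scores, totals))
    sum_x2n = sum(x * x * t for x, t in zip(scores, totals))
    numerator = n * sum_xr - n_cases * sum_xn
    variance = (n_cases * n_controls / n) * (n * sum_x2n - sum_xn ** 2)
    if variance <= 0:
        return 0.0, 1.0  # no variation in the scored genotype
    z = numerator / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    return z, math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni_composite(cases, controls):
    """Return (best-fitting MOI, Bonferroni-adjusted p-value)."""
    p_values = {moi: trend_test(cases, controls, scores)[1]
                for moi, scores in MOI_SCORES.items()}
    best = min(p_values, key=p_values.get)
    return best, min(1.0, len(p_values) * p_values[best])


# Genotype counts ordered by minor-allele count (invented for illustration).
best_moi, p_composite = bonferroni_composite((10, 20, 30), (30, 20, 10))
```

Because the three core tests are computed from the same genotype counts, they are positively correlated, so the Bonferroni adjustment is conservative. That makes the similar power profiles reported above for the two procedures notable.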
In summary, the focus of this chapter is on single-gene statistical methods
that predict associations in a GWAS context. We used simulation methods to
learn that there is no single, most powerful method. If the properties of the
gene model are not known, the most powerful approach is a composite test
that uses a recessive-dominant-additive composite model. We also found that
regardless of whether the MOI is known or not, there always exists a method
that outperforms (in a statistical power context) the Pearson χ² test.
Chapter 9: Conclusions
Our final chapter discusses statistical power properties in the GWAS context
and offers guidance on selecting best methods. Our early concerns with the
absence of standards led us to develop a synthetic gene database that recorded
known outcomes between synthetic phenotypes and genotype networks, which
provides a mechanism that was not possible using real genomics data. The
creation of this database enabled an evaluation of different statistical models
and methods specifically because the prediction outcomes were known and
statistical power profiles could be estimated. We also investigated methods
that examine combinations of genes acting in concert either with other
genes or environmental factors. Our assessments indicated that single-gene
models will often fail to identify markers in many types of gene-gene and
gene-environment networks.
Chapter References
1. Johannsen WL. Arvelighedslærens elementer (the elements of heredity).
Copenhagen, Denmark: Gyldendalske Boghandel Nordisk Forlag; 1905.
2. Alberts B, Johnson A, Lewis J, et al. Molecular biology of the cell. 4th ed.
New York, NY: Garland Science; 2002.
3. Klein RJ, Zeiss C, Chew EY, et al. Complement factor H polymorphism in
age-related macular degeneration. Science. 2005;308(5720):385-9.
4. Burdett T, Hall P, Hasting E, et al. The NHGRI-EBI Catalog of published
genome-wide association studies. 2015 [cited 2015 Nov 2]; Available from:
www.ebi.ac.uk/gwas
5. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins
University School of Medicine. Online Mendelian Inheritance in Man
(OMIM). 2016 [cited 2016 Feb 11]; Available from: http://www.ncbi.nlm.nih.gov/omim
6. Altmüller J, Palmer LJ, Fischer G, et al. Genomewide scans of complex
human diseases: true linkage is hard to find. Am J Hum Genet.
2001;69(5):936-50.
7. Strachan T, Read A. Human molecular genetics. New York, NY: Garland
Science; 2010.
8. Risch N, Merikangas K. The future of genetic studies of complex human
diseases. Science. 1996;273(5281):1516-7.
9. Ardini MA, Pan H, Qin Y, et al. Sample and data sharing: observations
from a central data repository. Clin Biochem. 2014;47(4-5):252-7.
10. Greely HT. The uneasy ethical and legal underpinnings of large-scale
genomic biobanks. Annu Rev Genomics Hum Genet. 2007;8:343-64.
Introduction
In this chapter we assess the predictive strength of a number of classical
statistical methods by applying them to a publicly available set of
amyotrophic lateral sclerosis (ALS) data reported in a paper by Schymick
and colleagues (2007).1 We used these methods in the context of a single-locus
genome-wide association study (GWAS) experiment. The methods
are compared and the degree of similarity/dissimilarity between them is
empirically measured to determine if a combination of methods is more
predictive of phenotype-genotype associations than a single method. All of the
methods in our assessment are single nucleotide polymorphism (SNP) based.
There are three types of ALS: classic sporadic, familial, and the Mariana
Island forms. Classic ALS accounts for 90 to 95 percent of ALS patients in the
United States and is called sporadic because it cannot be traced to ancestors
with the illness.2 The literature identifies variations and mutations in many
Mendelian loci and genes that potentially cause different forms and subtypes
of ALS, indicating that many complex and diverse molecular mechanisms are
involved in ALS pathogenesis. These genes include SOD1, ALS2, SETX, and
VAPB for familial ALS, and VEGF, ANG, HFE, SMN, and PON1 for sporadic
ALS. Research has also reported that the inheritance pattern varies with the
type of ALS, including autosomal dominant, autosomal recessive, X-linked
dominant and maternal through mitochondrial genes.3-5
Despite these complexities, ALS researchers have intensified their
investigations of sporadic ALS. Researchers are applying the GWAS approach,
looking for genes that increase susceptibility to sporadic ALS. By 2008, when
the authors' investigations began, seven teams had reported results from ALS-based
GWAS,6-12 but none had highlighted genes already under suspicion,
18 Chapter 2
and each team had reported a different set of ALS markers. Subsequently, eight
additional studies13-20 have been reported in the literature. These results have
led to conjectures that using the GWAS method to search for ALS genes may
have to accommodate a spectrum of genes, each of which contributes to ALS
in some unknown manner. The failure of the different studies to replicate each
other's results also suggests that GWAS may be inconsistently implemented, or
population admixtures are obscuring the findings.
The first GWAS for ALS found no SNPs significantly associated with the
disease.1 However, published studies that followed implicated several different
ALS genetic markers. For example, Dunckley and colleagues6 identified an
SNP located in the FLJ10986 gene. This study consisted of 1,152 patients
diagnosed with sporadic ALS and 1,297 controls. The initial discovery was
made on analysis done on 386 cases and 547 controls, all of whom were of
European descent and older than 65. A residual sample of 766 cases and 750
controls, as well as a subsample of the data identified in the Schymick study,
were used to replicate the discovery analysis.6 A third ALS study reported that
the inositol 1,4,5-trisphosphate receptor 2 (ITPR2) marker was associated
with ALS. This study pooled three European populations with 1,337 ALS
patients and 1,356 controls.7 A fourth study (by the same team as the third
study) identified an SNP in the dipeptidyl-peptidase 6 (DPP6) gene that was
strongly associated with ALS susceptibility.8 However, the ITPR2 marker was
no longer significant. A fifth study examined an Irish population, which was
augmented with a Dutch and a US population consisting of 958 ALS cases
and 932 controls, and confirmed an association with the DPP6 marker.21
However, a sixth study sought to confirm the DPP6 marker finding by
examining the Irish ALS cohort data and augmenting it with a Polish cohort.
Cronin and colleagues21 reported that their analysis of the combined cohorts
that consisted of 1,267 cases and 1,336 controls was unable to identify any
associations, including the previously reported DPP6 marker.21 A seventh
study, performed by Chiò et al.,9 used a two-stage analysis that consisted
of 553 cases and 2,338 controls: it identified two new markers, but not the markers
mentioned previously, including DPP6.
Our initial assessment concluded that clear and definitive disease
associations from these seven significant ALS studies were lacking and
suggested difficulties in using GWAS approaches for complex diseases like
ALS. Consequently, we made the decision at that time to pursue simulation-
based studies with known outcomes in an effort to explore the possibility of
developing standard procedures for conducting GWAS.
Genome-Wide Association Data: Where Are the Standards? 19
Methods
Our study assessed the statistical methods that have appeared in the GWAS
literature and included a number of established methods used by the cited
studies. We applied each of these methods to the ALS data of Schymick and
colleagues1 and provide a method of comparing their relative performance.
with 2 degrees of freedom to test the hypothesis that the cases and controls are
from the same distribution. Under the null hypothesis of no association with
disease, we expect the relative genotype frequencies to be the same in cases and
controls. These types of methods are commonly used in the GWAS context. A
case-control study uses the odds ratio to estimate the relative risk and assumes
that the disease under study has a low incidence. When the risk ratio is the
parameter of interest, the assumption of rarity is needed for the odds ratio to
be a consistent estimator.31,32 This method was used in the original Schymick
study.
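The genotypic test described above, which the surrounding text identifies as the Pearson χ² test on the 2×3 table of case-control status by genotype, can be sketched in a few lines. This is a minimal illustration, not the code used in the study. The function name is hypothetical, and the p-value uses the fact that the χ² survival function with 2 degrees of freedom is exp(-x/2), which assumes all three genotype columns are non-empty.

```python
import math


def pearson_chi2_2x3(cases, controls):
    """Pearson chi-square test of independence on a 2x3 genotype table.

    cases, controls: genotype counts ordered by minor-allele count (0, 1, 2).
    Returns (statistic, p_value) for 2 degrees of freedom.
    """
    table = [list(cases), list(controls)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(3):
            # Expected count if status and genotype are independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                statistic += (table[i][j] - expected) ** 2 / expected
    # With 2 degrees of freedom, P(X > x) = exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Invented counts for illustration only.
statistic, p_value = pearson_chi2_2x3((10, 20, 30), (30, 20, 10))
```

Under the null hypothesis the relative genotype frequencies are the same in cases and controls, so every expected count equals its observed count on average and the statistic stays small.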
Normal Approximation to Fisher's Exact Test: Dominant and Recessive Models. The
null hypothesis behind Fisher's test is that the rows (phenotype) and columns
(genotypes) are unrelated. The test calculates an exact probability value for the
relationship between three dichotomous variables, as found in a 2×3 table.
When N (the number of subjects) is large, the exact form of the Fisher test is
difficult to calculate. Therefore, a normal approximation is used. Because the
test estimates the probability of a given genotype using the marginal values and
assumes the probability is from a normal curve, an autosomal recessive and
dominant version of the test is easily implemented. We used both forms in our
assessment, listed as Fis-D and Fis-R in Table 2.1.
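The text does not give the formula behind Fis-D and Fis-R, so the following is only one plausible large-sample stand-in: the genotypes are collapsed to a 2×2 table according to the assumed model, and a pooled two-proportion z-test is applied. The function name and the collapsing rules are assumptions made for illustration.

```python
import math


def collapsed_z_test(cases, controls, model):
    """Two-proportion z-test after collapsing genotypes by model.

    cases, controls: genotype counts ordered by minor-allele count (0, 1, 2).
    model: "dominant" (one or two minor alleles count as exposed) or
           "recessive" (only two minor alleles count as exposed).
    Returns (z, two-sided p-value).
    """
    if model == "dominant":
        x_cases, x_controls = cases[1] + cases[2], controls[1] + controls[2]
    elif model == "recessive":
        x_cases, x_controls = cases[2], controls[2]
    else:
        raise ValueError(f"unknown model: {model}")
    n_cases, n_controls = sum(cases), sum(controls)
    # Pooled exposure proportion estimated from the margins.
    pooled = (x_cases + x_controls) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    if se == 0:
        return 0.0, 1.0  # everyone or no one is exposed
    z = (x_cases / n_cases - x_controls / n_controls) / se
    return z, math.erfc(abs(z) / math.sqrt(2.0))


# Invented counts for illustration only.
z_dom, p_dom = collapsed_z_test((10, 20, 30), (30, 20, 10), "dominant")
z_rec, p_rec = collapsed_z_test((10, 20, 30), (30, 20, 10), "recessive")
```

The only difference between the two versions is which genotypes are treated as exposed, which is why both are easy to implement once the marginal counts are available.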
Logistic Regression Linear and Categorical Model Tests. The logistic regression
test (Log-A) assumes an additive mode of phenotype inheritance and regresses
case-control outcomes on the number of minor alleles, which serves as the independent
variable. In the Log-A test, the null hypothesis β₁ = 0 is used to test whether the
number of alleles is associated with the phenotype variable.33
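The additive logistic test can be sketched without any statistics library by fitting logit P(case) = β₀ + β₁ × (number of minor alleles) with Newton-Raphson on the grouped genotype counts and then applying a Wald test to β₁. This is an illustrative sketch, not the study's implementation. The function names are hypothetical, and the code assumes at least two genotype groups are populated and that the data are not perfectly separated.

```python
import math


def _score_and_information(cases, controls, b0, b1):
    """Gradient and Fisher information of the grouped logistic likelihood."""
    g0 = g1 = i00 = i01 = i11 = 0.0
    for g in range(3):  # g = number of minor alleles
        n = cases[g] + controls[g]
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))
        resid = cases[g] - n * p
        w = n * p * (1.0 - p)
        g0 += resid
        g1 += resid * g
        i00 += w
        i01 += w * g
        i11 += w * g * g
    return (g0, g1), (i00, i01, i11)


def logistic_additive_test(cases, controls, max_iter=50, tol=1e-10):
    """Return (beta1, Wald z, two-sided p-value) for H0: beta1 = 0."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        (g0, g1), (i00, i01, i11) = _score_and_information(cases, controls, b0, b1)
        det = i00 * i11 - i01 * i01
        # Newton step: inverse information matrix times the gradient.
        step0 = (i11 * g0 - i01 * g1) / det
        step1 = (i00 * g1 - i01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    _, (i00, i01, i11) = _score_and_information(cases, controls, b0, b1)
    # Variance of beta1 is the matching diagonal entry of the inverse information.
    se_b1 = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    z = b1 / se_b1
    return b1, z, math.erfc(abs(z) / math.sqrt(2.0))


# Invented counts for illustration only.
beta1, z, p_value = logistic_additive_test((10, 20, 30), (30, 20, 10))
```

In this parameterization exp(β₁) is the odds ratio per additional minor allele, which is what makes the model additive on the log-odds scale.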
Table 2.1. Correlation values for eight statistical tests based on unadjusted p-values
Test name   Fis-D     Pea       Tr-A      All       Fis-R     Tr-D      Log-A     Tr-R
Fis-D       1.0       -0.0472   -0.0544   -0.0535   0.2215    -0.0947   -0.0551   -0.0041
Pea         -0.0472   1.0       0.4861    0.4832    -0.0651   0.4818    0.4874    0.4793
Tr-A        -0.0544   0.4861    1.0       0.9843    -0.0105   0.5294    0.9973    0.0821
All         -0.0535   0.4832    0.9843    1.0       -0.0104   0.5269    0.9834    0.0811
Fis-R       0.2215    -0.0651   -0.0105   -0.0104   1.0       -0.0026   -0.0107   -0.1677
Tr-D        -0.0947   0.4818    0.5294    0.5269    -0.0026   1.0       0.5299    0.0020
Log-A       -0.0551   0.4874    0.9973    0.9834    -0.0107   0.5299    1.0       0.0822
Tr-R        -0.0041   0.4793    0.0821    0.0811    -0.1677   0.0020    0.0822    1.0
Fis-D = Fisher dominant test; Pea = Pearson χ² test; Tr-A = trend additive test; All = allelic test; Fis-R = Fisher
recessive test; Tr-D = trend dominant test; Log-A = logistic linear test; Tr-R = trend recessive test.
Results
A fundamental question this study seeks to address is which method
investigators should use to assess candidate associations. No one has yet answered
this question. One obvious partial answer is that it depends on the gene
behavior, as well as a number of other biological factors yet to be determined.
For example, if we assume an additive gene model, then the likelihood of
establishing the association between genotype and phenotype will be nearly
the same whichever of the three additive-based tests one uses.
Test Performance
In an attempt to address the question above, we created a 2×3 table of counts
of case-control subjects by genotype for each locus. We then added
the marginal values and provided the necessary information to apply the
eligibility criteria to eliminate loci with low minor genotype representation.
The data provide information on 555,352 loci, of which 538,234 are autosomal
loci. Applying a standard quality control method described by Zeggini et al.30
to the autosomal loci identifies 55,304 loci (10 percent) that were screened as
ineligible and not considered in the analysis due to low minor genotype
representation. Table 2.1 presents the 8×8 matrix of Pearson product-moment
correlation coefficients for all possible pairs of the eight statistical tests. Note
that the statistical tests differ in the MOI assumption and therefore should
provide different results. However, Table 2.1 indicates that although some of
the MOI-based tests (i.e., the additive tests) produce consistent results, the
dominant-based tests (Tr-D and Fis-D: correlation coefficient = -0.0947) and
the recessive-based tests (Tr-R and Fis-R: correlation coefficient = -0.1677) are
not correlated. The Pearson test is MOI agnostic and would be expected to
overlap with the other tests. Although its correlation with each of the trend,
allelic, and logistic tests is about 0.48, it has only a small
negative correlation with both of the Fisher tests. Thus, Table 2.1 illustrates that
all of the tests historically used in GWAS incorporate different assumptions
and consequently have differences in the way they measure association. Based
on results presented here, we assert that if the MOI characteristics of the
associated loci are unknown, then researchers should consider multiple tests
treated in a nonhierarchical manner.
In addition, Table 2.1 shows a strong correlation among the three tests (Tr-A,
Log-A, and All) that assume additive MOI properties. This suggests that no
additional predictive power for measuring associations is derived from using
more than one of these three tests. The table also shows a weak correlation between
the two tests (Tr-R and Fis-R) that assume recessive MOI properties and a weak negative
correlation between the Fisher dominant MOI test (Fis-D) and all of the other
tests except the Fisher recessive MOI test (Fis-R), which implies that these tests
measure a dimension that is different from all of the other tests.
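A correlation matrix like Table 2.1 can be produced by collecting, for each test, the vector of unadjusted p-values over all eligible loci and computing the Pearson product-moment correlation for every pair of tests. The sketch below shows the computation on a toy input. The function names and the three short p-value vectors are invented for illustration and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Assumes neither sequence is constant (otherwise the ratio is undefined).
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(p_values_by_test):
    """Map each ordered pair of test names to the correlation of their p-values."""
    names = list(p_values_by_test)
    return {(a, b): pearson_r(p_values_by_test[a], p_values_by_test[b])
            for a in names for b in names}


# Toy p-values for three hypothetical tests over four loci.
toy = {
    "Test-1": [0.10, 0.20, 0.40, 0.80],
    "Test-2": [0.05, 0.10, 0.20, 0.40],
    "Test-3": [0.90, 0.30, 0.60, 0.20],
}
matrix = correlation_matrix(toy)
```

Two tests whose p-values move together across loci, as the three additive tests do in Table 2.1, add little information to each other, whereas weakly correlated tests are ranking the loci by different criteria.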
Marker Assessment
This section compares our results directly to those of Schymick and colleagues1
and indirectly to other results reported in the literature.
Schymick et al. (2007). The Schymick and colleagues study1 reports using
six association tests: the genotypic test, two versions of the trend test (the
dominant and additive tests), a recessive model test, an allele-based test, and
a three-marker haplotype-association test. We included their five locus-based
tests in the tests we ran.
We applied the Pearson test, the recessive version of the trend test, and the
additive version of the logistic regression test to the same data. A summary
of these results is presented in Table 2.2 below. It compares the top 34 SNPs
reported by Schymick and colleagues1 with our results. We found that all 34
were positive at the 10⁻⁴ level of significance based on the Pearson test but that
none were significant at the 10⁻⁷ level. Also, 13 of the SNPs were positive at the
Table 2.2. Comparison between the results of our study and the Schymick et al. study
SNP ID Chrom.$ Location Gene Pea Tr-R Log-A
rs4363506 10q26.13 129164493 Intergenic <.000001 <.005 <.000001
rs16984239 2p24 18097927 Intergenic <.00001 X <.00001
rs12680546 8q24.2 136940921 Intergenic <.0001 X <.001
rs6013382 20q13.2 50136040 ZFP64 <.00001 X <.01
rs2782931 9q31.3 113890011 SUSD1 <.00001 <.0001 X
rs11099864 4q31.3 154112804 KIAA1727 <.00001 <.0005 X
rs332389 3p14.1 66493904 SLC25A26 <.0001 <.01 X
rs4964213 12q23.3 106274907 BTBD11 <.0001 <.0001 <.0001
rs10765118 10q26.13 129175173 Intergenic <.0001 <.00001 <.0001
rs3733242 4q21.1 77894529 SHROOM3 <.0001 <.00001 <.0001
rs1037666 1q43 238425108 FMN2 <.0001 X <.001
rs1436918 15q14 32724213 LOC390569 <.0001 <.0001 X
rs4552942 8q24.2 136943505 Intergenic <.0001 X <.001
rs852801 1p32.2 58094497 DAB1 <.0001 <.00001 <.001
rs852802 1p32.2 58096531 DAB1 <.0001 <.00001 <.001
rs7250467 19q12 33261241 LOC727771 <.0001 <.0001 <.0001
rs10830099 10q26 129174355 Intergenic <.0001 <.0001 <.0001
rs10459680 15q26 91482474 Intergenic <.0001 X <.001
rs1752784 9q22.32 96217647 HIATL1 <.0001 <.00001 <.01
rs1202824 1p32.2 58121593 DAB1 <.0001 X <.01
rs5014235 5q14.1 77245417 Intergenic <.0001 <.01 <.0001
rs7201419 16q23.3 81887480 CDH13 <.0001 <.0001 <.01
rs11933187 4q31.3 175446507 KIAA1717 <.0001 <.0001 <.001
rs10773543 12q24.32 127489679 TMEM132C <.0001 X <.001
rs7976059 12q13 50537539 Intergenic <.0001 <.0001 <.001
rs9608416 22q12.1 24441018 ADRBK2 <.0001 <.01 X
rs2220999 12q12 40422035 Intergenic <.0001 <.01 <.0001
rs12632457 3p24 27995556 Intergenic <.0001 <.01 <.0001
rs2272519 2p24 18575231 Intergenic <.0001 <.05 X
rs2289599 5q14.1 77243905 Intergenic <.0001 <.05 <.0001
rs4478530 8p12 31517686 Intergenic <.0001 <.05 X
rs130110 22q13.32 47470399 FAM19A5 <.0001 X <.0001
rs9510982 13q12 23451861 Intergenic <.0001 <.0001 X
rs2767584 6p21 156964439 Intergenic <.0001 X X
Total < 10⁻⁴ 34 13 12
Fis-D = Fisher dominant test; Pea = Pearson χ² test; Tr-A = trend additive test; All = allelic test; Fis-R = Fisher
recessive test; Tr-D = trend dominant test; Log-A = logistic linear test; Tr-R = trend recessive.
$Chromosome location of single nucleotide polymorphism (SNP).
10⁻⁴ criterion according to the Trend-R test and 12 were positive according to
the Logistic-A test.
Thus, 12 of 34 association tests at the 10⁻⁴ level overlap between the MOI-agnostic
Pearson test and both the recessive Tr-R test and the Log-A test. We
can therefore replicate the Schymick et al. results but only if we use the Pearson
test, which is commonly used for GWAS. The Trend-A test is equally popular.
However, if another MOI-specific test had been used instead, we could not
have replicated the results.
Other ALS GWAS. The year we first investigated the Schymick et al.1 ALS
GWAS study, there were seven other GWAS also published on ALS. These
studies were Dunckley et al.6; van Es et al.7,8; Blauw et al.10; Cronin et al.11,12;
and Chiò et al.9 The focus of the Blauw et al.10 study was unique. It investigated
copy number variations, and the reported results were negative. We examined
the top markers reported in the remaining six studies and summarize that
information in Table 2.3. The DPP6 marker first identified by van Es et al.8
is worth mentioning because it was also reported in one of the Cronin et al.
studies.11
Table 2.3. Association test results of SNPs identified as significant in other studies
Study               SNP ID       Chr   Gene         p-value reported   p-value this study
Dunckley6           rs6700125    1     FLJ10986     1.8 × 10⁻⁵         3.1 × 10⁻³
Dunckley6           rs6690993    1     FLJ10986     2.0 × 10⁻⁴         NA
van Es7             rs2306677    12    ITPR2        7.0 × 10⁻⁴         X
van Es,8 Cronin11   rs10260404   7     DPP6         5.04 × 10⁻⁸        7.4 × 10⁻⁴
Chiò9               rs2708909    7     SUNC1        6.98 × 10⁻⁷        4.5 × 10⁻³
Chiò9               rs2708851    7     Intergenic   1.16 × 10⁻⁶        9.5 × 10⁻³
NA = SNP not included on chip; X = not significant at any level; SNP = single nucleotide polymorphism.
The results summarized in Table 2.3 identify a p-value < 10⁻³ for one of
the markers reported in other studies; however, the five others are less
significant. At this threshold, we are unable to replicate any other reported
results and therefore must hypothesize that the collective ALS studies have
identified no predisposing biological markers. We also performed a linkage
disequilibrium (LD) analysis of all SNPs in Table 2.3 to identify any other SNP
close to the index SNP in column 2 (i.e., within 200 kb of the index SNP) with high
LD. We were unable to identify any.
Conclusions
We used publicly available data that contained 276 cases and 271 controls.
This sample is very underpowered, with many SNPs containing few disease
alleles. The SNP coverage may also have been insufficiently dense. Although
the SNPs on the chip are tag SNPs that were selected to represent all the
genes in a comprehensive manner, the selection may not have included all the
potential markers for a specific disease. Other explanations are that the sample
of individuals was too heterogeneous, resulting in study populations with
distinct disease penetrance traits; the strength of the association was too
weak to be detected; errors in the data perturbed the measurements; or the
phenotype definition differed across studies. Novel statistical methods cannot
overcome the problems inherent in poor-quality data containing too few
subjects or data that use a poorly defined phenotype (i.e., an ALS diagnosis).
We found that different statistical tests varied significantly in estimating
the test association measurement, implying that GWAS results depend on the
chosen statistical method. We also found that there is no compelling standard
that establishes which statistical methods investigators should use in the
context of GWAS with unknown MOI properties. Although GWAS have
used and reported findings for a number of different tests, most tests have
unique properties and consequently can prescribe a different candidate set of
phenotype-SNP associations. In general, the p-value threshold is Bonferroni
corrected to a very small value, which encourages high type II error rates.38
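To make the size of a Bonferroni-corrected threshold concrete, the following minimal Python sketch divides a family-wise error level by the number of tests. The function name and the figure of 500,000 SNPs are hypothetical values chosen for illustration, not numbers taken from the studies discussed here.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# Hypothetical example: family-wise alpha of 0.05 spread over 500,000 SNP tests.
threshold = bonferroni_threshold(0.05, 500_000)
print(f"{threshold:.1e}")
```

A threshold this small means that only very strong associations are declared significant, which is why the correction tends to raise the type II error rate in small samples.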
GWAS hold substantial promise, but for many phenotypes, the path
forward is complicated for many reasons not yet understood. The replicability
of results in some but not all studies suggests that researchers should present
any and all conclusions with caution. Identifying associations solely on the
basis of extreme p-values is likely to be misleading because an extreme p-value
alone does not identify the underlying biological mechanism that produces the
association. Furthermore, the rate of missing genotype measures suggests that
the genotype data contain an unknown number of errors, and phenotype data are
also subject to error (e.g., in the case of ALS, obtaining an accurate diagnosis
is notoriously difficult). Accurate error
rates are important components for assessing statistical power properties, and
their absence will lead to underpowered GWAS.
Chapter References
1. Schymick JC, Scholz SW, Fung HC, et al. Genome-wide genotyping in
amyotrophic lateral sclerosis and neurologically normal controls: first
stage analysis and public release of data. Lancet Neurol. 2007;6(4):322-8.
2. MedicineNet.com. Amyotrophic lateral sclerosis (ALS or Lou Gehrig's
Disease). 2015 Nov 23 [cited 2015 Nov 23]; Available from:
http://www.medicinenet.com/amyotrophic_lateral_sclerosis/article.htm
3. Orrell RW. Understanding the causes of amyotrophic lateral sclerosis. N
Engl J Med. 2007;357(8):822-3.
4. Hardiman O, Greenway M. The complex genetics of amyotrophic lateral
sclerosis. Lancet Neurol. 2007;6(4):291-2.
5. Pasinelli P, Brown RH. Molecular biology of amyotrophic lateral sclerosis:
insights from genetics. Nat Rev Neurosci. 2006;7(9):710-23.
6. Dunckley T, Huentelman MJ, Craig DW, et al. Whole-genome analysis of
sporadic amyotrophic lateral sclerosis. N Engl J Med. 2007;357(8):775-88.
7. van Es MA, Van Vught PW, Blauw HM, et al. ITPR2 as a susceptibility
gene in sporadic amyotrophic lateral sclerosis: a genome-wide association
study. Lancet Neurol. 2007;6(10):869-77.
8. van Es MA, van Vught PW, Blauw HM, et al. Genetic variation in DPP6 is
associated with susceptibility to amyotrophic lateral sclerosis. Nat Genet.
2008;40(1):29-31.
9. Chio A, Schymick JC, Restagno G, et al. A two-stage genome-wide
association study of sporadic amyotrophic lateral sclerosis. Hum Mol
Genet. 2009;18(8):1524-32.
10. Blauw HM, Veldink JH, van Es MA, et al. Copy-number variation in
sporadic amyotrophic lateral sclerosis: a genome-wide screen. Lancet
Neurol. 2008;7(4):319-26.
11. Cronin S, Berger S, Ding J, et al. A genome-wide association study of
sporadic ALS in a homogenous Irish population. Hum Mol Genet.
2008;17(5):768-74.
12. Cronin S, Greenway MJ, Prehn JH, et al. Paraoxonase promoter and
intronic variants modify risk of sporadic amyotrophic lateral sclerosis.
J Neurol Neurosurg Psychiatry. 2007;78(9):984-6.
36. Mackenzie IR, Bigio EH, Ince PG, et al. Pathological TDP-43 distinguishes
sporadic amyotrophic lateral sclerosis from amyotrophic lateral sclerosis
with SOD1 mutations. Ann Neurol. 2007;61(5):427-34.
37. Slowik A, Tomik B, Wolkow PP, et al. Paraoxonase gene polymorphisms
and sporadic ALS. Neurology. 2006;67(5):766-70.
38. Balding DJ. A tutorial on statistical methods for population association
studies. Nat Rev Genet. 2006;7(10):781-91.
CHAPTER 3
Overview
We used simulation methods to compensate for the absence of both
methodological standards and a process for evaluating the statistical methods used in
predicting associations between genes and phenotypes. This required creating a
synthetic gene database of simulated data linked to known association outcomes
constituting a truth set. We analyzed the simulated data
using candidate statistical methods to assess each method's
predictive properties in the context of an experimental setting that we create.
Our method for generating the synthetic marker data is based on Mendelian
concepts of inheritance and epidemiological concepts of relative risk (RR),
which is the ratio of the probability of an event occurring in an exposed group
to the probability of the event occurring in a comparison, nonexposed group.
For example, nonsmokers who inhale secondhand smoke may be more likely to
develop lung cancer than nonsmokers who have not been exposed. Individuals
also have genetic inheritance elements that include autosomal dominant and
autosomal recessive patterns conforming to single-gene inheritance effects. We
also incorporate additive and multiplicative inheritance patterns to represent
the actions of multifactorial inheritance processes.
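As a small illustration of the relative risk definition above, the following Python sketch computes RR from event counts in an exposed and a nonexposed group. The function name and the counts are hypothetical and serve only to show the arithmetic.

```python
def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Probability of the event in the exposed group divided by the
    probability of the event in the nonexposed comparison group."""
    p_exposed = exposed_events / exposed_total
    p_unexposed = unexposed_events / unexposed_total
    return p_exposed / p_unexposed

# Hypothetical counts: 26 events among 1,000 exposed individuals
# and 20 events among 1,000 nonexposed individuals.
rr = relative_risk(26, 1000, 20, 1000)
print(round(rr, 2))  # 1.3
```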
We used a study by Iles to represent the contrast between a formal
disease diagnosis that stems from genetic causes and the concept of disease
penetrance.1 Penetrance in genetics is the proportion of individuals carrying
a particular variation of a gene (allele or genotype) that also express an
associated trait. We designate a as the risk allele, and A as the allele without
risk. Generating the synthetic gene data set was facilitated by defining the
relationships between penetrance and relative risk for different MOI categories.
This chapter describes a generic process to generate genotype-phenotype
data that we use in Chapters 4 through 8. All of the chapters use a version of
the data generation process that is derived from this generic process, but there
can be differences in detail that depend on the technical content of the chapter.
x = aa × P, (3.2)
where aa = one of the assigned risk factors and P is one of the assigned
penetrance factors.
y = aA × P, (3.4)
where aA = one of the assigned risk factors and P is one of the assigned
penetrance factors.
Using the estimates of x and y, assign a case or control at random
using the four different MOI models in conjunction with equations
(3.2) and (3.4) and Table 3.2. We assigned cases in proportion to x (y)
and controls in proportion to 1 − x (1 − y) for the minor homozygote
(heterozygote) genotypes, respectively. For the MOI models that assume
an elevated risk for both the minor homozygote and the heterozygote
genotypes, we would expect a higher proportion of cases, and the
association should be more easily identified by the
statistical procedures. The specification of risk depends on specific and
unknown disease mechanisms. A relative risk of 1.7 is considered strong
and is associated with positive replication,4 and a risk of 1.3 is considered
by Ziegler et al.5 to be a realistic assumption for complex diseases. In
summary, individuals are assigned as either cases or controls according to
the probabilities given in Table 3.2.
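The random assignment step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the genotype coding (0 = AA, 1 = aA, 2 = aa), the use of the penetrance P alone as the case probability for the major homozygote, and the parameter values are all hypothetical, and the probabilities actually used are those given in Table 3.2.

```python
import random

def assign_status(genotype, penetrance, rr_hetero, rr_minor, rng):
    """Return 1 (case) or 0 (control) for one simulated individual.

    genotype: 0 = AA, 1 = aA, 2 = aa (hypothetical coding).
    """
    if genotype == 2:
        prob_case = rr_minor * penetrance    # x, as in equation (3.2)
    elif genotype == 1:
        prob_case = rr_hetero * penetrance   # y, as in equation (3.4)
    else:
        prob_case = penetrance               # assumption: baseline for AA
    # Cases are drawn with probability prob_case, controls with 1 - prob_case.
    return 1 if rng.random() < min(prob_case, 1.0) else 0

rng = random.Random(2016)
# Hypothetical recessive-style setting: only the aa genotype carries elevated risk.
statuses = [assign_status(2, 0.4, 1.0, 1.7, rng) for _ in range(10000)]
case_fraction = sum(statuses) / len(statuses)
```

With a penetrance of 0.4 and a risk factor of 1.7, the fraction of cases among aa individuals should settle near 0.68 as the number of simulated individuals grows.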
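[Figure: flowchart of the data generation loop. Loop components: P, MOI, N1, N1, Perr, Gerr, and replicates. Within the loop, genotypes are altered according to the error rate (0 to 1, 1 to 2, 2 to 0); the process ends when all loops are completed.]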
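The genotype alteration step in the flowchart (0 to 1, 1 to 2, 2 to 0) can be sketched in Python as below. The function name and the 1 percent error rate are hypothetical, and the sketch assumes that each genotype is altered independently with probability equal to the error rate.

```python
import random

def alter_genotypes(genotypes, error_rate, rng):
    """With probability error_rate, shift a genotype code cyclically
    (0 -> 1, 1 -> 2, 2 -> 0); otherwise leave it unchanged."""
    return [(g + 1) % 3 if rng.random() < error_rate else g for g in genotypes]

rng = random.Random(1)
original = [0, 1, 2] * 10000          # 30,000 hypothetical genotype codes
observed = alter_genotypes(original, 0.01, rng)
changed = sum(o != g for o, g in zip(observed, original))
```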
Computational Requirements
Note that the number of unique entity combinations being simulated (NS) as
described by the data generation process (see Table 3.1) is:
N_S = N_P × N_X × N_Y × N_GR × N_ER. (3.5)
Chapter References
1. Iles MM. Effect of mode of inheritance when calculating the power of a
transmission/disequilibrium test study. Hum Hered. 2002;53(3):153-7.
2. Schymick JC, Scholz SW, Fung HC, et al. Genome-wide genotyping in
amyotrophic lateral sclerosis and neurologically normal controls: first stage
analysis and public release of data. Lancet Neurol. 2007;6(4):322-8.
3. Chan EK, Hawken R, Reverter A. The combined effect of SNP-marker and
phenotype attributes in genome-wide association studies. Anim Genet.
2009;40(2):149-56.
4. Sladek R, Rocheleau G, Rung J, et al. A genome-wide association study
identifies novel risk loci for type 2 diabetes. Nature. 2007;445(7130):881-5.
5. Ziegler A, Konig IR, Thompson JR. Biostatistical aspects of genome-wide
association studies. Biom J. 2008;50(1):8-28.
CHAPTER 4
Overview
Choosing a particular statistical method for a study significantly affects
the power profiles of genome-wide association study (GWAS) predictions.
Previous simulation studies of a single synthetic phenotype marker
determined that the gene model or mode of inheritance (MOI) was a major
influence on power. In this chapter, we compare the power profiles of GWAS
statistical methods that combine MOI-specific methods into multiple-test
procedures against individual methods that may or may not assume an
MOI gene model consistent with the marker that predicts the association.
The recessive, additive, and dominant individual tests are combined
using either the Bonferroni correction method or the MAX test,2 and the
combined procedures are compared with single-test GWAS-based methods.
If the gene model behind
the associated phenotype is not known, a multiple-test procedure could have
significant advantages compared to single-test procedures.
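One common way to combine several single tests with a Bonferroni correction is to multiply the smallest of the individual p-values by the number of tests. The Python sketch below illustrates that idea only; the chapter does not spell out its exact implementation, so the function name, the adjustment rule, and the p-values are assumptions made for illustration. The MAX test instead takes the largest of the three test statistics and needs the joint distribution of the correlated tests, which is not shown here.

```python
def bonferroni_combined_p(p_values):
    """Bonferroni-adjusted minimum p-value across individual tests,
    capped at 1."""
    k = len(p_values)
    return min(1.0, k * min(p_values))

# Hypothetical p-values from recessive, additive, and dominant tests on one SNP.
p_combined = bonferroni_combined_p([2e-9, 4e-6, 3e-4])
```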
We found that the best statistical method for a study depends on the MOI
gene model associated with the phenotype (diagnosis) in question. Our results
also indicate that a common assumption that the MOI of the locus associated
with the diagnosis is additive will have adverse prediction consequences if the
assumption is incorrect.
Chapter 4 is based on a study that was published in the Journal of Proteomics & Bioinformatics.1
Copyright: 2010 Cooley P, et al. This is an open-access article distributed under the terms of
the Creative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source are credited. Chapter 3 of
the current publication describes the generation of the synthetic gene database. This section was
removed from Chapter 4. The analysis of the gene data has not changed.
Introduction
In this chapter, we examine the statistical methods used to perform GWAS.
GWAS usually apply univariate statistical tests to each gene marker or single
nucleotide polymorphism (SNP) as an initial step. This SNP-based test is
statistically straightforward, and the testing is done with standard methods
(e.g., χ² tests, regression) that have been studied outside of the GWAS context.
A paper by Kuo & Feingold3 described the most commonly used methods, and
the authors note the use of a compound procedure that combines two or more
statistical tests.
The literature also contains a number of papers that compare statistical
power among subsets of these methods.4,5 However, the question of which
method is best suited to univariate scanning in a GWAS remains an open issue.
The choice of method depends on the match between the true genetic model
underpinning the association and the type of model assumed by the method.
To investigate further, we used a multiple test procedure that combined the
most promising of the methods identified in the literature and applied them
to a set of synthetic marker data with known properties (the Schymick et al.
data set introduced in Chapter 2).6 Our goal was to identify marker properties
that could be linked to optimal methods (with reference to statistical power)
for predicting associations in GWAS. We know from prior studies that the
statistical procedure a researcher chooses influences GWAS prediction
accuracy and that there are specific properties of the underlying markers that
determine the optimization of the procedure choice.3
We also included the important properties that influence the association
prediction accuracy into our synthetic marker data via a Monte Carlo
simulation process, and we link the properties to the influencing marker to
study their individual and collective contributions to association prediction.
A synthetic marker data set allows us to assess the performance of different
statistical methods in a GWAS context. We applied a number of statistical
methods to the simulated data and used their statistical power profiles to
Genetic Inheritance and Genome-Wide Association Statistical Test Performance 39
Methods
We examined the accuracy of association detection by generating synthetic
data with properties that are known to influence statistical power. We used
a Monte Carlo process to generate the data from a set of random variables
described in Chapter 3. The main purpose of the synthetic data from Schymick
et al.6 is to act as a truth set to assess the performance of statistical
methods commonly used in a GWAS context. Please see Chapter 3 for the
description of how the data were generated.
The simulated data set that was generated had the following characteristics:
The proportion of cases (controls) that are major homozygotes = 50.3 (63.0)
percent.
The proportion of cases (controls) that are heterozygotes = 39.2 (31.3)
percent.
The proportion of cases (controls) that are minor homozygotes = 10.5 (5.7)
percent.
With MOI distribution:
recessive = 25 percent,
dominant = 25 percent,
additive = 25 percent, and
multiplicative = 25 percent.
We acknowledge that this distribution of MOI traits does not represent
how inheritance traits are distributed in humans. The Online Mendelian
Inheritance in Man (OMIM)7 provides the best source of information on the
MOI distribution (Table 4.1). However, OMIM is disproportionally populated
by genes linked to single Mendelian disorders. Therefore, genes associated with
multifactorial disorders are under-represented in OMIM. Because polygene
influences are assumed to be a major source of additive and multiplicative
SNP behavior, the distribution in Table 4.1 is likely biased. Accordingly, we
populated SNPs in our data with equal MOI representation and acknowledge
that it does not represent the true distribution.
The three optimal MOI-specific methods are the three variations of the
Cochran-Armitage (CA) trend test described in Zheng and Gastwirth.8 We also included a fourth,
commonly used individual method, the 2-degrees-of-freedom (2df) genotype
association test. Using the notation in Table 4.2 to define the 2 × 3 table of
case-control counts stratified by genotype, a one-tailed test statistic (T²(x)) for
the three variations of the CA trend methods is defined as:
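$$T^2(x) = \frac{n \left[ \sum_{i=0}^{2} x_i \left( s\, r_i - r\, s_i \right) \right]^2}{r\, s \left[ n \sum_{i=0}^{2} x_i^2 n_i - \left( \sum_{i=0}^{2} x_i n_i \right)^2 \right]}. \tag{4.1}$$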
The values represented in equations (4.1) and (4.2) are shown in Table 4.2,
and the value of x_i defines the specific test: x_0 = 0, x_2 = 1, and x_1 = 0 (recessive),
0.5 (additive), or 1 (dominant).
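$$N(x) = \frac{n^{1/2} \sum_{i=0}^{2} x_i \left( s\, r_i - r\, s_i \right)}{\left\{ r\, s \left[ n \sum_{i=0}^{2} x_i^2 n_i - \left( \sum_{i=0}^{2} x_i n_i \right)^2 \right] \right\}^{1/2}}. \tag{4.2}$$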
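The trend statistic can be sketched in Python as follows. This is an illustrative implementation of the standard Cochran-Armitage formula, assuming that r_i and s_i are the case and control counts for genotype i, that n_i = r_i + s_i, and that r, s, and n are the corresponding totals; the function name and the counts are hypothetical. The p-value uses the chi-square distribution with 1 degree of freedom.

```python
import math

def ca_trend_test(cases, controls, scores):
    """Cochran-Armitage trend statistic T^2(x) and its 1-df p-value.

    cases, controls: counts (r_0, r_1, r_2) and (s_0, s_1, s_2) by genotype.
    scores: (x_0, x_1, x_2), e.g. (0, 0.5, 1) for the additive version.
    """
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [ri + si for ri, si in zip(cases, controls)]  # n_i
    # Numerator sum: x_i * (s * r_i - r * s_i)
    u = sum(x * (s * ri - r * si) for x, ri, si in zip(scores, cases, controls))
    # Denominator bracket: n * sum(x_i^2 n_i) - (sum(x_i n_i))^2
    var_term = (n * sum(x * x * ni for x, ni in zip(scores, totals))
                - sum(x * ni for x, ni in zip(scores, totals)) ** 2)
    t2 = n * u ** 2 / (r * s * var_term)
    p_value = math.erfc(math.sqrt(t2 / 2.0))  # upper tail of chi-square(1)
    return t2, p_value

# Score vectors for the three versions of the test.
SCORES = {"CA-R": (0, 0, 1), "CA-A": (0, 0.5, 1), "CA-D": (0, 1, 1)}

# Hypothetical counts (major homozygote, heterozygote, minor homozygote).
cases, controls = (503, 392, 105), (630, 313, 57)
results = {name: ca_trend_test(cases, controls, x) for name, x in SCORES.items()}
for name, (t2, p) in results.items():
    print(name, round(t2, 2), f"{p:.2e}")
```

Two properties are useful as sanity checks: the statistic does not change if the scores are rescaled (for example, (0, 1, 2) instead of (0, 0.5, 1)), and with the dominant scores it equals the Pearson χ² statistic of the 2 × 2 table that pools the heterozygote and minor homozygote genotypes.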
Table 4.3. Power results, by statistical method and number of cases: additive mode of
inheritance data
N CA-A BON MAX CA-D CA-R 2df-G
100 2.26* 1.73 1.91** 0.62 0.46 0.00
250 13.47* 12.22 12.25** 9.46 5.56 2.41
500 30.00* 28.12** 28.06 24.19 12.63 10.48
1,000 49.91* 48.31** 48.26 45.72 25.11 25.21
2,000 68.80* 67.40** 67.31 64.72 41.08 46.35
4,000 83.00* 81.86 81.94** 80.25 58.24 65.96
8,000 94.30* 93.69 93.72** 92.58 72.36 80.52
9,500 96.07* 95.63** 95.46 95.01 75.66 83.90
2df-G = 2-degrees-of-freedom genotype association test; BON = Bonferroni test; CA-A = autosomal additive
Cochran-Armitage test; CA-D = autosomal dominant Cochran-Armitage test; CA-R = autosomal recessive
Cochran-Armitage test; MAX = MAX combined test.
* Best power score ** Second best power score
Our results indicate that the best method in terms of statistical power is
CA-A, but that little is lost if the BON or MAX method is used instead. Also
there is little difference between the BON and MAX methods. Similarly, the
results in Tables 4.4, 4.5, and 4.6 indicate that the best method in terms of
statistical power for identifying dominant MOI loci is CA-D, and CA-R for
recessive MOI loci. For multiplicative MOI loci, the best method is CA-A.
In all four scenarios, little is lost if the MAX or BON method is used as a
replacement.
Table 4.4. Power results, by statistical method: dominant mode of inheritance gene data
N CA-A BON MAX CA-D CA-R 2df-G
100 0.05 0.07 0.40** 0.13* 0.00 0.00
250 2.09 2.94** 2.88 3.79* 0.00 0.00
500 12.89 15.01 15.02** 16.52* 0.01 1.21
1,000 32.75 34.75 34.84** 36.94* 0.19 12.79
2,000 54.80 56.33 56.55** 57.97* 2.42 32.75
4,000 71.01 73.48** 73.46 74.90* 11.24 54.50
8,000 85.15 86.95 87.18** 88.09* 26.71 71.98
9,500 88.27 90.15** 90.10 91.23* 31.55 76.39
2df-G = 2-degrees-of-freedom genotype association test; BON = Bonferroni test; CA-A = autosomal additive
Cochran-Armitage test; CA-D = autosomal dominant Cochran-Armitage test; CA-R = autosomal recessive
Cochran-Armitage test; MAX = MAX combined test.
* Best power score ** Second best power score
Table 4.5. Power results, by statistical method: recessive mode of inheritance gene data
N CA-A BON MAX CA-D CA-R 2df-G
100 0.00 0.00 0.38** 0.00 0.00* 0.00
250 0.02 0.11 0.68** 0.00 0.19* 0.00
500 0.42 1.70 1.77** 0.00 2.11* 0.01
1,000 2.71 7.88 7.89** 0.01 8.95* 1.04
2,000 9.97 19.91 19.99** 0.13 21.40* 6.50
4,000 21.84 34.29 34.39** 1.50 35.95* 18.45
8,000 34.94 49.35 49.67** 7.17 51.01* 33.14
9,500 38.04 53.07 53.11** 9.32 54.62* 36.73
2df-G = 2-degrees-of-freedom genotype association test; BON = Bonferroni test; CA-A = autosomal additive
Cochran-Armitage test; CA-D = autosomal dominant Cochran-Armitage test; CA-R = autosomal recessive
Cochran-Armitage test; MAX = MAX combined test.
* Best power score ** Second best power score
However, power is lost if we use the CA-A method when the MOI of the locus
is not additive or multiplicative. Many
researchers use an additive model as the initial GWAS pass. What if the locus in
question is recessive or dominant? Table 4.7 indicates that although the CA-A
method is the optimal choice (by as much as 2 percentage points) if the MOI of the locus
is additive or multiplicative, there is a risk of more than 2 percentage points of power loss
if the locus MOI is dominant and as much as 15 percentage points if the MOI of the
locus is recessive.
Table 4.7. Power results, by CA-A and MAX methods, for different MOI gene models
N       CA-A (D)   MAX     Loss using CA-A   CA-A (R)   MAX     Loss using CA-A
100     0.05       0.40    0.35              0.00       0.38    0.38
250     1.45       2.88    1.43              0.02       0.68    0.66
500     12.89      15.02   2.13              0.42       1.77    1.35
1,000   32.75      34.84   2.09              2.71       7.89    5.18
2,000   54.80      56.55   1.75              9.97       19.99   10.02
4,000   71.01      73.46   2.45              21.84      34.39   13.55
8,000   85.15      87.18   2.03              34.94      49.67   14.73
9,500   88.27      90.10   1.83              38.04      53.11   15.07
CA-A (D) = Cochran-Armitage method, additive model version applied to dominant mode of inheritance
(MOI) single nucleotide polymorphism (SNP) data; CA-A (R) = Cochran-Armitage method, additive model
version applied to recessive MOI SNP data; MAX = MAX combined test.
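The loss columns in Table 4.7 are simple differences between the MAX power and the CA-A power. The short Python sketch below reproduces the dominant-MOI loss column from the tabulated values; the variable names are hypothetical.

```python
# (N, CA-A power, MAX power) for dominant MOI data, copied from Table 4.7.
dominant_rows = [
    (100, 0.05, 0.40), (250, 1.45, 2.88), (500, 12.89, 15.02),
    (1000, 32.75, 34.84), (2000, 54.80, 56.55), (4000, 71.01, 73.46),
    (8000, 85.15, 87.18), (9500, 88.27, 90.10),
]

# Loss from using CA-A instead of MAX, in percentage points.
losses = [round(max_power - caa_power, 2) for _, caa_power, max_power in dominant_rows]
largest_loss = max(losses)
```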
If we knew the distribution of the MOI property, we could assess the overall
risk of using an additive method such as CA-A for GWAS. However, without
a reliable estimate, researchers should exercise caution and apply a procedure
that limits the risk of incorrectly assessing the MOI inherent to the locus-
inducing diagnosis.
Discussion
In the literature, many statistical methods that have been used to perform
GWAS assume an MOI-specific hypothesis. Our results confirm the work of
many others.9 In the context of a single-marker scenario, the best method
for predicting associations in recessive SNPs is the CA-R method; the best
method for dominant MOI SNPs is the CA-D method; and the best method
for additive and multiplicative SNPs is the CA-A method.
We also show that the 2df genotype method used in many studies (for
example, the method used by Schymick et al.6) is never optimal because there
are always other methods that provide greater statistical power. This statement
holds regardless of whether the MOI is known a priori. We also show
that in the context of a general method to use in the initial GWAS pass,
researchers may encounter adverse consequences if, for example, the MOI of
the operating locus is not consistent with the assumption employed by the
statistical method used. Therefore, 2df appears to be inappropriate to use for
GWAS under any circumstances.
Chapter References
1. Cooley P, Clark R, Folsom R, et al. Genetic inheritance and genome-wide
association statistical test performance. J Proteomics Bioinform.
2010;3(12):321-5.
2. Li Q, Zheng G, Li Z, et al. Efficient approximation of P-value of the
maximum of correlated tests, with applications to genome-wide
association studies. Ann Hum Genet. 2008;72(Pt 3):397-406.
3. Kuo CL, Feingold E. What's the best statistic for a simple test of genetic
association in a case-control study? Genet Epidemiol. 2010;34(3):246-53.
4. Sasieni PD. From genotypes to genes: doubling the sample size.
Biometrics. 1997;53(4):1253-61.
CHAPTER 5
Overview
The influence of genotype and diagnosis errors present in genome-wide
association studies (GWAS) was assessed by analyzing a synthetic gene data
set incorporating factors known to influence association measurement. Monte
Carlo methods were used to generate the synthetic gene data, which incorporated
influencing factors including gene inheritance, relative risk levels, disease
penetrance, genotype distribution, and sample size, as well as the two error factors
that are the focus of this study. The resulting data set provides a truth set for
assessing statistical method performance and association sensitivity.
Our results quantify the relationship between genotype and diagnosis
error measures and statistical power loss. These relationships are
understood qualitatively, but we document their extent. Our results also
demonstrate that for low-risk nonrecessive loci, sample sizes in the range of
1,000 to 2,000 cases will achieve 80 percent power thresholds for type I error
levels of 10⁻⁸, even with realistic genotype and phenotype error assumptions.
Nevertheless, increasing sample size is a viable method of compensating for
power loss caused by genotype and diagnosis errors. Our estimates indicate
that sample sizes should be increased by 20 to 40 percent, depending on the
gene inheritance model assumed.
Chapter 5 is based on a study that was published in the Journal of Proteomics & Bioinformatics.1
Copyright: 2011 Cooley P, et al. This is an open-access article distributed under the terms of
the Creative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source are credited. Chapter 3 of
the current publication describes the generation of the synthetic gene database. This section was
removed from Chapter 5. The analysis of the gene data has not changed.
Introduction
More than 2,306 human GWAS have examined more than 1,000 diseases
and traits and found more than 1,200 single nucleotide polymorphism (SNP)
associations.2 With improved genotyping technologies and the growing
number of available markers, case-control GWAS have become a key tool
for investigating complex diseases. Because GWAS have become a standard
primary investigative tool, researchers need to be aware of how errors
influence their studies and how to overcome or compensate for them. The
initial step in a GWAS is to apply univariate statistical tests for each SNP in
the data set. Applying the tests is statistically straightforward and uses several
standard approaches (e.g., χ² tests, regression methods).
Studies on the consequences of genotype error have led to a modest number
of investigations in the statistical genetics literature. Gordon and colleagues
investigated the effects of three published models of genotyping errors on the
2df genotype χ² test.3 In another study, Gordon and colleagues described a
statistical power calculator (PAWE-3D) that produces power and sample size
calculations that can support study designs for GWAS and compute power
and/or sample size requirements for a specified significance level.4 Zheng
and Tian, as well as Edwards and colleagues, contributed to the development
of PAWE.5,6 Gordon and colleagues further analyzed the influence of both
random phenotype and genotype misclassification errors on statistical power
contrasting the Cochran-Armitage additive test (CA-A) with the 2-degrees-of-
freedom (2df) genotype test and concluded that the CA-A is more powerful.7
Ahn and colleagues addressed the effect of different types of genotyping
errors on statistical power in GWAS.8 Although their prior work focused on
non-differential genotype error rates, this study considered errors in each
of the three bi-allelic genotypes differentially. The methods were based on
a Taylor-series expansion of a noncentrality parameter of the asymptotic
distribution of the trend test. In a follow-up study, Ahn and colleagues
extended their work by developing a closed form analytic procedure for both
the 2df genotype and the CA-A tests.9 They reported that misclassifying the
heterozygote genotype is particularly detrimental when using the Cochran-
Armitage recessive trend test (CA-R) on data from a recessive mode of
inheritance (MOI) model.
Although the accuracy of the genotyping process has improved, data
errors still occur. Hao and colleagues reported an overall 0.5 percent error
rate imputation process, but they also reported a 2 percent error rate in
The Influence of Errors Inherent in Genome-Wide Association Studies 51
error rates) and phenotype (diagnosis) errors produce substantial power losses
for all MOIs, with significant power losses for recessive MOIs. Because GWAS
involving recessive loci have additional power requirements relative to other
MOI types, researchers need to address these requirements in developing
appropriate sample sizes for their studies.
Methods
Our approach is based entirely on a simulation framework. Chapter 3
describes in detail the data generation method that was used to produce our
synthetic data and how we used these data to assess the influence of errors on
statistical power loss.
We developed our assessments by analyzing a data set of synthetic gene data
that incorporates factors that we know influence association measurements in
GWAS. These include phenotypic errors (i.e., those caused by improper disease
diagnosis) and genotype errors (e.g., those caused by incorrect genotype
calls). We employed Monte Carlo methods to generate simulated gene data
that we analyzed to assess the influence of the individual factors on statistical
power in the context of GWAS. There are two advantages to using simulated
data. First, the association-affecting factors are isolated and can be linked to
the affecting locus. Second, we can choose any specific statistical method to
perform the association assessment. The simulated data set provides a truth
set for assessing the role of statistical methods on association sensitivity and
highlights the particular role of errors in disease diagnosis and incorrect
genotype assignments.
Results
Data Set Summary
Using the methods described in Chapter 3, we generated a simulated
synthetic gene data set with the following characteristics:
The proportion of cases (controls) that are major homozygotes = 50.3 (63.0)
percent.
The proportion of cases (controls) that are heterozygotes = 39.2 (31.3) percent.
The proportion of cases (controls) that are minor homozygotes = 10.5 (5.7)
percent.
With MOI distribution:
recessive = 25 percent,
dominant = 25 percent,
additive = 25 percent, and
multiplicative = 25 percent.
Although this distribution of MOI traits does not represent a true
distribution, we currently know of no accurate way to obtain such a
distribution. Consequently, although we gave each of the four MOI traits equal
representation in the simulated data, we confined our examinations to within-
MOI assessments.
Factor Assessment
To check the influence of the factors (relative risk, penetrance, MOI, genotype
error rates, phenotype error rates) built into the data for each experiment, we
fitted the three distinct models to the four MOI subsets of the data: that is,
if the index i represents the ith (of 12) experiments, where the model can be
described as:
With a single exception, the signs of the estimated phenotype error term
(3) and genotype error term (4) coefficients were consistent and the
magnitudes of the estimated coefficients similar. All estimates were significant
at the 10⁻⁷ level, suggesting that the factors incorporated in the simulated data
are all significant.
Error Analyses
To simulate the association estimation process in a GWAS experiment, we
applied three variations of the Cochran-Armitage (CA) trend test to each of
the 1,000 replicates of the 3,456 possible data subsets. Each of the variations of
the CA tests used a distinct genotype score vector: [0,0,1] for recessive, [0,0.5,1]
for additive and [0,1,1] for dominant. We applied each of the tests to all of the
replicates in each of the data subsets. This process allowed us to show that the
optimal strategy for maximizing statistical power is MOI specific. This strategy
posits that the recessive version (CA-R) be used to estimate associations
involving recessive loci data, the dominant version (CA-D) for the dominant
loci data, and the additive version (CA-A) for both additive and multiplicative
loci data. This strategy is cited by others for single-gene models.18 Cooley and
colleagues provided a similar assessment and identified a multiple test strategy
that combines the three tests into an overall score that has merit if the MOI of
the causative loci is not known.19 For the assessment in this study, we assumed
that the MOI is known and selected the best statistical method to measure the
association. Consequently, our results tended to be optimistic.
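The three CA variants described above differ only in the genotype score vector. The following is a minimal sketch, assuming the standard textbook form of the trend statistic (which may differ in notation from the authors' implementation) and a 1-degree-of-freedom chi-square reference distribution. The names `SCORES` and `ca_trend_test` are illustrative and do not come from the original study.

```python
import math

# Genotype score vectors quoted in the text, indexed by the number of
# risk alleles (0, 1, 2).
SCORES = {
    "CA-R": (0.0, 0.0, 1.0),  # recessive
    "CA-A": (0.0, 0.5, 1.0),  # additive
    "CA-D": (0.0, 1.0, 1.0),  # dominant
}


def ca_trend_test(cases, controls, scores):
    """Return (chi-square statistic, p-value) for a 2 x 3 case-control table.

    cases and controls hold the counts for the three genotype classes.
    Textbook form assumed here:
        chi2 = N (N sum(x r) - R sum(x n))^2 / (R S (N sum(x^2 n) - sum(x n)^2))
    """
    n = [r + s for r, s in zip(cases, controls)]
    total_cases, total_controls = sum(cases), sum(controls)
    total = total_cases + total_controls
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * m for x, m in zip(scores, n))
    sum_x2n = sum(x * x * m for x, m in zip(scores, n))
    denom = total_cases * total_controls * (total * sum_x2n - sum_xn ** 2)
    if denom == 0:
        raise ValueError("degenerate table: the statistic is undefined")
    chi2 = total * (total * sum_xr - total_cases * sum_xn) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Illustrative counts only (not data from the study).
chi2_additive, p_additive = ca_trend_test([10, 20, 30], [30, 20, 10], SCORES["CA-A"])
```

Running the same table through all three score vectors and comparing the statistics mirrors the MOI-specific comparison described in the text.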
Error rates of 0, 2, and 5 percent are incorporated into the simulated data
for the phenotype and 0, 0.5, and 1 percent for the genotype. Our approach
combines the three risk levels (mean risk = 1.3) and the three penetrance levels
(mean penetrance = 0.4) and groups the data into a with-error stratum (mean
phenotype error = 3.5 percent and mean genotype error = 0.75 percent) and a
without-error stratum. We also stratified the analysis by MOI. Figures
5.1a through 5.4b present the four MOI-specific results. Each figure includes
a 0.75 percent genotype error curve, a 3.5 percent phenotype error curve, a
curve that includes both error sources, and a curve generated without either
source of error. Figure 5.1 presents the recessive loci analysis. The impact of a
0.75 percent average genotype error rate and a 3.5 percent average diagnosis
error rate with respect to power loss for recessive loci is nontrivial. However,
each profile exhibits distinct behavior. The effect of the phenotype error
increases with N and peaks at N = 4,000 cases, whereas the genotype error
effect is constant across all N. At the peak, we observed a power loss of
6.04 percent per 1.0 percent genotype error and a power loss of 3.03 percent
per 1.0 percent phenotype error. Note that with a significance threshold
below 10^-8, an 80 percent power target is far from
being realized even with N = 5,000 cases and controls.
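The per-unit losses quoted above can be converted back to total losses at the mean error rates. The sketch below assumes that the "per 1.0 percent" figures were obtained by dividing the observed power loss by the mean error rate (linear scaling); that assumption, and the function name `total_power_loss`, are ours and are not stated in the text.

```python
def total_power_loss(loss_per_percent_error, error_rate_percent):
    """Total power loss (percentage points), assuming linear scaling."""
    return loss_per_percent_error * error_rate_percent


# Recessive loci at the N = 4,000 peak, using the values quoted in the text.
genotype_loss = total_power_loss(6.04, 0.75)   # mean genotype error = 0.75 percent
phenotype_loss = total_power_loss(3.03, 3.5)   # mean phenotype error = 3.5 percent

print(f"genotype loss:  {genotype_loss:.2f} points")
print(f"phenotype loss: {phenotype_loss:.2f} points")
```

Under this reading, the genotype error is more damaging per unit of error, but the phenotype error produces the larger total loss because its mean rate is higher.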
Figures 5.2, 5.3, and 5.4 present the error effects for the dominant, additive,
and multiplicative loci, respectively. All three figures indicate that power
loss is nontrivial for the MOI categories they represent but that the effect is
substantially smaller than for the recessive mode.
The pattern of the power profile for dominant loci is in sharp contrast to
the recessive loci profile. The diagnosis error pattern is constant across N for
dominant loci, whereas the recessive loci show an increasing pattern. The
genotype error patterns are also different. As N increases, the power
differences decline for the dominant loci, whereas the patterns are constant
for recessive loci.
The additive and the multiplicative loci show similar error profiles. A
3.5 percent diagnosis error and a 0.75 percent genotype error have
quantitatively similar impacts, and the power loss from both declines
as N increases.
The Influence of Errors Inherent in Genome-Wide Association Studies 55
[Figure 5.1a: plot of statistical power (percent) against N (number of subjects,
500 to 5,000) for recessive loci, with four curves: Perr = 3.5%, Gerr = .75%,
both errors, and no errors.]
Gerr = genotype error level; Perr = phenotype (diagnosis) error level;
Both errors = both genotype and phenotype errors; No errors = neither error.
Figure 5.1b. Power loss: total, genotype, and diagnosis: recessive loci
[Plot against N (number of subjects, 500 to 5,000), y-axis labeled statistical
power (percent), with three curves: loss diagnosis, loss genotype, and loss total.]
Loss genotype = power difference between Gerr = .75 and no errors assumption.
Loss phenotype (diagnosis) = power difference between Perr = 3.5 and no errors assumption.
Loss total = power difference between both Perr = 3.5 & Gerr = .75 and no errors assumption.
[Figure 5.2a: plot of statistical power (percent) against N (number of subjects,
500 to 5,000) for dominant loci, with four curves: Perr = 3.5%, Gerr = .75%,
both errors, and no errors.]
Gerr = genotype error level; Perr = phenotype (diagnosis) error level;
Both errors = both genotype and phenotype errors; No errors = neither error.
Figure 5.2b. Power loss: total, genotype, and diagnosis: dominant loci
[Plot against N (number of subjects, 500 to 5,000), y-axis labeled statistical
power (percent), with three curves: loss diagnosis, loss genotype, and loss total.]
Loss genotype = power difference between Gerr = .75 and no errors assumption.
Loss phenotype (diagnosis) = power difference between Perr = 3.5 and no errors assumption.
Loss total = power difference between both Perr = 3.5 & Gerr = .75 and no errors assumption.
[Figure 5.3a: plot of statistical power (percent) against N (number of subjects,
500 to 5,000) for additive loci, with four curves: Perr = 3.5%, Gerr = .75%,
both errors, and no errors.]
Gerr = genotype error level; Perr = phenotype (diagnosis) error level;
Both errors = both genotype and phenotype errors; No errors = neither error.
Figure 5.3b. Power loss: total, genotype, and diagnosis: additive loci
[Plot against N (number of subjects, 500 to 5,000), y-axis labeled statistical
power (percent), with three curves: loss diagnosis, loss genotype, and loss total.]
Loss genotype = power difference between Gerr = .75 and no errors assumption.
Loss phenotype (diagnosis) = power difference between Perr = 3.5 and no errors assumption.
Loss total = power difference between both Perr = 3.5 & Gerr = .75 and no errors assumption.
[Figure 5.4a: plot of statistical power (percent) against N (number of subjects,
500 to 5,000) for multiplicative loci, with four curves: Perr = 3.5%,
Gerr = .75%, both errors, and no errors.]
Gerr = genotype error level; Perr = phenotype (diagnosis) error level;
Both errors = both genotype and phenotype errors; No errors = neither error.
Figure 5.4b. Power loss: total, genotype, and diagnosis: multiplicative loci
[Plot against N (number of subjects, 500 to 5,000), y-axis labeled statistical
power (percent), with three curves: loss diagnosis, loss genotype, and loss total.]
Loss genotype = power difference between Gerr = .75 and no errors assumption.
Loss phenotype (diagnosis) = power difference between Perr = 3.5 and no errors assumption.
Loss total = power difference between both Perr = 3.5 & Gerr = .75 and no errors assumption.
In summary, the four figures indicate that the relative effects of genotype
and diagnosis errors vary by MOI. For the recessive MOI, a 3.5 percent
diagnosis error has a larger impact than a 0.75 percent genotype error. This
result is reversed in the dominant MOI scenarios (Figure 5.2). The additive and
the multiplicative MOI scenarios represented in Figures 5.3 and 5.4 indicate that
a 0.75 percent genotype error is comparable in effect to a 3.5 percent diagnosis
error with respect to power loss.
These results are summarized in Table 5.1, which displays the power loss
for the smallest sample size (N = 500 cases) and the largest sample size (N =
5,000 cases). For example, row R (recessive) of Table 5.1 illustrates that the
power loss due to genotype errors is flat between N = 500 and N = 5,000, but
that the power loss due to diagnosis error increases dramatically from N = 500
to N = 5,000 and dominates the total loss profile. The pattern changes for the
other three MOIs, where the power loss from both genotype and diagnosis
errors declines from N = 500 to N = 5,000.
[Two plots with identical axes and legends: statistical power (percent) against
N (number of subjects, 500 to 5,000), with six curves for relative risks of
1.15, 1.3, and 1.45, each shown with no errors and with both errors.]
Figure 5.6 shows the same results for the recessive scenario. In this recessive
scenario, the likelihood of achieving an 80 percent power level is low and is
only possible for high-risk loci in the absence of genotype and diagnosis error
with a sample size N larger than attempted by our simulation experiments.
Summary/Discussion
We examined the influence of genotype and diagnosis errors that affect the
accuracy of association predictions in a GWAS and focused on assessing the
statistical power loss caused by these two sources of error. Our findings are
MOI specific and indicate that both sources of
error can adversely affect power levels. This outcome is more pronounced
for recessive MOI and low-risk loci, which is common knowledge. What our
study shows is that the error magnitude depends on a variety of factors in
addition to MOI, especially relative risk and sample size; our study quantifies
this magnitude and indicates the significance of this impact. This loss can be
compensated for by increasing sample sizes. Gordon and colleagues reported
that a 1 percent increase in genotype error rates requires an increase in sample
size of 2 to 8 percent, which they also noted depends on the MOI scenario.3
Our estimates are much higher than those reported by Gordon and colleagues
and are based on achieving a power threshold of 80 percent. Using the additive
model, power at N = 1,000 (assuming no genotype errors) exceeds the 80
percent threshold (80.6 percent). When a 1 percent genotype error is
introduced, power drops to 75 percent. An additional 405 cases are needed to
compensate for this loss and restore an 80.6 percent power level, which is a
40.5 percent increase in sample size. Please note that we are not suggesting
that a 1 percent error rate is standard. In fact, genotype error rates are
declining with the introduction of each new technology and currently are
likely less than 0.5 percent. Table 5.2 presents these results for all MOIs for
both genotype and diagnosis errors.
Table 5.2. Percent sample size increase to restore power caused by a 1 percent
genotype or diagnosis error
MOI    Genotype percentage    Diagnosis percentage
R      57.2                   35.9
D      40.1                   19.7
A      40.5                   20.9
M      40.2                   18.9
A = additive; D = dominant; MOI = mode of inheritance; M = multiplicative; R = recessive.
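The additive-model entry in Table 5.2 follows from the arithmetic in the preceding paragraph: 405 additional cases on a baseline of 1,000. The short sketch below reproduces that calculation; the function name is illustrative.

```python
def percent_sample_size_increase(baseline_n, additional_n):
    """Extra subjects needed, expressed as a percentage of the baseline N."""
    return 100.0 * additional_n / baseline_n


# Additive model, 1 percent genotype error: 405 extra cases on N = 1,000.
increase = percent_sample_size_increase(1000, 405)
print(f"{increase:.1f} percent")
```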
Chapter References
1. Cooley P, Clark RF, Page G. The influence of errors inherent in genome
wide association studies (GWAS) in relation to single gene models. J
Proteomics Bioinform. 2011;4:138-144.
2. Burdett T, Hall P, Hasting E, et al. The NHGRI-EBI Catalog of published
genome-wide association studies. 2015 [cited 2015 Nov 2]; Available from:
www.ebi.ac.uk/gwas
3. Gordon D, Finch SJ, Nothnagel M, et al. Power and sample size
calculations for case-control genetic association tests when errors are
present: application to single nucleotide polymorphisms. Hum Hered.
2002;54(1):22-33.
4. Gordon D, Haynes C, Blumenfeld J, et al. PAWE-3D: visualizing power
for association with error in case-control genetic studies of complex traits.
Bioinformatics. 2005;21(20):3935-7.
5. Zheng G, Tian X. The impact of diagnostic error on testing genetic
association in case-control studies. Stat Med. 2005;24(6):869-82.
6. Edwards BJ, Haynes C, Levenstien MA, et al. Power and sample size
calculations in the presence of phenotype errors for case/control genetic
association studies. BMC Genet. 2005;6:18.
7. Gordon D, Haynes C, Yang Y, et al. Linear trend tests for case-control
genetic association that incorporate random phenotype and genotype
misclassification error. Genet Epidemiol. 2007;31(8):853-70.
Overview
In general, genome-wide association studies (GWAS) apply univariate
statistical tests to each gene marker or single nucleotide polymorphism (SNP)
as an initial step. This SNP-based test is statistically straightforward, and the
core tests for assessing the associations are standard methods (e.g., chi-square
tests, regression) that have been studied outside of the GWAS context. Kuo and
Feingold describe the most commonly used statistical methods that are applied
to GWAS.2 All tests cited in the chapter are single-locus tests. If the genetic
inheritance properties are not known, we recommend combining two or more
statistical tests.3 In many cases, the SNPs associated with a disease are not
located in a region of DNA that codes for a protein. Instead, they are located
in the large noncoding regions between genes or in intron sequences, which
are edited out of mRNAs prior to translation to proteins. These regions are
presumably sequences of DNA that modify gene expression, but usually their
functions are unknown.4
The popularity of the GWAS approach belies its simplicity and obscures the
important issue of whether a single-gene model can illuminate the biosynthetic
pathways of a phenotype. In the path leading from gene to trait, factors such
as epigenetics, alternate splicing, gene expression levels, and protein-folding
Chapter 6 is based on a study that was published in the Journal of Proteomics & Bioinformatics.1
Copyright: 2012 Cooley P, et al. This is an open-access article distributed under the terms of
the Creative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source are credited. Minor text
edits were made to this chapter. The analysis of the gene data has not changed.
linked (in epistasis) or unlinked loci.11,12 These studies purport to have higher
powers to detect interaction than classical logistic regression models.
Our investigations demonstrate that for low-effect loci, single-gene models
of association fail to identify many associations because the interacting locus
masks the effect on the index locus. For the scenarios we tested, our results
also support assessments by Wu and colleagues and Ueki and colleagues that
analytical methods that assume statistical interactions between loci are more
powerful than single-locus models.
In this chapter, we will refer to markers as loci, but more broadly, they
could also be viewed as genes, SNPs, or haplotypes.
Epistasis Analysis
One way to extend the single-gene model to accommodate multiple genes
involves studying gene pairs and their epistatic relationships. Epistasis analysis
is the genetic methodology used to identify which genes act in a particular
cellular process or pathway and to establish an order-of-function map that
reflects the sequence in which those genes act. The analysis typically involves
determining for a pair of genes whether the phenotype of a double mutant
resembles that of a single mutant or whether it is a novel phenotype. Knowing
what type of pathway is being investigated can help establish the type of
relationship between the two genes.
Two types of pathways can be defined: substrate-dependent and switch-
regulatory. Substrate-dependent pathways consist of a specific series of positive
reactions, each of which involves some gene product (e.g., an enzyme) acting
on a substrate produced in the previous step in the pathway and ultimately
producing some final outcome. Switch-regulatory pathways consist of genes
encoding negative or positive regulatory factors that alternate between on
and off states depending upon upstream signaling events, thereby affecting
some downstream response. Because substrate-dependent pathways comprise
only positive factors whereas switch-regulatory pathways can comprise both
positive and negative factors, interpreting results from epistatic studies is
typically less complex for substrate-dependent pathways. Therefore, for the
sake of simplicity, this analysis focuses on substrate-dependent pathways.
A number of studies argue that interacting loci may be the norm and not
the exception. For example, Templeton and colleagues report that experience
has revealed that most complex traits depend on more than one locus.13 Their
study focuses on how often interactions among the loci play a significant role
in the mapping from genotype to phenotype, given that the phenotype is
influenced by two or more loci. They discuss a number of candidate scenarios,
including coronary artery disease, in which the ApoE gene has been shown
to affect males and females differently. Even the reported Mendelian trait
sickle-cell anemia is commonly presented as a single nucleotide trait. A study
by Gilbert-Diamond and colleagues indicates that gene-gene interactions
(epistasis) are a significant complicating factor in the search for disease
susceptibility genes.14
Objective
This chapter investigates epistatic interactions in a GWAS context using a
qualitative association model. The purpose of this exercise is to determine
the statistical methods and models that reliably predict associations between
a qualitative phenotype (specifically, a disease diagnosis, coded as case,
for a positive diagnosis, or control, for a negative diagnosis) and a pair of
interacting genes. As with our other work, we use the concept of relative risk,
the ratio of the probability of a positive diagnosis given a specific genotype and
epistatic model (EM) divided by the probability with no risk present (i.e., P).
The value of P is specified exogenously.
Methods
We employed a Monte Carlo-based simulation method to generate synthetic
data corresponding to a variety of possible epistatic models for substrate-
dependent pathways. The method takes into account factors known to
influence association measurements in GWAS, including the relative risk
of association, disease prevalence in nonrisk populations, inheritance
properties of the simulated loci, and most important, the epistatic relationship
of the simulated loci. We then analyzed the simulated gene data to assess
the influence of these individual factors on statistical power in the context
of GWAS. There were two advantages to using simulated data. First, the
association-affecting factors were isolated and could be linked to the affecting
locus. Second, we could choose any specific statistical method to perform the
association assessment.
Conducting Genome-Wide Association Studies (GWAS): Epistasis Scenarios 69
Table 6.2. Epistatic model 1 depicted in terms of risk associated with various
genotype combinations
MOI gene1            D       D       R       R
MOI gene2            D       R       D       R
gene1    gene2       Risk    Risk    Risk    Risk
AA       BB          1       1       1       1
AA       Bb          b       1       b       1
AA       bB          b       1       b       1
AA       bb          b       b       b       b
Aa       BB          a       a       1       1
Aa       Bb          a       a       b       1
Aa       bB          a       a       b       1
Aa       bb          a       a       b       b
aA       BB          a       a       1       1
aA       Bb          a       a       b       1
aA       bB          a       a       b       1
aA       bb          a       a       b       b
aa       BB          a       a       a       a
aa       Bb          a       a       a       a
aa       bB          a       a       a       a
aa       bb          a       a       a       a
MOI = mode of inheritance.
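The pattern in Table 6.2 can be read as a precedence rule: if gene1 is at risk under its MOI, the risk is a; otherwise, if gene2 is at risk under its MOI, the risk is b; otherwise the risk is 1. The sketch below encodes that reading. It is our interpretation of the table, and the function names and the treatment of a and b as numeric risk values are assumptions rather than the authors' code.

```python
def locus_at_risk(genotype, moi, risk_allele):
    """True if a two-character genotype confers risk under the given MOI."""
    count = genotype.count(risk_allele)
    if moi == "D":          # dominant: one copy of the risk allele suffices
        return count >= 1
    if moi == "R":          # recessive: two copies are required
        return count == 2
    raise ValueError("moi must be 'D' or 'R'")


def em1_risk(gene1, gene2, moi1, moi2, a, b):
    """Risk for a genotype pair under epistatic model 1, as read from Table 6.2.

    gene1 uses alleles A/a and gene2 uses alleles B/b; the lowercase allele
    is taken to be the risk allele. a and b are the risk values shown in
    the table for gene1 and gene2.
    """
    if locus_at_risk(gene1, moi1, "a"):
        return a
    if locus_at_risk(gene2, moi2, "b"):
        return b
    return 1.0


# Example row: gene1 = Aa, gene2 = Bb, with illustrative risks a = 1.10, b = 1.20.
row = [em1_risk("Aa", "Bb", m1, m2, 1.10, 1.20)
       for m1, m2 in (("D", "D"), ("D", "R"), ("R", "D"), ("R", "R"))]
```

For the example row this yields a, a, b, 1 across the four MOI columns, matching the Aa/Bb line of the table.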
x = (relative risk) × P. (6.2)
5. Using the estimate of x from equation (6.2), assign a case (0) or control
(1) designation at random. Note that using the 12 different EM/MOI
combinations outlined in Table 6.2 for EMs 1-3, cases should be linked to
both genetic loci, and this association should be identifiable via appropriate
statistical procedures. Disease risk depends on specific and unknown
disease mechanisms. A relative risk of 1.7 is considered strong and is
associated with positive replication.19 A risk of 1.3, however, is considered to
be a realistic assumption for complex diseases,20 and many instances
of risk < 1.1 are reported in the literature. We limited our focus to a relative
risk range of 1.10 to 1.25 and were particularly interested in cases with low
relative risk. Note that implicit in equation (6.2) is a definition of prevalence
as the proportion of cases that are present where no genetic risk is assumed.
6. Continue the process until n1 cases and n2 controls have been generated
(note that in this example n1 = n2, but the procedure can be tailored to
specific n1/n2 targets).
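Steps 5 and 6 can be sketched as a rejection-style loop: draw a subject, compute x as the genotype's relative risk multiplied by P, label the subject a case with probability x, and stop once both quotas are filled. The code below is a minimal sketch under stated assumptions. The genotype sampler `dominant_single_locus`, its allele frequency of 0.3, the risk of 1.25, and P = 0.1 are hypothetical placeholders for the earlier steps, which are not reproduced here, and cases and controls are returned as separate lists rather than coded 0 and 1.

```python
import random


def simulate_case_control(n_cases, n_controls, sample_risk, p, seed=0):
    """Generate genotypes until n_cases cases and n_controls controls exist.

    sample_risk(rng) must return (genotype, relative_risk) for one subject.
    p is the probability of a positive diagnosis with no genetic risk.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        genotype, risk = sample_risk(rng)
        x = risk * p  # assumed reading of equation (6.2)
        if not 0.0 <= x <= 1.0:
            raise ValueError("risk * p must be a probability")
        if rng.random() < x:
            if len(cases) < n_cases:
                cases.append(genotype)
        elif len(controls) < n_controls:
            controls.append(genotype)
        # Subjects drawn after a quota is full are discarded.
    return cases, controls


def dominant_single_locus(rng, risk_allele_freq=0.3, risk=1.25):
    """Hypothetical sampler: one dominant locus with risk allele 'a'."""
    genotype = "".join(
        "a" if rng.random() < risk_allele_freq else "A" for _ in range(2)
    )
    return genotype, (risk if "a" in genotype else 1.0)


cases, controls = simulate_case_control(200, 200, dominant_single_locus, p=0.1)
```

Replacing the sampler with one that draws a genotype pair and looks up its risk in Table 6.2 would extend the same loop to the two-locus epistatic scenarios.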
Statistical Models
Using the assumptions presented in Table 6.2, we generated 1,000 replicates
of genotypic and phenotypic data for each MOI pair for EM 1 using different
sample sizes and risks. We then investigated the power of different statistical
models to detect genotype-phenotype associations. We analyzed models that
test each gene independently for association with the phenotype and models
that test pairs of genes with and without interaction terms for association.
Single-Gene Methods: Cochran-Armitage Trend Test. The Cochran-Armitage
(CA) trend test is often used as a genotype-based test for case-control genetic
association studies, as described by Purcell and colleagues.21 More generally,
it is used in categorical data analysis to detect the presence of an association
between a variable with two categories (e.g., a diagnosis) and a variable with k
categories (e.g., a genotype). The CA trend test modifies the chi-square test to
incorporate a suspected ordering in the effects of the k categories of the second
variable. For example, one could order the number of mutated alleles as zero,
one, and two and conjecture that the allele effect will not become smaller as
the dose increases.
As described by Zheng and Gastwirth, the CA trend test has three flavors:
dominant (CA-D), recessive (CA-R), and additive (CA-A).22 Using the
notation in Table 6.3 below to define the 2 × 3 table of case-control counts
stratified by genotype, a test statistic (T2(x)) for the three variations of the CA
trend methods can be defined as:
The variables ri, si and ni in equation (6.3) are defined in Table 6.3. The
variable xi represents the specific test, namely x0 = 0, x2 = 1 and x1 = .5.
\[
T_{IH} = \frac{(I_{GH})^{2}}{V(I_{GH})},
\]
where
\[
(I_{GH})^{2} = [M_1 - M_2]^{2},
\qquad
V(I_{GH}) = V_1 + V_2,
\]
\[
V_1 = \frac{1}{4 n_A}\left(\frac{1}{P_{11}^{A}} + \frac{1}{P_{12}^{A}} + \frac{1}{P_{21}^{A}} + \frac{1}{P_{22}^{A}}\right),
\qquad
V_2 = \frac{1}{4 n_G}\left(\frac{1}{P_{11}^{N}} + \frac{1}{P_{12}^{N}} + \frac{1}{P_{21}^{N}} + \frac{1}{P_{22}^{N}}\right).
\]
The second test, T_{IH} unlinked, assumes that the two loci are unlinked and is
defined as:
\[
T_{IH} = \frac{(I_{GH})^{2}}{V(I_{GH})},
\]
where
\[
(I_{GH})^{2} = [M_1]^{2},
\qquad
V(I_{GH}) = V_1.
\]
recessive or dominant
inheritance agnostic and assumed that a pair of interacting genes produced the
diagnosis. To implement this investigation, we fixed the risk of the upstream
gene of EM1, gene1, to a low but detectable 1.10 risk level. Simultaneously, we
varied the risk on the downstream gene, gene2, from 1.00 (no risk) to 1.20,
a level that is twice as high as the risk of gene1. Note that a no-risk gene is
inconsistent with the purpose of Table 6.2, which identifies the interactions
Figure 6.1. Dominant gene1 analysis: D-D crossover risk = 1.07, D-R
crossover risk = 1.12
[Plot of statistical power (percent, 0 to 100) against relative risk (1.00 to
1.20), with the four curves defined below.]
DD single = gene1 dominant, gene2 dominant, single-gene test; DD double = gene1 dominant,
gene2 dominant, two-gene test; DR single = gene1 dominant, gene2 recessive, single-gene test;
DR double = gene1 dominant, gene2 recessive, two-gene test
Figure 6.2. Recessive gene1 analysis: R-D crossover risk = 1.05, R-R
crossover risk = 1.07
[Plot of statistical power (percent, 0 to 100) against relative risk (1.00 to
1.20), with the four curves defined below.]
RD single = gene1 recessive, gene2 dominant, single-gene test; RD double = gene1 recessive,
gene2 dominant, two-gene test; RR single = gene1 recessive, gene2 recessive, single-gene test;
RR double = gene1 recessive, gene2 recessive, two-gene test
Table 6.7 presents the results for the case in which the risk from the
upstream gene is twice the risk of the downstream gene and demonstrates that
the two-locus, case-control test outperforms all single-gene tests and both of
the refined Wu et al.12 tests in this scenario.
Figure 6.3. Dominant gene2 analysis: D-D crossover risk = 1.0, R-D
crossover risk = 1.12
[Plot of statistical power (percent, 0 to 100) against relative risk (1.00 to
1.20), with the four curves defined below.]
DD single = gene1 dominant, gene2 dominant, single-gene test; DD double = gene1 dominant,
gene2 dominant, two-gene test; RD single = gene1 recessive, gene2 dominant, single-gene test;
RD double = gene1 recessive, gene2 dominant, two-gene test
Figure 6.4. Recessive gene2 analysis: R-D crossover risk = 1.05, R-R
crossover risk = 1.07
[Plot of statistical power (percent, 0 to 100) against relative risk (1.00 to
1.20), with the four curves defined below.]
DR single = gene1 dominant, gene2 recessive, single-gene test; DR double = gene1 dominant,
gene2 recessive, two-gene test; RR single = gene1 recessive, gene2 recessive, single-gene test;
RR double = gene1 recessive, gene2 recessive, two-gene test
Figures 6.3 and 6.4 are consistent with the results from Figures 6.1 and 6.2
and further suggest that beyond risk value = 1.05, single-gene tests are no
longer as effective from a power perspective as two-gene tests. Furthermore,
the power of two-gene tests improves as the risk of the downstream gene
increases. This is exactly the opposite of the scenario for single-gene tests,
which decline in power as the risk of the downstream gene increases.
Discussion
Our investigation of epistatic scenarios involving low-risk loci indicates that
for a given locus, single-locus tests are not as effective as two-locus tests for
predicting associations if the risk value for a second interacting locus exceeds
1.05 to 1.12 (the crossover risk value varies depending on the genetic inheritance
properties of the pair of loci). In general, the power of two-locus tests to detect
associations improves as the risk value of the second locus increases, whereas
the power of single-locus tests progressively declines. Disturbingly, for certain
inheritance models and risk values, a true association between a locus and
phenotype can be entirely masked by a second interacting locus when using
single-locus tests. These findings are not unexpected and are consistent with
previous findings reported by others.6,24,25 However, single-gene models
continue to be used as the core methods for detecting associations in a GWAS
context. Our study is significant in that it provides a more exact estimate of the
risk scenarios in which single-locus models are inferior.
Comparing the performance of the three different two-locus tests evaluated
in this study, in most cases for EM1, the two-locus, case-control Pearson test is
optimal. In certain scenarios (i.e., when both genes have a dominant MOI), the
unlinked Wu et al. test (in which cases and controls are included) as refined
by Ueki et al. is optimal.11,12 This finding is somewhat surprising given that
the modified Wu et al. test measures interaction effects exclusively, whereas
the two-locus, case-control test includes main effects for both loci as well as
interaction effects.
Despite the widespread recognition that single-locus tests are likely to be
inferior to multilocus tests for GWAS of many diseases and phenotypes, an
unresolved issue is how to construct a computationally practical test that takes
into account interactions and enhances the detection of associations between
a specific locus and the phenotype of interest. Wang and colleagues conducted
an empirical comparison of five epistatic interaction detection methods,
including a number of two-pass methods.26 They indicate that each of the five
Chapter References
1. Cooley P, Gaddis N, Folsom R, et al. Conducting genome-wide association
studies: epistasis scenarios. J Proteomics Bioinform. 2012;5(10):245-251.
2. Kuo CL, Feingold E. What's the best statistic for a simple test of genetic
association in a case-control study? Genet Epidemiol. 2010;34(3):246-253.
3. Cooley P, Clark R, Folsom R, et al. Genetic inheritance and genome
wide association statistical test performance. J Proteomics Bioinform.
2010;3(12):321-325.
4. Manolio TA. Genomewide association studies and assessment of the risk
of disease. N Engl J Med. 2010;363(2):166-76.
5. Burdett T, Hall P, Hasting E, et al. The NHGRI-EBI Catalog of published
genome-wide association studies. 2015 [cited 2015 Nov 2]; Available from:
www.ebi.ac.uk/gwas
6. Li J, Horstman B, Chen Y. Detecting epistatic effects in association studies
at a genomic level based on an ensemble approach. Bioinformatics.
2011;27(13):i222-9.
7. Carlson CS, Eberle MA, Kruglyak L, et al. Mapping complex disease loci
in whole-genome association studies. Nature. 2004;429(6990):446-52.
8. Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting
multiple loci that influence complex diseases. Nat Genet. 2005;37(4):413-7.
9. Suhre K, Shin SY, Petersen AK, et al. Human metabolic individuality in
biomedical and pharmaceutical research. Nature. 2011;477(7362):54-60.
10. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of
complex diseases. Nature. 2009;461(7265):747-53.
Overview
In this chapter, we address a scenario that uses synthetic genotype case-control
data that are influenced by environmental factors in the context of a
genome-wide association study (GWAS). The precise way the environmental
influence contributes to a given phenotype is typically unknown. Therefore,
our study evaluates how to approach a GWAS that may have an environmental
component. Specifically, we assess different statistical models in the context of
a GWAS to make association predictions when the form of the environmental
influence is questionable. We used a simulation approach to generate
synthetic data corresponding to a variety of possible environmental-genetic
models, including a main effects only model as well as a main effects with
interactions model. Our method takes into account the strength of the
association between phenotype and both genotype and environmental factors,
but we focus on low-risk genetic and environmental risks that necessitate
using large sample sizes (N = 10,000 and 200,000) to predict associations with
high levels of confidence. We also simulated different Mendelian gene models,
and we analyzed how the collection of factors influences statistical power in
the context of a GWAS. Using simulated data provides a truth set of known
outcomes such that the association-affecting factors can be unambiguously
determined. We also test different statistical methods to determine their
performance properties. Our results suggest that the chances of predicting an
association in a GWAS are reduced if an environmental effect is present and
the statistical model does not adjust for that effect. This is especially true if the
environmental effect and genetic marker do not have an interaction effect. The
functional form of the statistical model also matters. The more accurately the
form of the environmental influence is portrayed by the statistical model, the
more accurate the prediction will be. Finally, even with very large sample sizes,
association predictions involving recessive markers with low risk can be poor.
Chapter 7 is based on a study that was published in the RTI Press.1 Minor text edits were made to
this chapter. The analysis of the gene data has not changed.
Introduction
In recent years, scientists and researchers have increasingly used GWAS to
unravel the genetic factors that influence important phenotypes such as disease
presence and predisposition. The hypothesis GWAS follows is that if genetic
variations are more frequent in people with a given disease, the variations
are likely associated with the disease. In general, GWAS apply univariate
statistical tests to each gene marker or single nucleotide polymorphism (SNP)
as an initial step. This SNP-based test is statistically straightforward, and the
core tests for assessing the associations are standard methods (e.g., χ² tests,
regression) that have been studied outside of and within the GWAS context.
Kuo & Feingold describe the most commonly used statistical methods applied
to GWAS.2 All the tests they cite are single-locus tests.
The popularity of the GWAS approach is testimony to its simplicity;
however, it obscures the important issue of whether a single-gene model
is conducive to unraveling the workings of the biosynthetic pathways of a
phenotype. In the preceding chapter, we demonstrate that if two genes are in
epistasis the likelihood of identifying the weaker (in terms of risk) of the two is
diminished if a single-gene model is used in this context. We would anticipate
a similar finding involving gene-environment interactions, namely that
the weaker effect (in terms of risk) would be dominated by the stronger effect.
Researchers can use classical statistical tests derived from case-control
experiments to determine whether two loci associate in a GWAS context. Both
a Pearson χ² test and tests involving logistic regression can be used to test
for pairwise interactions.
In this study, we focused on low-effect loci with low relative risks of
association with disease diagnosis, because the evidence suggests these are
common.3 Most GWAS report only small changes in disease risk (1.1 to 1.5).
It has also been reported that relative risks underestimate the true risk and the
corresponding effect size.4
Note that this chapter does not assess the multilocus scenario, which is the
focus of Chapter 8. Nor does it account for the scenario involving multiple
loci that are tested simultaneously. This scenario, also discussed in Chapter 8,
requires an adjustment to the p-value threshold via a Bonferroni correction.5
This simple procedure (dividing the p-value threshold by the number of tests)
assumes statistical independence between tests, an assumption that does not
hold. Hence the correction represents an overcorrection, leading to
higher-than-necessary type II error rates. We note that the correction does
not affect our type I or type II assessment because all of our examples have
been generated with positive genetic associations (even if some of these
associations are very difficult to detect). Thus, all scenarios are associated,
and the issues are whether that association can be detected and which
statistical procedures perform most effectively.
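As a minimal sketch of the arithmetic behind the Bonferroni correction described above, the per-test threshold is the family-wise threshold divided by the number of tests. The function name and the example numbers are illustrative assumptions, not values taken from this study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical example: a family-wise alpha of 0.05 spread over
# 500,000 simultaneous single-SNP tests.
per_test_alpha = bonferroni_threshold(0.05, 500_000)
print(f"per-test threshold: {per_test_alpha:.1e}")
```

Because the division treats every test as independent, the resulting threshold is conservative when markers are correlated, which is the overcorrection noted in the text.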
The word risk can have a variety of meanings. In an environmental
context, it means a hazard based on an exposure to a chemical or pollutant
such as tobacco smoke. In another context, risk is interpreted more narrowly
to mean the probability of an adverse consequence (e.g., an adverse event
such as a disease). The term environmental risk in this study is used broadly;
we define it as any process that contributes to a disease diagnosis that is not
genetic in origin. For example, environmental risks can represent exposure to
chemicals or pollutants or a subject's age.
Our overarching goal was to identify which statistical methods best identify
genotype-phenotype associations when environmental effects also influence
the association. Detecting such associations is particularly difficult for
genetic variants with modest impacts on risk. Consequently, our experiments
specifically investigated scenarios involving low-risk genetic variants and
assessed whether environmental influences with varied levels of risk could
be a source of the missing heritability observed using single-gene models.6
Not surprisingly, our investigations demonstrated that the best statistical
method (with respect to statistical power) depends on whether there are
interactions between the genotype and environmental factors and how well the
specified statistical model matches the environmental effect associated with
the phenotype. In summary, the simulated data set provides a truth set for
assessing how sensitive the predicted association is to the choice of
statistical method. Establishing the genotype-to-phenotype connections without
using a simulation approach is difficult to impossible. Although our study
results demonstrate a number of obvious truths, a number of unexpected
results may lead researchers to more powerful statistical approaches that can
establish the validity of the simulation approach.
Background
Many complex diseases (e.g., diabetes, asthma, cancer) are affected in part by
interactions between genes and environmental factors. However, investigators
conducting GWAS typically do not investigate the influence of environmental
factors as part of the GWAS process.
There have been several notable exceptions. For example, Terry and colleagues
showed a significant interaction between smoking status and a specific gene
for lung cancer.7 Stern and colleagues found smoking status to be an effect
modifier of the association between
a codon and the risk of bladder cancer.8 Understanding the relationship
between genetic polymorphisms and environmental exposures can greatly aid
investigators in detecting high-risk subgroups in the population and provide
better insight into pathway mechanisms for complex diseases.
Current GWAS methods are designed to detect main effects, that is, direct
associations of an SNP or clusters of SNPs with disease.9,10 In the context of
complex diseases, examining main effects only could miss important genetic
variants specific to subgroups of the population.
SNP and a nutrient found in corn oil, which conveyed a 20 percent higher risk
than the SNP alone did.
Murcray and colleagues performed a general methodological
study that focused on identifying SNPs that demonstrate heterogeneity
between subgroups defined by some environmental exposure.17 They
describe a two-step approach for detecting loci involved in gene-environment
interactions that is performed independently of any initial scans for
main effects. They expanded on the traditional test for gene-environment
interaction in a case-control study by incorporating a preliminary screening
step constructed to efficiently use all available information in the data. They
claim that their two-step approach is more powerful than the standard
test of interaction across a wide range of models and consequently is more
robust to changes in environmental exposure and minor allele frequency
than the traditional one-step test for identifying highly significant SNPs. The
difficulty with most methods, including theirs, is that they are not data mining
methods: the specific environmental factor and the form of that factor have
to be established prior to analysis. This has proven to be a difficulty with our
methods as well. The specific environmental factor or factors included in the
model greatly affect the power of the tests. Specifically, researchers should use
some combination of the literature and data mining activities to establish
the form of the environmental effect model (step function or linear) on the
logistic scale.
Cornelis and colleagues provide a comparative study of
several logistic regression-based tests of gene-environment (G-E) and G×E
interactions.18 All seven methods compared in their paper assumed a
log-additive mode-of-inheritance model for each SNP. This differs from our
methods, which are agnostic to the mode of inheritance. Cornelis et al. do
not identify a preference for any of the seven methods and instead indicate that
preference would depend on the goal of the study. They also explored methods
investigating environment effects only in subjects with a positive phenotype
(i.e., case-only studies).
Finally, Kraft and colleagues performed a study that was similar in
content to the Cornelis and colleagues study in that it also focused on
log-additive gene models. They formulated a likelihood ratio test of association
between disease and locus that allows for the possibility that the genetic
effect may be modified by an environmental factor.19 The specific environment model they
Methods
Overview
We simulated genetic and environmental interactions in a GWAS context using
a qualitative association framework to determine which statistical methods
and models reliably predict associations between a qualitative phenotype
(specifically, a disease diagnosis, coded as case for a positive diagnosis or
control for a negative diagnosis) and a gene paired with an environmental
influence. As with our previous work, the concept of relative risk is the basis
for this investigation.20 We define the genetic relative risk (GI risk) of a
wild-type genotype to be the probability of a positive diagnosis given an
occurrence of a (wild-type) genotype divided by the probability of disease in
the absence of the disease genotype. We also define the environmental risk
(EE risk) as the probability of a positive diagnosis given an exposure divided
by the probability of a positive diagnosis in unexposed subjects. The values of
GI risk and EE risk are specified exogenously and vary from low-risk to
not-so-low-risk.
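The relative risk definitions above are both ratios of two conditional probabilities. The following minimal sketch computes such a ratio from counts; the function name and the example counts are hypothetical and serve only to illustrate the definition.

```python
def relative_risk(cases_exposed: int, n_exposed: int,
                  cases_unexposed: int, n_unexposed: int) -> float:
    """P(positive diagnosis | exposed) / P(positive diagnosis | unexposed).

    "Exposed" can stand for carrying the risk genotype (GI risk) or for
    having the environmental exposure (EE risk).
    """
    p_exposed = cases_exposed / n_exposed
    p_unexposed = cases_unexposed / n_unexposed
    if p_unexposed == 0:
        raise ZeroDivisionError("no cases among unexposed subjects")
    return p_exposed / p_unexposed


# Hypothetical counts: 120 cases among 1,000 exposed subjects versus
# 100 cases among 1,000 unexposed subjects.
print(round(relative_risk(120, 1000, 100, 1000), 3))
```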
We generated 1,000 replicates of simulation data that depended on the two
risk values (GI risk and EE risk) for each of three gene models using a
standard Bernoulli process and analyzed them in terms of the observed power
profiles for a low alpha error threshold (α = 10⁻⁸). The distribution of the
number of alleles per genotype was randomized across replicates and was based
on real data from the study by Schymick and colleagues used in previous
chapters.21 We biased the risk levels
to the low end of the risk continuum because these are more difficult scenarios
and are typical of what has been observed in the literature.3 To support these
low risk levels, we fixed our sample size at N = 10,000 (5,000 cases and 5,000
controls) and N = 200,000 (100,000 cases and 100,000 controls) to determine
whether it is possible to measure associations in low-risk, recessive inheritance
scenarios. Other studies have used smaller values (N = 6,000) for comparable
investigations.13
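Power in this setting is an empirical proportion: the share of simulated replicates in which the test rejects at the chosen alpha. A minimal sketch follows, assuming one p-value per replicate is already available; the function name, the strict inequality, and the example p-values are assumptions made for illustration.

```python
from typing import Iterable


def empirical_power(p_values: Iterable[float], alpha: float = 1e-8) -> float:
    """Fraction of replicates whose p-value falls below alpha."""
    values = list(p_values)
    if not values:
        raise ValueError("at least one replicate is required")
    rejections = sum(1 for p in values if p < alpha)
    return rejections / len(values)


# Hypothetical p-values from four replicates; two fall below 1e-8.
print(empirical_power([1e-9, 3e-12, 0.5, 0.02]))
```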
From the selected relative risk (GI risk), penetrance (P), and mode-of-inheritance
(MOI) assumptions, we used the formulas in Table 7.1 to assign a case (1) or
control (0) code. This step converts the relative risk ratio (GI risk) into the
probability of a case (disease), given the MOI gene model assumed.
Experiment 1: The Main Effects Model. For the first experiment, half of
the population (selected at random and assigned 50 < E < 71) incurred an
EE relative risk (EE risk). The assigned risk value was 1.10, 1.20, 1.30, or
1.40. The other half of the population (assigned 29 < E < 51) incurred no risk
(EE risk = 1.0). Thus, Experiment 1 simulates a fixed EE. When the determinant
risk variable, E, exceeds a threshold, a positive diagnosis is more likely to
occur. This is identified as the fixed risk, main effects, no interaction model.
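The step-function exposure in Experiment 1 can be sketched as follows. This is an illustration under stated assumptions: E is treated as an integer between 30 and 70, the threshold of 50 is read from the two ranges given above, and the function name is hypothetical.

```python
def ee_risk_step(e: int, risk: float, threshold: int = 50) -> float:
    """EE relative risk under a step (fixed-risk) exposure model.

    Subjects with E above the threshold incur the assigned risk;
    all other subjects incur no added risk (relative risk of 1.0).
    """
    return risk if e > threshold else 1.0


# A subject with E = 62 incurs the assigned risk; one with E = 41 does not.
print(ee_risk_step(62, 1.20), ee_risk_step(41, 1.20))
```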
Experiment 3: The Main Effects Log-Linear Risk Model. For the third
scenario, the entire population (randomly assigned 30 ≤ E ≤ 70) incurred an
EE relative risk (EE risk) that was related to E in the following manner:
$$y = \frac{E - 30}{40}, \qquad \text{EE risk} = X^{y}, \quad X \in \{1.10, 1.20, 1.30, 1.40\}.$$
Experiment 3 simulates a log-linear variable risk model, with larger values of
E conveying additional risk levels. As in Experiment 1, there is no interaction
between the GI and EE risks.
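The log-linear relationship in Experiment 3 scales the exposure variable to the unit interval and raises the maximum risk X to that power, so the risk is 1.0 at E = 30 and X at E = 70. A minimal sketch follows; the function name is an assumption.

```python
def ee_risk_loglinear(e: float, x: float) -> float:
    """EE relative risk that is linear in E on the log scale.

    y rescales E from [30, 70] to [0, 1]; the risk is x ** y.
    """
    if not 30 <= e <= 70:
        raise ValueError("E is expected to lie between 30 and 70")
    y = (e - 30) / 40
    return x ** y


# The risk grows smoothly from 1.0 (E = 30) to X (E = 70).
for e in (30, 50, 70):
    print(e, round(ee_risk_loglinear(e, 1.20), 4))
```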
For each experiment type, we varied the gene model to determine the
relative power differences across model specification. Overall, Experiment 1
data have a step function relationship to EE and no interaction or difference
in slopes (or EE step heights) across the three genotypes. In contrast, the
Experiment 2 data have a step function relationship with EE where the aa and
aA genotypes have the same slope (step height) but different intercepts. The
AA genotype relationship to the EE is flat or has zero slope (no step up). In
Experiment 3, the relationship to EE is log-linear, with equal slopes for all
three genotypes. Finally, in Experiment 4, the relationship to EE is log-linear;
the aa and aA genotypes have the same slope but different intercepts; and the
AA genotype relationship to the EE is flat, or has zero slope.
Statistical Models
All models tested assumed a logistic regression (LR) specification. This form is
commonly used in association studies involving environmental interactions.21
Table 7.2 shows the variables used in the different models.
We used three specific statistical models to assess the data generated by the
four experiments. Each assumed an intercept term and had the following form:
Model 1 is a logistic regression model with a single-variable genotype (G)
main effect (2 df). This is a candidate model if no environmental exposure
is suspected.
Model 2 is a logistic regression mixed main effects and interaction model
(g1, g2, E, g1×E, g2×E) (6 df). This is a fully specified model that assumes
that the environmental exposure is a continuous variable.
Model 3 is also a logistic regression mixed main effects and interaction
model (g1, g2, e1, g1×e1, g2×e1) (6 df). This is the fully specified
categorical model and assumes that the environmental exposure has a
specific (all-or-nothing) categorical variable form.
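To make the three specifications concrete, the sketch below builds the predictor values for one subject under each model. It assumes that g1 and g2 are indicators for the aA and aa genotypes (with AA as the reference), that e1 is an indicator for E above a threshold of 50, and that the intercept is added by the fitting routine. The function name and the threshold default are hypothetical.

```python
def design_row(genotype: str, e: float, model: str, threshold: float = 50):
    """Predictor values (without intercept) for models M-1, M-2, and M-3."""
    if genotype not in ("AA", "aA", "aa"):
        raise ValueError("genotype must be 'AA', 'aA', or 'aa'")
    g1 = 1.0 if genotype == "aA" else 0.0  # heterozygote indicator
    g2 = 1.0 if genotype == "aa" else 0.0  # homozygous-variant indicator
    if model == "M-1":
        return [g1, g2]                    # genotype main effects only
    if model == "M-2":
        x = float(e)                       # continuous exposure E
    elif model == "M-3":
        x = 1.0 if e > threshold else 0.0  # all-or-nothing exposure e1
    else:
        raise ValueError("model must be 'M-1', 'M-2', or 'M-3'")
    # Main effects followed by the two genotype-by-exposure interactions.
    return [g1, g2, x, g1 * x, g2 * x]


print(design_row("aA", 60, "M-3"))
```

Coding the exposure as either E or e1 is the only difference between M-2 and M-3, which is why the match between that coding and the data-generating protocol drives the power differences reported later.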
Table 7.4 summarizes the specific regression models we used in this study.
Note that we initially compared six models. Two were gene-only models (a 1 df
log-additive test and a 2 df test), and four were main effects plus interaction
models. We had two environmental exposure specifications (E and e1) and two
genetic inheritance specifications (G and g1, g2). From the six initial models,
we selected the three models that dominated the other three: M-1, M-2, and
M-3. We dropped the other three models (M-1a, M-2a, and M-3a) from our
assessment.
The test statistics we used in our analyses are defined as the difference
between two log-likelihood (LLH) statistics. The first is specific to the model
used, and the second is based on a model with only the intercept term.
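For the genotype-only model (M-1), the log-likelihood difference can be computed directly from counts, because a logistic model with an intercept and two genotype indicators reproduces the observed case fraction within each genotype. The sketch below relies on that fact. It assumes the conventional likelihood ratio scaling (twice the log-likelihood difference) and a chi-square reference distribution with 2 degrees of freedom; the source states only that the statistic is a difference of two LLH statistics. Function names and counts are hypothetical.

```python
import math


def group_llh(cases: int, controls: int) -> float:
    """Maximized Bernoulli log-likelihood for one group."""
    total = cases + controls
    llh = 0.0
    for k in (cases, controls):
        if k > 0:
            llh += k * math.log(k / total)
    return llh


def genotype_lr_test(counts):
    """counts maps genotype -> (cases, controls); returns (statistic, p)."""
    full = sum(group_llh(ca, co) for ca, co in counts.values())
    all_cases = sum(ca for ca, _ in counts.values())
    all_controls = sum(co for _, co in counts.values())
    null = group_llh(all_cases, all_controls)   # intercept-only model
    stat = max(2.0 * (full - null), 0.0)        # guard against rounding
    p_value = math.exp(-stat / 2.0)             # chi-square tail, 2 df
    return stat, p_value


# Hypothetical counts of (cases, controls) by genotype.
stat, p = genotype_lr_test({"AA": (40, 60), "aA": (50, 50), "aa": (70, 30)})
print(round(stat, 2), p)
```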
Results
Association Analysis
In this section, we describe the power profiles that result by applying the
models described in Table 7.4 to the data generated according to the four
different experiments in Table 7.3. We focus on detecting the associations
between the combined genotype-environmental factors on phenotype
outcome (disease diagnosis). We assess the importance of model specification
in predicting the presence of association with a phenotype of interest and to
what degree the gene model and genotype environment interactions influence
power. In the following section on genotype associations, we assess the role
Table 7.5. Power values, by statistical model, GI risk, and EE risk: Experiment 1, all gene models
                 Additive       Dominant       Recessive
GI risk EE risk  M-1 M-2 M-3    M-1 M-2 M-3    M-1 M-2 M-3
1.10 1.00 .002 .004 .004 .000 .004 .004 .000 .000 .000
1.10 1.05 .002 .016 .022 .000 .040 .044 .000 .000 .000
1.10 1.10 .000 .160 .316 .000 .214 .354 .000 .050 .122
1.10 1.15 .002 .654 .882 .000 .728 .924 .000 .488 .810
1.10 1.20 .006 .986 1.00 .000 .992 1.00 .000 .968 1.00
1.15 1.00 .024 .046 .042 .000 .028 .024 .000 .000 .000
1.15 1.05 .028 .102 .138 .000 .082 .102 .000 .000 .000
1.15 1.10 .042 .378 .538 .000 .336 .468 .000 .038 .144
1.15 1.15 .052 .838 .948 .000 .832 .954 .000 .528 .806
1.15 1.20 .064 .996 1.00 .000 .994 1.00 .000 .958 .998
1.20 1.00 .246 .210 .206 .004 .100 .104 .000 .000 .000
1.20 1.05 .274 .308 .338 .002 .214 .240 .000 .002 .004
1.20 1.10 .308 .630 .746 .006 .552 .684 .000 .054 .142
1.20 1.15 .350 .908 .982 .016 .912 .972 .002 .586 .852
1.20 1.20 .394 .996 1.00 .028 .994 1.00 .004 .972 .998
GI risk = genetic inheritance risk level (first column); EE risk = environmental exposure risk level (second column).
Note: Table 7.4 defines the statistical models (M-1, M-2, and M-3).
Figure 7.1 shows the data generated using the protocol for Experiment 1
for the additive gene model. Figure 7.1 includes the optimal model (M-3,
identified by the boldfaced cells in Table 7.5) and the model that does not
include an EE variable in its specification (M-1). The results presented in
Table 7.5 and Figure 7.1 indicate that there is little difference in performance
between models when the risk of EE is not present.
[Figure 7.1. Statistical power (vertical axis) by relative risk combination (horizontal axis: GI risk of 1.10, 1.15, or 1.20 paired with EE risk of 1.00 to 1.20) for models M-1 and M-3, Experiment 1, additive gene model. Table 7.4 defines the statistical models.]
The results that Figure 7.1 and Table 7.5 show indicate the following:
The power profile of M-1 is substantially lower than those of models M-2 and
M-3. M-1 represents a typical single-locus method used in a GWAS that
ignores environmental influences. We conclude that not including an EE term
reduces the likelihood of detecting the association between the locus and the
phenotype.
M-3 is the most powerful of the three models. This is expected because the
Experiment 1 protocol should generate data consistent with the M-3 model
formulation.
The difference between the profiles of M-2 and M-3 is a result of the
manner used to characterize the EE functional form. Because the data were
generated in a manner compatible with the e1 variable used in M-3, that
model generated more accurate power predictions.
Note that in the full M-3, the overall intercept is the log of the intercept
for the line that relates EE to the log-odds risk among those subjects with the
nondisease genotype AA. The coefficient associated with the g1 main effect is
testing for the difference between intercepts for the subjects with genotype aA
and those with genotype AA. Similarly, the g2 main effect coefficient is testing
for the difference between the intercepts for subjects with the aa genotype and
those with the AA genotype.
The EE main effect coefficient is the height of the step in the step function
relating EE to log-odds-risk for subjects with the AA genotype, and it,
therefore, tests for a common EE step height across all three genotypes. The
g1×E interaction coefficient is the difference between the step heights for
the aA subjects and the AA subjects. Similarly, the g2×E coefficient is the
difference between the step heights for the aa subjects and the AA subjects.
Because the AA, aA, and aa step heights/slopes associated with the EE
environmental effect are all equal in Experiments 1 and 3, only the common
main effect (ME) associated with EE contributes to association prediction in
those data sets, and the interaction terms are superfluous.
Table 7.6 and Figure 7.2 show the results of applying the three models
that Table 7.4 describes to the data generated according to the Experiment 2
protocol (see Table 7.3). The results from Experiment 2 indicate the following:
Although M-1 does not adjust for EE, the observed (relatively) high power
profiles for high EE risk levels suggest that the GI-EE interaction effect is
embedded in the M-1 power values, and the high power profiles are credited
as a genotype main effect.
As in Experiment 1, M-3 outperforms all other models because the variable
e1 properly characterizes EE behavior. This clearly demonstrates the value
of preprocessing (i.e., mining) the data before committing to a specific
association model.
Table 7.7 and Figure 7.3 show the results of applying the three statistical
models described in Table 7.4 to the data generated according to the
Experiment 3 protocol (see Table 7.3). The results from Experiment 3 indicate
the following:
M-1 consistently performs below M-2 and M-3, indicating that not
including an EE term limits the association assessment.
In general, model M-2 produces better power profiles than M-3. This is
expected given that the EE incremental risk is linearly related to the log of
EE. Thus, model M-2 is more consistent with the protocol used to generate
the data in Experiment 3.
Table 7.6. Power values, by statistical model, GI risk, and EE risk: Experiment 2, all gene models
                 Additive       Dominant       Recessive
GI risk EE risk  M-1 M-2 M-3    M-1 M-2 M-3    M-1 M-2 M-3
1.10 1.00 .002 .004 .004 .000 .004 .004 .000 .000 .000
1.10 1.05 .012 .086 .106 .000 .086 .094 .000 .000 .002
1.10 1.10 .114 .394 .462 .010 .408 .442 .002 .072 .116
1.10 1.15 .340 .702 .744 .078 .690 .738 .022 .376 .462
1.10 1.20 .640 .856 .914 .318 .844 .896 .104 .642 .714
1.15 1.00 .024 .046 .042 .000 .028 .024 .000 .000 .000
1.15 1.05 .166 .258 .290 .006 .212 .222 .000 .000 .004
1.15 1.10 .478 .634 .676 .082 .570 .590 .002 .090 .126
1.15 1.15 .710 .808 .832 .380 .846 .872 .082 .398 .464
1.15 1.20 .880 .952 .968 .730 .934 .954 .230 .654 .740
1.20 1.00 .246 .210 .206 .004 .100 .104 .000 .000 .000
1.20 1.05 .544 .518 .520 .084 .404 .426 .008 .002 .006
1.20 1.10 .760 .776 .784 .336 .752 .784 .052 .120 .162
1.20 1.15 .918 .926 .936 .736 .920 .944 .184 .418 .496
1.20 1.20 .978 .990 .996 .916 .984 .986 .345 .640 .725
GI risk = genetic inheritance risk level (first column); EE risk = environmental exposure risk level (second column).
Note: Table 7.4 defines the statistical models (M-1, M-2, and M-3).
[Figure 7.2. Statistical power (vertical axis) by relative risk combination (horizontal axis: GI risk of 1.10, 1.15, or 1.20 paired with EE risk of 1.00 to 1.20) for models M-1 and M-3, Experiment 2. Table 7.4 defines the statistical models.]
Table 7.7. Power values, by statistical model, GI risk, and EE risk: Experiment 3, all gene models
                 Additive       Dominant       Recessive
GI risk EE risk  M-1 M-2 M-3    M-1 M-2 M-3    M-1 M-2 M-3
1.10 1.00 .002 .000 .004 .000 .006 .004 .000 .000 .000
1.10 1.05 .002 .008 .006 .000 .014 .014 .000 .000 .000
1.10 1.10 .002 .034 .030 .000 .044 .030 .004 .004 .000
1.10 1.15 .002 .152 .096 .000 .190 .120 .062 .024 .018
1.10 1.20 .004 .472 .296 .000 .500 .260 .258 .228 .066
1.15 1.00 .024 .036 .042 .000 .034 .024 .000 .000 .000
1.15 1.05 .030 .068 .058 .000 .048 .052 .000 .000 .000
1.15 1.10 .040 .152 .106 .000 .162 .118 .002 .002 .002
1.15 1.15 .046 .390 .278 .000 .384 .302 .048 .034 .012
1.15 1.20 .052 .650 .524 .000 .670 .508 .308 .278 .074
1.20 1.00 .246 .174 .206 .004 .114 .104 .000 .000 .000
1.20 1.05 .270 .274 .250 .002 .166 .150 .000 .000 .000
1.20 1.10 .286 .402 .376 .002 .344 .260 .004 .002 .002
1.20 1.15 .348 .658 .548 .012 .520 .490 .098 .058 .032
1.20 1.20 .378 .862 .718 .024 .796 .660 .299 .289 .107
GI risk = genetic inheritance risk level (first column); EE risk = environmental exposure risk level (second column).
Note: Table 7.4 defines the statistical models (M-1, M-2, and M-3).
[Figure 7.3. Statistical power (vertical axis) by relative risk combination (horizontal axis: GI risk of 1.10, 1.15, or 1.20 paired with EE risk of 1.00 to 1.20) for models M-1 and M-2, Experiment 3. Table 7.4 defines the statistical models.]
Table 7.8 and Figure 7.4 show the results of applying the three models
described in Table 7.4 to the data generated according to the Experiment 4
protocol (see Table 7.3).
Table 7.8. Power values, by statistical model, GI risk, and EE risk: Experiment 4, all gene models
                 Additive       Dominant       Recessive
GI risk EE risk  M-1 M-2 M-3    M-1 M-2 M-3    M-1 M-2 M-3
1.10 1.00 .002 .006 .004 .000 .006 .004 .000 .000 .000
1.10 1.05 .008 .054 .006 .000 .068 .054 .000 .000 .002
1.10 1.10 .078 .236 .030 .006 .284 .226 .002 .014 .008
1.10 1.15 .220 .478 .096 .048 .500 .476 .008 .110 .078
1.10 1.20 .510 .672 .678 .148 .624 .632 .046 .314 .294
1.15 1.00 .024 .074 .042 .000 .046 .024 .000 .000 .000
1.15 1.05 .118 .208 .164 .002 .162 .130 .000 .000 .000
1.15 1.10 .384 .514 .476 .040 .454 .410 .000 .018 .016
1.15 1.15 .618 .688 .658 .232 .698 .684 .046 .162 .132
1.15 1.20 .820 .830 .838 .570 .802 .792 .144 .354 .328
1.20 1.00 .246 .250 .206 .004 .138 .104 .000 .000 .000
1.20 1.05 .464 .466 .406 .046 .354 .290 .004 .002 .000
1.20 1.10 .714 .714 .692 .222 .642 .624 .026 .040 .022
1.20 1.15 .892 .876 .864 .622 .816 .824 .136 .216 .170
1.20 1.20 .954 .940 .944 .848 .916 .930 .257 .317 .343
GI risk = genetic inheritance risk level (first column); EE risk = environmental exposure risk level (second column).
Note: Table 7.4 defines the statistical models (M-1, M-2, and M-3).
Table 7.8 and Figure 7.4 show results that indicate the following:
Consistent with Experiment 2's results, model M-1 does not adjust for
EE, but because of the influence of GI-EE interaction effects, M-1 displays
higher power profiles for large EE risk levels.
As in Experiment 3, M-2 outperforms M-3 because it better characterizes
the EE by using the variable E (age) and further demonstrates the value
of preprocessing (i.e., mining) the data before committing to a specific
association model.
In the presence of GI-EE interaction effects, the genetic-only model (M-1)
performs better than anticipated.
[Figure 7.4. Statistical power (vertical axis) by relative risk combination (horizontal axis: GI risk of 1.10, 1.15, or 1.20 paired with EE risk of 1.00 to 1.20) for models M-1 and M-2, Experiment 4. Table 7.4 defines the statistical models.]
Genotype Associations
The analysis in the previous section focused exclusively on composite
associations, that is, whether a specific gene plus an environmental factor
associates with a phenotype. As we noted earlier, our main interest was
separating main
genetic effects from environmental effects and their interactions. To accomplish
this, we defined a total effect test (TOT) that adjusts for EE where
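$$\mathrm{TOT} = \mathrm{LLH}\left[\log(\text{intercept}, g_1, g_2, e_1, g_1 \times e_1, g_2 \times e_1)\right] - \mathrm{LLH}\left[\log(\text{intercept}, e_1)\right] \tag{7.5}$$
is the test we applied to the data generated by the Experiment 1 and 2 protocols, and
$$\mathrm{TOT} = \mathrm{LLH}\left[\log(\text{intercept}, g_1, g_2, E, g_1 \times E, g_2 \times E)\right] - \mathrm{LLH}\left[\log(\text{intercept}, E)\right] \tag{7.5a}$$
is the test that we applied to the data generated by the Experiment 3 and 4 protocols.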
TOT is the association test that measures genetic effects (main and
interactive) and is adjusted for the environmental effect.27 TOT simultaneously
measures whether the aA and aa intercepts are different from the AA intercept
and whether the aA and aa slopes are nonzero, given that the AA slope on EE
is zero. This test was used to test for association from all causes.
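For the categorical exposure e1, both models in equation 7.5 fit the observed case fractions exactly: the full model has one parameter per genotype-by-exposure cell, and the reduced model has one per exposure group. The sketch below uses that property to compute TOT from a table of counts. It assumes the conventional scaling of twice the log-likelihood difference and a chi-square reference distribution with 4 degrees of freedom, neither of which is stated in the source. Names and counts are hypothetical.

```python
import math


def group_llh(cases: int, controls: int) -> float:
    """Maximized Bernoulli log-likelihood for one group."""
    total = cases + controls
    llh = 0.0
    for k in (cases, controls):
        if k > 0:
            llh += k * math.log(k / total)
    return llh


def tot_statistic(cells):
    """cells maps (genotype, e1) -> (cases, controls); returns (stat, p)."""
    # Full model: one fitted probability per genotype-by-exposure cell.
    full = sum(group_llh(ca, co) for ca, co in cells.values())
    # Reduced model: one fitted probability per exposure group.
    reduced = 0.0
    for exposed in (0, 1):
        ca = sum(v[0] for (_, e1), v in cells.items() if e1 == exposed)
        co = sum(v[1] for (_, e1), v in cells.items() if e1 == exposed)
        reduced += group_llh(ca, co)
    stat = max(2.0 * (full - reduced), 0.0)
    p_value = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)  # chi-square, 4 df
    return stat, p_value


# Hypothetical counts with a genotype effect on top of an exposure effect.
cells = {
    ("AA", 0): (100, 100), ("AA", 1): (120, 80),
    ("aA", 0): (55, 45),   ("aA", 1): (65, 35),
    ("aa", 0): (12, 8),    ("aa", 1): (14, 6),
}
print(tot_statistic(cells))
```

The continuous-exposure version in equation 7.5a has no such closed form and would need an iterative logistic regression fit.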
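$$\mathrm{INT} = \mathrm{LLH}\left[\log(\text{intercept}, e_1, g_1, g_2, g_1 \times e_1, g_2 \times e_1)\right] - \mathrm{LLH}\left[\log(\text{intercept}, e_1, g_1, g_2)\right] \tag{7.6}$$
and
$$\mathrm{INT} = \mathrm{LLH}\left[\log(\text{intercept}, E, g_1, g_2, g_1 \times E, g_2 \times E)\right] - \mathrm{LLH}\left[\log(\text{intercept}, E, g_1, g_2)\right]. \tag{7.6a}$$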
The INT test subtracts the main effects for g1, g2, and EE from the TOT and
tests whether the EE steps (or slopes) for the aA and aa genotypes are different
from the corresponding EE step (slope) for genotype AA.
The final test measures the influence of the genetic main effects (ME).
is the test applied to the data generated by Experiments 1 and 2 protocols and
Table 7.9. Total effects test (TOT) power values, by risk profile (GI risk and EE risk): all experiments and gene models, N = 200,000
                 Experiment 1 (TOT^)  Experiment 2 (TOT^)  Experiment 3 (TOT*)  Experiment 4 (TOT*)
GI risk EE risk  Rec  Dom  Add        Rec  Dom  Add        Rec  Dom  Add        Rec  Dom  Add
1.10 1.00 .319 .601 .574 .319 .601 .574 .328 .573 .538 .328 .573 .538
1.10 1.05 .328 .602 .612 .523 .777 .827 .340 .619 .608 .719 .958 .968
1.10 1.10 .343 .604 .577 .806 .949 .953 .377 .636 .596 .990 1.00 1.00
1.10 1.15 .344 .613 .625 .958 .994 .999 .408 .684 .662 1.00 1.00 1.00
1.10 1.20 .358 .622 .637 .993 .999 1.00 .492 .709 .704 1.00 1.00 1.00
1.15 1.00 .341 .725 .739 .341 .725 .739 .336 .725 .713 .336 .725 .713
1.15 1.05 .363 .745 .769 .534 .918 .919 .367 .750 .766 .764 .995 .997
1.15 1.10 .375 .770 .773 .837 .986 .993 .427 .796 .828 .992 1.00 1.00
1.15 1.15 .358 .751 .776 .934 .999 1.00 .480 .832 .870 1.00 1.00 1.00
1.15 1.20 .351 .775 .787 .995 .999 1.00 .487 .861 .893 1.00 1.00 1.00
1.20 1.00 .444 .889 .915 .444 .889 .915 .439 .851 .904 .439 .851 .904
1.20 1.05 .456 .900 .918 .631 .986 .982 .490 .912 .950 .809 1.00 .999
1.20 1.10 .477 .897 .943 .854 .998 1.00 .514 .949 .949 .995 1.00 1.00
1.20 1.15 .471 .917 .935 .973 1.00 1.00 .594 .946 .975 1.00 1.00 1.00
1.20 1.20 .514 .925 .945 .996 1.00 1.00 .632 .961 .986 1.00 1.00 1.00
GI risk = genetic inheritance risk level (first column); EE risk = environmental exposure risk level (second column); TOT = total effects test;
gene models: Rec = recessive, Dom = dominant, Add = additive.
Note: ^ = TOT from equation 7.5; * = TOT from equation 7.5a; α = 10⁻⁸.
Table 7.11. Main effects (ME) power values, by risk profile (GI risk and EE risk): all gene models, N = 10,000
                 Experiment 1 (ME^)  Experiment 3 (ME*)
GI risk EE risk  Rec  Dom  Add       Rec  Dom  Add
1.10 1.00 .123 .463 .434 .139 .450 .417
1.10 1.05 .131 .478 .476 .156 .479 .437
1.10 1.10 .138 .491 .459 .138 .476 .399
1.10 1.15 .138 .523 .500 .149 .498 .478
1.10 1.20 .154 .507 .495 .167 .506 .487
1.15 1.00 .153 .626 .628 .147 .608 .585
1.15 1.05 .179 .642 .664 .146 .619 .659
1.15 1.10 .193 .676 .660 .177 .654 .675
1.15 1.15 .163 .670 .670 .191 .653 .675
1.15 1.20 .171 .668 .682 .182 .676 .695
1.20 1.00 .204 .805 .850 .239 .755 .834
1.20 1.05 .244 .824 .850 .236 .807 .861
1.20 1.10 .250 .829 .887 .239 .817 .866
1.20 1.15 .264 .840 .873 .274 .816 .867
1.20 1.20 .315 .854 .893 .268 .830 .878
GI risk = genetic inheritance risk level (first column); EE risk = environmental exposure risk level (second column); ME = main effects; gene
models: Rec = recessive, Dom = dominant, Add = additive.
Note: ^ = ME from equation 7.7; * = ME from equation 7.7a; α = 10⁻².
Table 7.11 presents the results of the ME test for Experiments 1 and 3. ME
results for Experiments 2 and 4 are not shown because they were generated
by a protocol that produced EE and GI interactions, and if the INT test
demonstrated significance (as it should have), the ME tests would have been
unnecessary. In all cases, the alpha threshold was set to 10⁻².
The results shown in Table 7.12 suggest that the GI-EE interactions are
very sensitive to low EE levels (EE risk < 1.10). They also accurately
characterize an interaction power value of zero when EE risk = 1.00 (i.e., no
EE risk).
Table 7.13. Main effects (ME) power values, by risk profile (GI risk and EE risk): all gene models, N = 200,000
                 Experiment 1 (ME^)  Experiment 3 (ME*)
GI risk EE risk  Rec  Dom  Add       Rec  Dom  Add
1.10 1.00 .68 .68 .68 .68 .68 .68
1.10 1.05 .67 .70 .73 .70 .70 .73
1.10 1.10 .65 .75 .68 .75 .75 .68
1.10 1.15 .65 .68 .72 .68 .68 .71
1.10 1.20 .77 .81 .69 .78 .79 .68
1.15 1.00 .60 .88 .94 .60 .88 .94
1.15 1.05 .65 .84 .92 .67 .85 .92
1.15 1.10 .61 .84 .92 .61 .84 .92
1.15 1.15 .75 .92 .97 .75 .92 .96
1.15 1.20 .65 .87 .94 .66 .85 .94
1.20 1.00 .75 1.0 1.0 .75 1.0 1.0
1.20 1.05 .73 1.0 1.0 .75 1.0 1.0
1.20 1.10 .81 1.0 1.0 .80 1.0 1.0
1.20 1.15 .81 1.0 1.0 .81 1.0 1.0
1.20 1.20 .85 1.0 1.0 .85 1.0 1.0
GI risk = genetic inheritance risk level (first column); EE risk = environmental exposure risk level (second column); ME = main effects; gene
models: Rec = recessive, Dom = dominant, Add = additive.
Note: ^ = ME from equation 7.7; * = ME from equation 7.7a; α = 10⁻⁸.
These results suggest that for very large studies it is possible to predict
positive associations between recessive genes linked to phenotypes with low to
moderate risk.
Conclusions
In summary, the chances of predicting an association in a GWAS are reduced if
an environmental effect is present and the statistical model does not adjust for
it. This is especially true if the environmental effect and genetic marker do not
have an interaction effect. The functional form of the model also matters. The
more accurately the form of the environmental influence is portrayed by the
statistical model, the more accurate the prediction will be. Even with very large
sample sizes, the power to predict associations involving recessive markers is low.
This study focused on one important methodological step involved in
conducting a GWAS: selecting a statistical method and a supporting model
that reliably predict associations. This study does not address the broader issue
of the supporting experimental design that employs the statistical methods as
part of an overall solution strategy. Those combined issues and their mutual
interconnections are described by Cordell.28
The specific scenarios we address here involve genetic associations that
have environmental influences. Our assumption is that the environmental
influence that contributes to a given phenotype is in question and the precise
form of that influence is unknown. A separate analysis to characterize the
functional form to proxy the mechanism behind the environmental exposure
is required. These approaches should focus on case-only data similar to the
methods described in the Cornelis and colleagues study.18 These approaches
involve investigating different environmentally related functional relationships
between the suspected environmental influence and the phenotype in the
cases-only subpopulation. For example, if gene effects and environmental
effects are independently significant with respect to disease prevalence, a
polynomial model could be used to characterize the relationship between
environmental effects and the log-odds of disease prevalence. This would
allow us to test whether the nonlinear parameterization would be required
to characterize the environmental effect. Alternatively, if the environmental
effect has multiple levels such as age, researchers could investigate a cubic
polynomial to assess whether the effect stayed low initially then rose at some
point and flattened out toward the end of the environmental effects range.
If this analysis suggests an appropriate polynomial level for environmental
effects, researchers should also investigate a similar assessment using the gene-
environment interaction variable.
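As a minimal sketch of the cubic specification described above, the log-odds of disease could be written as follows. The symbols are illustrative assumptions rather than notation taken from this chapter: G is a 0/1 genotype indicator, E is the environmental measure (for example, age), and p is the probability of disease.

```latex
\operatorname{logit}(p) = \log\frac{p}{1-p}
  = \beta_0 + \beta_G G + \beta_1 E + \beta_2 E^2 + \beta_3 E^3
```

Under this sketch, testing \beta_2 = \beta_3 = 0 (for example, with a likelihood ratio test on two degrees of freedom) asks whether the nonlinear terms are needed. The analogous check for the gene-environment interaction would add terms in GE, GE^2, and GE^3.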
We have used this simulation scenario in previous studies. We reviewed
single-gene models and evaluated a wide class of statistical methods.23 Our
results indicated that researchers should consider a multitest procedure that
combines individual gene-based (dominant, recessive, additive) core tests as a
composite statistical method for conducting the initial screen in a GWAS. The
tests can be combined into a single operational test in a number of ways. Two
such tests are the Holm-Bonferroni procedure and the MAX procedure described
by Li and colleagues.29,30 Of course, if the gene model under investigation is
known, a single test that assumes the implied inheritance form is better than
a combined test. For this study, all patterns across gene models are consistent
and only vary by degree.
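To illustrate how three MOI-specific tests might be combined into one decision, the sketch below applies the Holm-Bonferroni step-down rule to three p-values for a single SNP. The p-values, the function name, and the rule that a SNP is flagged when any of the three hypotheses is rejected are assumptions made for illustration; they are not taken from the studies cited above.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda k: p_values[k])
    reject = [False] * m
    for rank, k in enumerate(order):
        # The rank-th smallest p-value is compared with alpha / (m - rank).
        if p_values[k] <= alpha / (m - rank):
            reject[k] = True
        else:
            break  # step-down rule: stop at the first non-rejection
    return reject


# Hypothetical p-values from recessive, dominant, and additive tests of one SNP.
p_rec, p_dom, p_add = 0.30, 0.012, 0.020
decisions = holm_bonferroni([p_rec, p_dom, p_add], alpha=0.05)
snp_flagged = any(decisions)  # assumed composite rule: flag if any test rejects
```

With these inputs the dominant and additive tests are rejected and the recessive test is not, so the SNP would be flagged under the assumed composite rule.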
In Chapter 5, we have also evaluated the effect of phenotype errors that
resulted from inaccurate diagnoses and genotype errors that resulted from
gene-chip errors or occurrences of DNA methylation altering gene expression
that associate a wild-type gene with the wrong phenotype outcome.20 Our
results quantify the relationship between genotype and diagnosis error
measures and sample size to achieve a .80 statistical power level. Our results
also demonstrate that researchers should not underestimate the need to
increase sample size to compensate for power loss due to the presence of
genotype and diagnosis errors.
In Chapter 6, we also investigated epistatic scenarios involving two genes.31
The results showed that the most powerful statistical methods for predicting
associations between phenotypes and genotypes in epistatic scenarios are
statistical models that simultaneously test for associations involving both
interacting loci. This is consistent with the results we present here. This result
is not surprising and has been reported by others. We reported that if two
genes contribute to a phenotype, the weaker gene will be obscured by the
stronger gene and often will not be identified as a contributor to the phenotype
when a single-gene model is used. Again, this result is similar to showing
that the effect of an environmental exposure can obscure the influence of a
genotype-phenotype association if the model does not account for the GI and
the EE simultaneously. In this sense, two-gene models (or alternatively a gene-
environment model) produce better predictions of association than single-
gene models do.
We acknowledge that our results could possibly depend on the particular
experiments we devised to investigate how the statistical models performed. In
light of this, we are reviewing other scenarios to establish the robustness of our
findings. Nevertheless, the ability to establish genotype-to-phenotype connections
without a simulation approach is limited.
For the gene-environment interaction scenarios addressed here, the results
across all gene models lead us to conclude that using a composite test that
supports distinct underlying statistical models (that is, a main-effects-only
model and a main-effects-with-interactions model) is likely to be more
effective than single-model tests. This result does not depend on the gene
model and thus differs from the single-gene and epistatic scenarios, in which
each different gene model assumption (i.e., recessive, dominant, or additive)
requires representation in the composite test.20,29
Chapter References
1. Cooley PC, Clark RF, Folsom RE. Statistical methods that identify
genotype-phenotype associations in the presence of environmental effects.
RTI Press Publication No. RR-0022-1405. Research Triangle Park, NC:
RTI Press; 2014.
2. Kuo CL, Feingold E. What's the best statistic for a simple test of genetic
association in a case-control study? Genet Epidemiol. 2010;34(3):246-253.
3. Suhre K, Shin SY, Petersen AK, et al. Human metabolic individuality in
biomedical and pharmaceutical research. Nature. 2011;477(7362):54-60.
4. Spencer C, Hechter E, Vukcevic D, et al. Quantifying the underestimation
of relative risks from genome-wide association studies. PLoS Genet.
2011;7(3):e1001337.
5. Dunn OJ. Multiple Comparisons among Means. Journal of the American
Statistical Association. 1961;56(293):52-64.
6. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of
complex diseases. Nature. 2009;461(7265):747-53.
7. Terry PD, Umbach DM, Taylor JA. APE1 genotype and risk of bladder
cancer: evidence for effect modification by smoking. Int J Cancer.
2006;118(12):3170-3.
8. Stern MC, Johnson LR, Bell DA, et al. XPD codon 751 polymorphism,
metabolism genes, smoking, and bladder cancer risk. Cancer Epidemiol
Biomarkers Prev. 2002;11(10 Pt 1):1004-11.
9. Browning BL, Browning SR. Efficient multilocus association testing for
whole genome association studies using localized haplotype clustering.
Genet Epidemiol. 2007;31(5):365-75.
10. Zhao J, Jin L, Xiong M. Nonlinear tests for genomewide association
studies. Genetics. 2006;174(3):1529-38.
11. Lichtenstein P, Holm NV, Verkasalo PK, et al. Environmental and heritable
factors in the causation of cancer--analyses of cohorts of twins from
Sweden, Denmark, and Finland. N Engl J Med. 2000;343(2):78-85.
12. Pearce CL, Rossing MA, Lee AW, et al. Combined and interactive effects
of environmental and GWAS-identified risk factors in ovarian cancer.
Cancer Epidemiol Biomarkers Prev. 2013;22(5):880-90.
Polygene Methods in
Genome-Wide Association Studies (GWAS)
Philip Chester Cooley and Ralph E. Folsom
Overview
The hope that the Human Genome Project would pave the way to detect
associations between genetic markers and common diseases such as heart
disease, diabetes, autoimmune diseases, and psychiatric disorders has not
been fully realized.1 Genome-wide association studies (GWAS) have been
the dominant tool used in the exploration of the genome to determine the
associations between genomic regions and complex traits in the population.
There are two ways to improve the GWAS process. The first is to improve the
quality of the data; the next generation of sequence data will provide greater
genomic coverage and will soon be available to the scientific community. The
second is by improving the computational process. Issues such as multiple
testing and the influence of external (nondisease) variables that confound
experimental results limit the interpretability of the statistical results.
This chapter examines the utility of a computation process that considers
multiple markers that simultaneously associate with a phenotypic trait. This
computational process builds a single nucleotide polymorphism (SNP)
network using a stepwise association method that proceeds in stages, adding a
single SNP at each stage until no further additions are warranted.
Initially, the model used in a GWAS consisted of a series of single-locus
statistical tests, examining each SNP independently for association to the
phenotype. In Chapter 4, we examined this single-gene model and showed
that, if the mode of inheritance (MOI) model (whether genes are dominant,
recessive, additive, or multiplicative) is unknown, then a Cochran-Armitage
composite (CA-C) test, which pools the possibility of three MOI outcomes into
a single test, is the most powerful single-gene model test. We also showed that
the classical case-control method of epidemiology (an MOI-agnostic test) will
never be as powerful as the CA-C test.
Because most GWAS assume that the phenotype and the genotype data are
error free, we examined the consequences of this assumption in Chapter 5 and
found that adherence to the assumption forecasts sample size requirements too
small to achieve reliable association predictions and undermines replication
across studies.
Because genes do not always act as single triggers but act in concert with
other genes, polygene analysis is an important methodology that represents the
next step forward if GWAS are to remain a viable exploratory tool. Single-gene
GWAS have detected associations with numerous diseases but explain little
about disease heritability. Polygenic models assume that thousands of genetic
variants could impact the phenotype. For example, in humans, height, skin
color, eye color, and weight are all phenotypes linked to polygenes.
Chapter 6 discusses epistatic effects in association studies. We describe a
simple polygene method: two-gene models that have been designed to act in
ways that depend on the pair of MOIs and how the pair are positioned relative
to each other. We show that in contrast to the single-gene model, two-gene
models that are MOI agnostic (i.e., the two-gene case-control model) are
more powerful than either single-gene models or models that treat gene-gene
interactions only.
Chapter 7 uses an approach similar to that discussed in Chapter 6, but we
replace one of the genes with a variable representing an external environmental
influence such as a chemical exposure or the effects of aging. We show that
detection of the genetic marker is more likely to occur if the external process is
represented in the statistical model even if the characterization is inaccurate.
This chapter also builds on the work of previous chapters by confronting the
issue of how to process general polygene models with an unknown number of
significant traits. We propose stepwise polygene analysis as a new strategy for
studying the association between large sets of SNP predictors and groups of
correlated phenotypes (i.e., outcomes). These data commonly arise in GWAS,
in which associations with a large number of qualitative and quantitative
phenotypes are investigated using hundreds of thousands of genetic markers.
Our strategy uses the log linear-logistic model framework suited to analyzing
qualitative trait responses. In the two-gene epistatic scenario, we showed
that single-locus models do not detect all of the markers that are part of
the phenotype pathway. Other studies have also reported this finding.2
However, from a combinatorial perspective, polygene models can create an
overwhelming number of comparisons when using more than two loci. Several
methods have been developed to examine all possible two-locus
combinations.3,4 However, the computational requirements of these
approaches are daunting.
Chapter 8 defines a new polygene approach: a multistep process that is
analogous to a stepwise association process. A major new contribution is
that the initial stage of our process examines all usable SNP autosomes using
the CA-C test (which, as we showed in Chapter 4, is the most powerful of
the conventional tests used to predict associations in GWAS5) to predict
each SNP's genetic inheritance properties. The accuracy of these inheritance
predictions determines the effectiveness of the method because this assumption
is used to estimate the type of wild-type alleles for each locus. Subsequent stages
statistically combine the SNPs from previous stages and examine whether any
of the unassigned SNPs are associated with the phenotype based on the SNPs
from previous stages. Because each stage builds on the previous one, we refer
to this method as the stepwise association method, and we use it to predict
polygene networks for a given phenotype.
To test the method, we generated sets of simulated data described in the
Generating SNP Data section. These data sets consist of SNP (genotype)
data for each subject and an associated phenotype (case-control) measure.
The interclass correlation of the genotype data is controlled to determine
how strongly the different SNPs resemble each other, and each SNP has a
designated MOI and risk level. Using a combination of SNP and genotype
information, we estimate the value that each SNP contributes to the probability
of a case and use Monte Carlo methods to assign a case-control measure.
Collectively, this constitutes a truth set of known outcomes and the factors
that affect genotype-phenotype association.6
Hence, we can assess the performance of each method to predict known
outcomes. In general, our results suggest that the process is effective for main
effect and interaction SNPs with modest effects, as measured by their odds
ratios (ORs), but variables with small effects present a challenge that will be
difficult to overcome even with very large sample sizes.
Background
In the past 7 years, GWAS have been the most widely used tool in genetic
research. Yet over this period, the tools of our research efforts have failed to
unravel the complete biological architecture of genetic diseases.7 Researchers
have learned from the myriad GWAS conducted during the past few years
that the biggest effect sizes for associations between genes and phenotypes are
much smaller than anticipated. For example, in a genome-wide meta-analysis
of intelligence quotient (IQ) involving nearly 18,000 children, the largest
effect size accounted for only 0.2 percent of the variance.8 This suggests that,
in general, smaller effects will be barely detectable in GWAS and extremely
difficult to replicate. However, this finding conflicts with hundreds of candidate
gene and gene-environment interaction studies that find significant effects
using modest sample sizes. The finding also implies that many of the reported
markers are false positives that cannot be replicated, although some journals
now require that candidate gene papers include an independent replication.9
Identifying additional loci of small effect can be partially accomplished
by meta-analysis of multiple GWAS using stringent significance thresholds.
However, it is unlikely that GWAS will ever be powered to identify the
full spectrum of small effects. Several recent analytical approaches have
been developed to test whether common variants of extremely small effect
size could be combined to explain phenotype variation. This approach was
successfully applied to the schizophrenia gene SCZ. A polygene small-effect
SNP model explained approximately 35 percent of disease diagnosis variance
of an estimated total of 80 percent.10 In a second example, Yang and colleagues
estimated that 67 percent of the heritability of human height could be
explained by a polygenic model.11 Although both models fail to account for the total
heritability, they do account for far more than known, validated
associations do. However, these approaches are limited because they are unable
to identify the proportion that each marker contributes to trait variation: the
marginal effects contributed by each variable are impossible to estimate using
models with many genetic variables.12 Nevertheless, a polygenic analysis
defines a large set of variants with an unknown subset that affects phenotype.
Together, these represent the true underlying biology.
Performing polygenic analysis to understand the genetic basis of complex
traits leads to a systems biology perspective in which many perturbations of a
complex network contribute to the outcome of a complex trait phenotype. The
complexity is difficult or impossible to disentangle on a per variant basis. To
establish the biology of the complex trait directly, researchers need a systems
genetics approach in which large sets of genetic variants and/or genes are
analyzed in an integrated way and incorporate functional data.13
Disease-Scoring Methods
Disease-scoring methods are one such approach that researchers have
explored. Combining multiple genetic markers into a single score that predicts
disease risk is a relatively new approach for associating SNPs with disease
in the context of GWAS. This approach has shown that some diseases have
a strong genetic basis, even if few actual genes have been identified. Disease
scoring has also revealed a common genetic basis for distinct diseases. The
scoring approach has been used to obtain evidence of genetic effects when no
single markers are significant, establish a common genetic basis for related
disorders, and construct risk prediction models. Published studies have
demonstrated that significant associations of polygenic scores have occurred
only in well-powered studies and that useful levels of prediction may occur
when predictors are estimated from very large samples, up to an order of
magnitude greater than any currently available. This suggests the need for
studies to use larger sample sizes.14
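One common form of such a score, shown here only as an illustrative sketch, is a weighted sum of risk-allele counts in which each weight is the log odds ratio of the marker. The function name and the example odds ratios are hypothetical and are not drawn from the studies cited in this section.

```python
import math


def polygenic_score(allele_counts, odds_ratios):
    """Weighted sum of risk-allele counts; each weight is a log odds ratio."""
    if len(allele_counts) != len(odds_ratios):
        raise ValueError("one odds ratio is needed per marker")
    return sum(g * math.log(o) for g, o in zip(allele_counts, odds_ratios))


# Hypothetical subject: risk-allele counts (0, 1, or 2) at four markers.
subject = [0, 1, 2, 1]
ors = [1.10, 1.20, 1.05, 1.00]  # assumed per-allele odds ratios
score = polygenic_score(subject, ors)
```

A marker with an odds ratio of 1.00 contributes nothing to the score, which is why small-effect markers need very large samples before their weights can be estimated well enough to be useful.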
Many geneticists believe that better polygenic scoring methods will be
developed. Quantitative genetic techniques that estimate heritability using
DNA suggest that about half of the heritability can be detected using the
common SNPs that are currently genotyped on commercially available DNA
arrays, given sufficiently large samples.15 Economic improvements in whole-
genome sequencing will make it cost-effective to identify DNA sequence
variation of every kind throughout the genome.7
Studying environmental effects is more difficult than studying genes because,
as Cooley and colleagues have shown, the environment is more complex than
DNA with its genotype 0,1,2 designation code.16 However, environmental
research likely could capitalize on the advances in whole-genome technology
to identify biomarkers of environmental influence and therefore help detect
genetic associations.7
value for a second interacting locus exceeds a relative risk of 1.05 to 1.12.1 The
crossover risk value varies, depending on the genetic inheritance properties of
the pair of interacting loci. In general, the power of two-locus tests to detect
associations improves as the risk value of the second locus increases, whereas
the power of single-locus tests progressively declines.
For certain inheritance models and risk values, a true association between
a locus and phenotype can be masked by a second interacting locus when
using single-locus tests. However, these findings are not unexpected and
are consistent with the findings of others.17-19 We also note that a study of
ALS subjects identified SNPs that, when paired with other SNPs, became
significant markers of ALS, although there were no indications of association
using single-gene models.20 Thus, some markers can only be identified if they
are paired with other markers within a polygene network that is part of the
phenotype's biological pathway.*
Multilocus analyses are not as straightforward as conducting single-
locus tests and present numerous computational, statistical, and logistical
challenges.21 Examining all pairwise combinations of SNPs using gene
chips that generate 500,000 to 1,000,000 SNPs is currently computationally
infeasible. One approach to this issue is to preprocess the SNP set to
eliminate redundant SNPs. A way to accomplish this is to select a set of results
from a single-SNP analysis based on an arbitrary significance threshold and
exhaustively evaluate interactions in that subset. We incorporate this approach
into our stepwise polygene analysis method but acknowledge that selecting
SNPs to analyze based on main effects will prevent certain multilocus models
from being detected where the heritability is concentrated in the interaction
rather than in the main effects.
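The scale of the exhaustive pairwise search mentioned above can be checked with a short calculation. The sketch below counts the unordered SNP pairs for the two chip sizes quoted in the text; the function name is ours, and the Bonferroni-style per-test threshold is an assumed illustration of why the statistical burden grows along with the computational one.

```python
import math


def pair_count(n_snps):
    """Number of unordered SNP pairs among n_snps markers."""
    return math.comb(n_snps, 2)


for n_snps in (500_000, 1_000_000):
    n_pairs = pair_count(n_snps)
    # Assumed illustration: a Bonferroni-style threshold for a family-wise 0.05.
    per_test_alpha = 0.05 / n_pairs
    print(f"{n_snps:,} SNPs -> {n_pairs:,} pairs, per-test alpha {per_test_alpha:.1e}")
```

For 500,000 SNPs this gives 124,999,750,000 pairs, and for 1,000,000 SNPs it gives 499,999,500,000 pairs.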
Our stepwise polygene analysis approach looks for a main effect at Stage 1
but also considers the possibility of interactions. The benefit of our method
is that it performs an unbiased analysis for interactions within the selected
set of SNPs. It is also far more computationally and statistically tractable than
analyzing all possible combinations of markers.
Bush and colleagues describe another strategy: restricting examination of
SNP combinations to those that fall within an established biological context,
* We use the term biological pathway to represent the biological reactions and the interaction
network in a cell where each reaction is identified with its enzyme, which in turn is coded by
certain genes. Some of the most common biological pathways are involved in metabolism,
regulating gene expression, and transmitting signals. Pathway analysis plays a key role in the
advanced studies of genomics.
Methods
The Stepwise Algorithm
Our method proceeds in stages. At each stage, we add an unassigned SNP
to the network unless the test score is below the significance threshold. The
first step (Stage 1) makes an initial pass against all usable SNP autosomes to
identify the most highly significant SNP-phenotype association. The second
stage statistically combines the Stage 1 SNP with all original autosome SNPs
to identify the most significant SNP pair associated with the phenotype. This
stage uses a test for significance that is conditional on the Stage 1 SNP. We
then continue this process for triple SNPs, quadruple SNPs, and so on until
combining the loci produces no new SNP-phenotype associations. This process
is analogous to a stepwise regression process, in which networks of SNPs are
connected stage by stage until no new SNPs can be identified. Our process is
not an exhaustive polygene analysis, but it does assume that at least one locus
associated with the phenotype can be identified via a single-gene model.
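The staged search described above can be sketched as a forward-selection loop. This is a simplified illustration under stated assumptions, not the implementation used in this chapter: it applies a likelihood ratio test at every stage (including Stage 1, where the chapter uses the CA-C test), it treats each SNP as a 0/1 indicator, and it fits the logistic models with a small pure-Python Newton-Raphson routine. All function names, the simulated data, and the threshold of 1e-4 are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def max_loglik(cols, z, iters=25):
    """Maximized log likelihood of a logistic model: intercept plus cols."""
    rows = [[1.0] + [col[i] for col in cols] for i in range(len(z))]
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, zi in zip(rows, z):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for a in range(k):
                grad[a] += (zi - p) * x[a]
                for c in range(k):
                    hess[a][c] += p * (1.0 - p) * x[a] * x[c]
        step = solve(hess, grad)  # Newton-Raphson update
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-9:
            break
    eta = [sum(b * v for b, v in zip(beta, x)) for x in rows]
    return sum(zi * e - math.log(1.0 + math.exp(e)) for zi, e in zip(z, eta))


def stepwise(snp_cols, z, threshold):
    """Add one SNP per stage until no candidate passes the threshold."""
    selected = []
    ll_current = max_loglik([], z)
    while True:
        best = None
        for j in range(len(snp_cols)):
            if j in selected:
                continue
            ll_j = max_loglik([snp_cols[s] for s in selected] + [snp_cols[j]], z)
            stat = 2.0 * (ll_j - ll_current)
            p = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))  # chi-square, 1 df
            if best is None or p < best[1]:
                best = (j, p, ll_j)
        if best is None or best[1] >= threshold:
            return selected
        selected.append(best[0])
        ll_current = best[2]


# Hypothetical data: columns 0 and 1 affect case status; columns 2 and 3 are noise.
rng = random.Random(1)
n = 1000
cols = [[rng.randint(0, 1) for _ in range(n)] for _ in range(4)]
z = []
for i in range(n):
    eta = -2.0 + 2.0 * cols[0][i] + 2.0 * cols[1][i]
    z.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0)
network = stepwise(cols, z, threshold=1e-4)
```

Each stage refits one model per unassigned SNP, so the cost grows with the number of candidates times the number of stages rather than with the number of SNP combinations.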
Step 1 Details. The initial pass uses the CA-C test to examine all SNP
autosomes and applies a restrictive significance threshold to identify the most
highly significant SNP-phenotype association, known as the Stage 1 SNP.
The single-locus model used in Stage 1 is based on the CA-C test, which uses
an MOI-agnostic model that combines the three versions (recessive, dominant,
and additive) of the classical Cochran-Armitage method into a single test. Our previous
results indicate that a multitest procedure that combines the results of individual
MOI-based tests is an effective method for initially filtering a GWAS. Note
that the CA-C method can also be used to predict the SNP-specific MOIs.
Consequently, the first stage not only selects the Stage 1 SNP, it also tags all
other SNPs with an MOI, which is then used to predict ensuing stage-specific
SNPs.
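A composite of this kind can be sketched by computing the Cochran-Armitage trend statistic under three sets of genotype scores and keeping the largest. The score vectors (0,0,1), (0,1,1), and (0,1,2) are the conventional recessive, dominant, and additive codings; the use of the maximum statistic to tag the MOI and the Bonferroni factor of 3 are assumptions made for this illustration and may differ from the CA-C test as implemented in Chapter 4. The genotype counts are hypothetical.

```python
import math

# Conventional genotype scores for 0, 1, and 2 copies of the risk allele.
MOI_WEIGHTS = {
    "recessive": (0, 0, 1),
    "dominant": (0, 1, 1),
    "additive": (0, 1, 2),
}


def ca_trend_chi2(cases, controls, weights):
    """Cochran-Armitage trend statistic (1 df) for a 2 x 3 genotype table."""
    n = [r + s for r, s in zip(cases, controls)]
    total, total_cases = sum(n), sum(cases)
    s_tr = sum(t * r for t, r in zip(weights, cases))
    s_tn = sum(t * k for t, k in zip(weights, n))
    s_ttn = sum(t * t * k for t, k in zip(weights, n))
    num = total * (total * s_tr - total_cases * s_tn) ** 2
    den = total_cases * (total - total_cases) * (total * s_ttn - s_tn ** 2)
    return num / den if den > 0 else 0.0


def ca_composite(cases, controls):
    """Return (predicted MOI, largest statistic, Bonferroni-adjusted p-value)."""
    stats = {moi: ca_trend_chi2(cases, controls, w) for moi, w in MOI_WEIGHTS.items()}
    best = max(stats, key=stats.get)
    p_single = math.erfc(math.sqrt(stats[best] / 2.0))  # chi-square, 1 df
    return best, stats[best], min(1.0, 3 * p_single)


# Hypothetical counts by genotype (0, 1, 2 risk alleles).
moi, stat, p_adj = ca_composite(cases=(100, 100, 200), controls=(150, 150, 100))
```

In this example the excess of cases appears only among subjects with two risk alleles, so the recessive coding yields the largest statistic and the SNP would be tagged as recessive.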
SNP and the natural log of the likelihood including only the Stage i-1 SNPs.
This test score has approximately a chi-square distribution with one degree of freedom
if the log odds ratio for the Stage i SNP is actually 0.
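Read as a likelihood ratio statistic, the Stage i test score can be sketched as follows. The notation is an assumption introduced for illustration, and the chapter's own formula may differ: \ell(\cdot) denotes the maximized log likelihood of the logistic model containing the listed SNPs, s_1, ..., s_{i-1} are the SNPs already in the network, and \beta_{s_i} is the log odds ratio of the candidate SNP.

```latex
\Lambda_i = 2\left[\,\ell(s_1,\ldots,s_{i-1},s_i) - \ell(s_1,\ldots,s_{i-1})\,\right],
\qquad
\Lambda_i \sim \chi^2_1 \ \text{(approximately) if } \beta_{s_i} = 0,
\qquad
P\left(\chi^2_1 > \Lambda_i\right) = \operatorname{erfc}\!\left(\sqrt{\Lambda_i/2}\right)
```

For example, a score of 3.84 corresponds to a p-value of about .05 under this approximation.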
Note that the test used to identify the Stage i (i>1) SNP uses a logistic
regression (LR) approach and the construction of the test depends on the
stage. A log likelihood (LLH) criterion is used to estimate the LR parameters
and measure whether the Stage i SNP sufficiently differentiates between cases
and controls given the set of Stage i-1 SNPs at a specified significance level T.
Formally, for si to be a candidate SNP at Stage i, it must satisfy:
the MOI and the ODDS parameters, to determine the probability of a case.
Using a random number compared with the case probability, we assign the
case-control designation vector Z and the number of subjects N, which we also
specify exogenously and use to determine the number of rows in X and Z, as
Table 8.2 shows.
                                   Example 1            Example 2            Example 3
                                          E(value)             E(value)             E(value)
Variable       MOI   Parameter   Value    = ODDS      Value    = ODDS      Value    = ODDS
Intercept            B0#         -1.386    .25        -1.386    .25        -0.916    .40
SNP1           D     B1           .2231   1.25         .2231   1.25         .0010   1.01
SNP2           R     B2           .0      1.00         .0      1.00         .0953   1.10
SNP3           D     B3           .3365   1.40         .3365   1.40         .0953   1.10
SNP4           D     B4           .0953   1.10         .0953   1.10         .0010   1.01
SNP5           R     B5           .4055   1.50         .4055   1.50         .0      1.00
SNP6           D     B6           .1823   1.20         .1823   1.20         .0488   1.05
SNP7           D     B7           .1397   1.15         .1397   1.15         .0392   1.04
SNP8           R     B8           .0      1.00         .0      1.00         .0953   1.10
SNP9           R     B9           .2852   1.33         .2852   1.33         .0488   1.05
SNP10          R     B10          .0953   1.10         .0953   1.10         .0198   1.02
SNP11          R     B11          .2231   1.25         .2231   1.25         .0010   1.01
SNP12          D     B12          .3001   1.35         .3001   1.35         .0392   1.04
SNP13          D     B13          .0953   1.10         .0953   1.10         .0      1.00
SNP14          R     B14          .0      1.00         .0      1.00         .0      1.00
SNP15          D     B15          .0      1.00         .0      1.00         .0296   1.03
SNP1 x SNP9    I     B16          X       X            .2523   1.30         X       X
SNP3 x SNP4    I     B17          X       X            .1823   1.20         X       X

MOI = mode of inheritance; ODDS = SNP odds ratios; SNP = single nucleotide polymorphism.
# B0 = log odds of penetrance.
process performs the same calculations with SNP2 and SNP3, and so forth. Thus,
the genomic distance between SNP1 and SNP2 will be approximately D and the
genomic distance between SNP1 and SNP3 will be D squared.
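One simple mechanism that produces this pattern (a value of about D between adjacent SNPs and D squared between SNPs two positions apart) is a chain in which each SNP copies its neighbor's genotype with probability D and is otherwise drawn afresh. The sketch below is a hypothetical illustration of that property only; it is not the generating procedure used in this chapter, and the function names, the minor allele frequency of 0.3, and the seed are assumptions.

```python
import math
import random


def simulate_chain(n_subjects, n_snps, d, maf, rng):
    """Genotype rows in which SNP k copies SNP k-1 with probability d."""
    def draw():
        # Genotype 0/1/2 as the sum of two independent allele draws.
        return int(rng.random() < maf) + int(rng.random() < maf)

    rows = []
    for _ in range(n_subjects):
        row = [draw()]
        for _ in range(1, n_snps):
            row.append(row[-1] if rng.random() < d else draw())
        rows.append(row)
    return rows


def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


rows = simulate_chain(20000, 3, d=0.4, maf=0.3, rng=random.Random(7))
snp1, snp2, snp3 = zip(*rows)
r12 = correlation(snp1, snp2)  # expected to be near d = 0.4
r13 = correlation(snp1, snp3)  # expected to be near d squared = 0.16
```

Because each copy step retains a fraction D of the covariance, the correlation decays geometrically along the chain.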
Completion of this step generates a design matrix, X, consisting of genotype
values (0,1,2). The next step, which calculates the case-control designation,
proceeds as follows:
1. Given that each Xij takes a value in {0,1,2}, define a set of variables Yij derived from
the Xij in the following way:
a. If Xij = 2 and j is a recessive SNP, Yij = 1; otherwise Yij = 0.
b. If Xij > 0 and j is a dominant SNP, Yij = 1; otherwise Yij = 0.
c. Ignore (for the time being) additive and multiplicative SNPs.
2. Calculate the case score Wi = B0 + SUM over j of (Bj Yij) for the ith subject.
3. Convert the score into a probability pi = exp(Wi) / (1.0 + exp(Wi)) for the ith
subject (pi is the predicted probability of a case).
4. Use pi to generate Zi, the case-control designation for subject i; that is, if a
uniform random number on (0,1) is less than pi, Zi = 1; otherwise Zi = 0.
5. After the Z vector is generated, determine whether the estimated Bj values
are sufficiently close to the parameters in Table 8.2 that were used to
generate the data set by performing a logistic regression on the generated
data to see how well the estimated coefficients reproduce the values in Table
8.2.
6. Generate a measure of fit closeness for the estimated Bj values
relative to the parameters in Table 8.2 and save the corresponding design
matrix with the closeness measures. This will generate a number of
replicates that exhibit closeness scores, with the last one having the best
overall correspondence.
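Steps 1 through 4 can be sketched for a single subject as follows. The coefficients are the Example 1 values for the intercept and for SNP1 through SNP3 in Table 8.2; the function names, the example genotype row, and the seed are hypothetical. Steps 5 and 6 (refitting the logistic regression and scoring closeness) are not shown.

```python
import math
import random


def moi_indicator(genotype, moi):
    """Steps 1a and 1b: map a 0/1/2 genotype to Y given the SNP's MOI."""
    if moi == "R":  # recessive: two copies are required
        return 1 if genotype == 2 else 0
    if moi == "D":  # dominant: one copy is enough
        return 1 if genotype > 0 else 0
    raise ValueError("additive and multiplicative SNPs are not handled here")


def simulate_status(x_row, mois, b0, betas, rng):
    """Steps 2 through 4: case score, case probability, and case-control draw."""
    y = [moi_indicator(g, m) for g, m in zip(x_row, mois)]
    w = b0 + sum(b * yj for b, yj in zip(betas, y))   # step 2
    p = math.exp(w) / (1.0 + math.exp(w))             # step 3
    z = 1 if rng.random() < p else 0                  # step 4
    return y, p, z


rng = random.Random(0)
mois = ["D", "R", "D"]                       # SNP1 to SNP3 in Table 8.2
b0, betas = -1.386, [0.2231, 0.0, 0.3365]    # Example 1 values
y, p, z = simulate_status([1, 2, 0], mois, b0, betas, rng)
```

For this subject only SNP1 contributes to the score (SNP2 has a coefficient of zero and SNP3 carries no risk allele), so the case probability is about .238, compared with .20 for a subject with no risk genotypes.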
At this point in the process we have generated an X matrix and a
companion Y matrix, which indicates whether the genotype has a wild-type
allele. We also generated a vector Z that accounts for the collective ODDS
of the SNPs. Furthermore, the process has been checked by estimating the
coefficients (i.e., step 5 above) of the logistic regression model for a number of
replicates. We save only the replicates that match original ODDS specifications
for future analysis, which in our case will be to assess how the proposed
method performs.
Results
Model Comparisons and MOI Predictions
The first example illustrates the approach using genotype data consisting of
SNP pairs that are paired at rates D = 0.0, 0.2, and 0.4. We assume that the
15 SNPs have risk levels that are consistent with those that Table 8.2 presents.
We then use two LR models to apply our method. The first model uses the
0,1,2 genotype codes as the independent variable codes. The second model
uses the MOI predictions to stratify the genotype data into wild-type and
non-wild-type alleles. We will refer to each step as a stage.
The Stage 1 MOI predictions are shown in Table 8.3. The missed predictions
predominate for those SNPs with low ODDS values and for LD = .4. For
example, with ODDS > 1.0, the missed predictions occur twice in the LD = .0
column (SNPs 10 and 13), once in the LD = .2 column (SNP 6) and three times
in the LD = .4 column (SNPs 5, 7, 11).
[Figures 8.1A through 8.1C not reproduced. Each panel shows series for Model 1 and Model 2 with reference lines at log(.05) and log(.01); vertical-axis ticks run from -5 to -35.]
LD = linkage disequilibrium; SNP = single nucleotide polymorphism; MOI = mode of inheritance;
Model 1 = model without predicted MOI; Model 2 = model with predicted MOI;
log(.05) = .05 corrected confidence threshold; log(.01) = .01 corrected confidence threshold.
The results of Table 8.4 and Figures 8.1A through 8.1C suggest collectively
that differences exist between statistical models and that Model 2 outperforms
Model 1. Also, with the subject size N at 10,000, the predictive accuracy of
both models appears unreliable for those SNPs with an effect size measured by
ODDS ≤ 1.10.
Table 8.5. Build results for interaction example (Example 2) by LD levels = 0.0, 0.2, and 0.4

                                  ODDS reconstructed
SNP      ODDS   MOI    Corr = .0       Corr = .2       Corr = .4
1        1.25   D      1.26            1.29            1.23
2        1.00   R       .96             .99            1.02
3        1.40   D      1.41            1.42            1.40
4        1.10   D      1.13            1.16            1.03
5        1.50   R      1.48            1.55            1.52
6        1.20   D      1.24            1.24            1.19
7        1.15   D      1.15            1.15            1.16
8        1.00   R      1.02             .98             .97
9        1.33   R      1.33            1.35            1.36
10       1.10   R      1.10            1.11            1.09
11       1.25   R      1.20            1.25            1.30
12       1.35   D      1.34            1.37            1.38
13       1.10   D      1.12            1.08            1.11
14       1.00   R       .97            1.00             .97
15       1.00   D      1.00            1.02            1.03
1 x 9    1.30   DR     1.31            1.27            1.26
3 x 4    1.20   DD     1.18            1.24            1.19
Correct calls          10              10              8
Missed                 SNP 10 &        SNP 13 &        SNPs 5, 11, 13 &
                       interactions    interactions    interactions

SNP = single nucleotide polymorphism; ODDS = SNP odds ratios; MOI = mode of inheritance.
[Figure 8.2 not reproduced. Series: N = 10,000, N = 20,000, N = 30,000, and N = 40,000, with reference lines at log(.05) and log(.01); vertical-axis ticks run from -20 to -120.]
The results of Figure 8.2 and Table 8.6 suggest that more than N = 10,000
samples are needed before the interaction is detected. However, one of the
interaction terms involving the recessive SNP (SNP9) was undetected even
with N = 40,000.
Table 8.7. Results of low-effect genes by stage and correlated SNP pairs LD = .200 for Model 2
[Table body not recovered. Column headers: Stage; then SNP, MOI, ODDS, and log(p-value) for N = 100,000, N = 200,000, and N = 400,000.]
Discussion
This study has demonstrated that creating a set of synthetic genes with known
properties that can be analyzed in the context of GWAS experiments is
possible and informative. Because the properties of the genes are predetermined,
it is possible to establish exploratory protocols that define computational
best practices. At least seven other similar approaches have been reported
in the literature; however, the creation process we used in these experiments
controls for LD, sample size, MOI, the single locus main effect risks, and the
interaction loci risks, which other similar approaches do not.
Our synthetic gene procedure is designed to examine the properties of
defining a network of genes that link to a single phenotype. The method for
establishing this gene network is straightforward. It proceeds in stages and
uses logistic regression methods at each stage except Stage 1. Each stage uses
a maximum likelihood ratio test to identify the optimal locus to include in
the network, given the loci that have accumulated in the network in prior
stages. The Stage 1 calculation comprises the composite test that uses the three
variations of the Cochran-Armitage test. Each variation assumes one of three
MOIs and permits an estimate of the individual loci MOI. The MOI prediction
is then used in subsequent stages to improve statistical power.
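The Stage 1 composite test can be illustrated with a minimal sketch. It assumes that the three variations of the Cochran-Armitage test correspond to dominant, additive, and recessive genotype scores; the weight sets, function names, and genotype counts below are hypothetical and are not taken from the chapter. The MOI whose test yields the lowest p-value is taken as the prediction.

```python
import math

# Genotype scores for 0, 1, or 2 copies of the risk allele.
# These three weight sets are assumptions made for illustration.
MOI_WEIGHTS = {
    "dominant": (0, 1, 1),
    "additive": (0, 1, 2),
    "recessive": (0, 0, 1),
}


def cochran_armitage_p(cases, controls, weights):
    """Two-sided p-value of the Cochran-Armitage trend test for a 2 x 3 table."""
    r_total, s_total = sum(cases), sum(controls)
    n_total = r_total + s_total
    col = [r + s for r, s in zip(cases, controls)]
    # Weighted difference between case and control genotype counts.
    t = sum(w * (r * s_total - s * r_total)
            for w, r, s in zip(weights, cases, controls))
    # Variance of t under the null hypothesis of no association.
    var = sum(w * w * c * (n_total - c) for w, c in zip(weights, col))
    var -= 2 * sum(weights[i] * weights[j] * col[i] * col[j]
                   for i in range(3) for j in range(i + 1, 3))
    var *= r_total * s_total / n_total
    z = t / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def predict_moi(cases, controls):
    """Return the MOI with the lowest p-value, plus all three p-values."""
    p_values = {moi: cochran_armitage_p(cases, controls, w)
                for moi, w in MOI_WEIGHTS.items()}
    return min(p_values, key=p_values.get), p_values
```

For example, `predict_moi([300, 500, 200], [330, 550, 120])` takes genotype counts for cases and controls (0, 1, and 2 risk alleles) in which only the two-copy genotype is enriched among cases, so the recessive scores give the strongest signal.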
The results indicate that
a network of SNPs that links to a given phenotype can be identified if (and
only if) the study size is sufficient;
interaction effects can be addressed in this process but may require a larger
N in their detection;
the statistical power of the method may be sensitive to the level of LD; and
a network consisting of very low-effect loci can be predicted accurately if the
study size, N, is sufficient, but the study size needed is far larger than that
employed by traditional GWAS. For example, to detect very low-effect SNPs,
we used N = 400,000. This is well beyond what is currently considered practical,
but future research and improvements might make it possible.
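Why low-effect loci demand such large N can be seen in a rough power calculation. The sketch below is not the procedure used in this chapter: it approximates power with a two-proportion z-test on allele frequencies, and the odds ratio (1.05), control allele frequency (0.3), equal case/control split, and significance level (5e-8) are all hypothetical choices made for illustration.

```python
from statistics import NormalDist


def allelic_power(odds_ratio, control_freq, n_cases, n_controls, alpha):
    """Approximate two-sided power of a z-test comparing allele frequencies."""
    norm = NormalDist()
    p0 = control_freq
    # Case allele frequency implied by the allelic odds ratio.
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)
    # Each subject contributes two alleles.
    se = (p0 * (1 - p0) / (2 * n_controls)
          + p1 * (1 - p1) / (2 * n_cases)) ** 0.5
    z = abs(p1 - p0) / se
    crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(z - crit) + norm.cdf(-z - crit)


if __name__ == "__main__":
    for n in (10_000, 40_000, 400_000):
        power = allelic_power(1.05, 0.3, n // 2, n // 2, alpha=5e-8)
        print(f"N = {n:>7,}: approximate power = {power:.3f}")
```

Under these assumptions, power rises steadily with N, which is consistent with the qualitative conclusion above; the exact values depend entirely on the assumed inputs.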
Many polygene methods seek to investigate a large number of SNP
combinations. These approaches are computationally demanding and often
impractical to implement. Even with widespread recognition that single-locus
tests are likely to be inferior to multilocus tests for GWAS of many diseases
and phenotypes, an unresolved issue is how to construct a computationally
practical test that takes into account interactions and enhances the detection of
associations between multiple loci and the phenotype of interest.
The main value of our study is that it proposes computations that are
practical and straightforward and require less computational effort than those
reviewed by others.3 Furthermore, our simulation studies suggest that if the
associations are real, they can be found with properly powered studies. An
additional feature of our study is that the Stage 1 SNP selection also predicts
the inheritance properties of all SNPs. Our simulation experiments have shown
these predictions to be accurate, and they can be used to improve the
reliability of predictions in ensuing stages.
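The staged network-building procedure discussed in this chapter can be sketched as a forward selection driven by likelihood ratio tests. This is a simplified illustration, not the authors' implementation: every stage (including the first) uses a logistic regression likelihood ratio test, SNP columns are assumed to be already coded according to their predicted MOI, the stopping rule is a Bonferroni-style cutoff chosen for the example, and all function names and the simulated data are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def max_loglik(columns, y, iters=8):
    """Maximized log-likelihood of a logistic model (intercept plus columns)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    beta = [0.0] * k

    def probs():
        return [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
                for row in rows]

    for _ in range(iters):  # Newton-Raphson updates
        p = probs()
        grad = [sum((yi - pi) * row[j] for yi, pi, row in zip(y, p, rows))
                for j in range(k)]
        hess = [[sum(pi * (1 - pi) * row[j] * row[l] for pi, row in zip(p, rows))
                 for l in range(k)] for j in range(k)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return sum(math.log(pi if yi else 1.0 - pi) for yi, pi in zip(y, probs()))


def forward_select(snps, y, alpha=0.05):
    """Add one SNP per stage while the best 1-df likelihood ratio test is significant."""
    selected, remaining = [], dict(snps)
    ll_current = max_loglik([], y)
    while remaining:
        chosen = [snps[name] for name in selected]
        ll_by_snp = {name: max_loglik(chosen + [col], y)
                     for name, col in remaining.items()}
        best = max(ll_by_snp, key=ll_by_snp.get)
        lrt = max(2.0 * (ll_by_snp[best] - ll_current), 0.0)
        p_value = math.erfc(math.sqrt(lrt / 2.0))  # chi-square tail, 1 df
        if p_value > alpha / len(remaining):  # Bonferroni-style cutoff (assumption)
            break
        selected.append(best)
        ll_current = ll_by_snp[best]
        del remaining[best]
    return selected


def simulate(n=1000, seed=1):
    """Hypothetical data: five binary SNP codes, of which snp1 and snp2 affect risk."""
    rng = random.Random(seed)
    snps = {f"snp{j}": [float(rng.random() < 0.5) for _ in range(n)]
            for j in range(1, 6)}
    y = []
    for i in range(n):
        logit = -1.0 + snps["snp1"][i] + snps["snp2"][i]
        y.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-logit)) else 0)
    return snps, y
```

Calling `forward_select(*simulate())` returns the SNP names in the order they entered the network. Each stage conditions on the loci already selected, which is what keeps the search linear in the number of candidate SNPs per stage rather than exhaustive over SNP combinations.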
Chapter References
1. Visscher PM, Brown MA, McCarthy MI, et al. Five years of GWAS
discovery. Am J Hum Genet. 2012;90(1):7-24.
2. Chatterjee N, Wheeler B, Sampson J, et al. Projecting the performance of
risk prediction based on polygenic analyses of genome-wide association
studies. Nat Genet. 2013;45(4):400-5, 405e1-3.
3. Wang Y, Liu G, Feng M, et al. An empirical comparison of several recent
epistatic interaction detection methods. Bioinformatics. 2011;27(21):2936-43.
4. Zhang X, Huang S, Zou F, et al. TEAM: efficient two-locus epistasis tests in
human genome-wide association study. Bioinformatics. 2010;26(12):i217-27.
5. Cooley P, Clark R, Folsom R, et al. Genetic inheritance and genome
wide association statistical test performance. J Proteomics Bioinform.
2010;3(12):321-325.
6. Schymick JC, Scholz SW, Fung HC, et al. Genome-wide genotyping in
amyotrophic lateral sclerosis and neurologically normal controls: first
stage analysis and public release of data. Lancet Neurol. 2007;6(4):322-8.
7. Plomin R, Simpson MA. The future of genomics for developmentalists.
Dev Psychopathol. 2013;25(4 Pt 2):1263-78.
8. Benyamin B, Pourcain B, Davis OS, et al. Childhood intelligence is
heritable, highly polygenic and associated with FNBP1L. Mol Psychiatry.
2014;19(2):253-8.
9. Stranger BE, Stahl EA, Raj T. Progress and promise of genome-wide
association studies for human complex trait genetics. Genetics.
2011;187(2):367-83.
substantial power loss). Of course, if the gene model is additive, a model that
assumes additive behavior is optimal with respect to power loss.
Making an informed assessment of the inheritance properties of the locus
is possible. Applying multiple tests that each assume a distinct inheritance
property and selecting the property that produces the lowest p-value is one
possibility. Knowing the inheritance will improve the performance of the
polygene process.
If we assume error-free phenotype data, we may need to supplement
the number of case/control subjects to achieve desired power targets
(substantially, if the locus is recessive). Adding noise to both phenotype and
genotype measures is informative in this context.
The GWAS process is exploratory in nature, and various tools, such as odds
ratio confidence regions and pseudo R-squared (R²) measures, are useful to
the exploration. However, odds ratio confidence regions and pseudo R² are
limited to logistic regression model explorations.
It is possible to use the public databases that link SNPs to genes to diseases/
conditions and to develop interesting and plausible explanations of
genotype-phenotype linkages. In our experience, computing cell counts
often exposes potential biases due to low frequency of disease alleles.
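Two of the exploratory tools mentioned in these conclusions, odds ratio confidence regions and cell counts, can be sketched briefly. The example below is illustrative only: it uses the standard Wald (Woolf) interval for a 2 x 2 table, which corresponds to a logistic regression with a single binary predictor, and the sparse-cell threshold of 5 and all function names are assumptions rather than anything prescribed in this book.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and Wald (Woolf) confidence interval for a 2 x 2 table.

    a, b = cases with / without the risk genotype
    c, d = controls with / without the risk genotype
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


def sparse_cells(table, minimum=5):
    """Return (row, column, count) for every cell whose count is below `minimum`."""
    return [(i, j, n)
            for i, row in enumerate(table)
            for j, n in enumerate(row)
            if n < minimum]
```

For instance, `sparse_cells([[412, 85, 3], [430, 68, 2]])` flags the homozygous risk genotype column in both the case and control rows, the kind of low-frequency cell that can bias a test that relies on large-sample approximations.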
Future Directions
GWAS have identified many new genetic risk factors for a number of
common human diseases, but much work remains to be done and can only be
accomplished by using new approaches. Furthermore, new technologies are
coming, such as whole-genome sequencing, which will replace the 1 million
SNP chip data with the entire genomic sequence of 3 billion nucleotides.
This will have a huge impact on data storage, manipulation, quality control,
and data analysis processes. New computer science and bioinformatics tools
will be needed, and cloud operations will become the new infrastructure.
Also, high-throughput technologies for measuring the transcriptome, the
proteome, the environment, and the whole-genome sequences will become
standard operating procedures. New phenotypes from the development of new
technologies such as neuroimaging will also become available and add to the
level of complexity.
Chapter References
1. Otto SP, Feldman MW. Deleterious mutations, variable epistatic
interactions, and the evolution of recombination. Theor Popul Biol.
1997;51(2):134-47.
Acknowledgment
The development of this book was made possible by the generous support
of the RTI Fellow program.
Contributors
Philip Cooley, MSc, is a senior fellow at RTI International in Computational
Biology and High Performance Computing. He has more than 47 years of experience
in developing computer models used in the study of environmental health and
disease transmission scenarios. He has developed infectious disease models to
study the transmission of a number of diseases including malaria, tuberculosis,
HIV/AIDS, and influenza. He is currently developing a process to forecast
stunting in Indonesia, and his research focus is shifting to the development and
application of models that explore and predict the impending epidemic of chronic
diseases.
Robert F. Clark, PhD, is a senior genetic epidemiologist in RTI International's
Genetic Epidemiology and Omics research program. Throughout most of his career,
he has focused on multidisciplinary work in the omics and systems biology of
various multifactorial disorders, and since 1992, he has conducted many genetic
studies of neurodegenerative diseases; breast, bone, and brain cancers;
nicotine, heroin, and cocaine addiction; and aortic aneurysms.
Ralph E. Folsom, PhD, chief scientist in RTI's Division of Statistical and Data
Sciences, is an expert in the design and analysis of complex probability
samples. Working on the nation's largest household survey (the National Survey
on Drug Use and Health or NSDUH), Dr. Folsom initiated innovative weight
adjustment methods based on his logistic response propensity and exponential
poststratification models. This pioneering work led to the sophisticated GEM
weight adjustment methods currently employed for NSDUH. Dr. Folsom also
introduced model-based imputations for missing frequency of use and income data
items, and he has been an influential collaborator in the development of
NSDUH's current Predictive Mean Neighborhoods (PMN) imputation methodology.
Dr. Folsom has recently led RTI's innovative work in small area estimation
research. In addition to his innovative work on many complex survey efforts,
Dr. Folsom has made significant contributions to the development of RTI's
computer software for survey data analysis, SUDAAN.
Nathan Gaddis, PhD, is a research programmer/analyst in RTI's Research
Computing Division. He has a diverse background in molecular biology and
bioinformatics, combining laboratory research experience in microbiology,
immunology, and genetics with genome-scale analyses, including genetic analyses
for GWAS and transcriptome-wide analyses of alternative splicing. Currently,
Dr. Gaddis is a co-investigator on several GWAS projects, the lead developer for
the National Heart, Lung, and Blood Institute-funded LungMAP project, and a
developer for the National Human Genome Research Institute-funded PhenX project.
Grier Page, PhD, is a senior statistical geneticist in RTI's Genomics and
Statistical Genetics Research unit. He has been conducting research in all areas
of statistical genetics since 1993 and in systems biology since 2002. Dr. Page
has developed methods for the analysis of linkage and association data as well
as microarray, proteomic, and next-generation sequencing methods. Dr. Page has
also developed numerous bioinformatics tools to implement the methods he has
developed, including the PowerAtlas (http://www.PowerAtlas.org), HDBStat!
(http://www.ssg.uab.edu/HDBStat), and CressExpress
(http://www.cressexpress.org).
Diane Wagener, PhD, was an epidemiologist at RTI International. She has 37 years
of experience in academia, government, and consulting studying the causes,
genetics, and social impact on a number of diseases, both in the United States
and internationally. Her expertise is in research design and statistical and
computational analysis of the data.
This groundbreaking work uses a simulated data set to evaluate new analytic
methods in genome-wide association studies (GWAS). The human genome
is very complex, and the effect of a genetic variant depends on many factors
including where the gene is expressed, when it is expressed, how it interacts
with other genes that themselves may harbor variants, and the effect of
the environment. GWAS have identified many new genetic risk factors for a
number of common human diseases, but much work remains to be done and
can only be accomplished by using new approaches. The role of this book is to
help jump-start the investigation of such new approaches. Identifying which
computational strategy is best suited for investigating a specific aspect of
genomics is a daunting task. Using simulated data to evaluate new analytic
methods provides a truth set against which to assess methods' predictive
properties. In this book, Cooley and colleagues use simulated data to test a
variety of analytic methods, starting with single-gene models and progressing to
more complex polygene and gene-by-environment scenarios. The methods that
Cooley and colleagues use are straightforward, easily applied, and thoroughly
documented.
RTI Press