Epigenome-wide association study
An epigenome-wide association study (EWAS) is an examination of a genome-wide set of quantifiable epigenetic marks, such as DNA methylation, in different individuals to derive associations between epigenetic variation and a particular identifiable phenotype or trait. When patterns of such marks, for example DNA methylation at specific loci, differ in a way that discriminates phenotypically affected cases from control individuals, this is taken as an indication that an epigenetic perturbation associated with the phenotype, whether as cause or consequence, has taken place.
Background
The epigenome is governed by both genetic and environmental factors, which makes it highly dynamic and complex. Epigenetic information exists in the cell as DNA and histone marks, as well as non-coding RNAs. DNA methylation (DNAm) patterns change over time and vary between developmental stages and tissue types. The main type of DNAm occurs at cytosines within CpG dinucleotides and is known to be involved in the regulation of gene expression. Changes in DNAm patterns have been extensively studied in complex diseases such as cancer and diabetes. In a normal cell, the bulk genome is highly methylated at CpGs, whereas CpG islands at gene promoter regions remain largely unmethylated. Aberrant DNAm is the most common type of molecular abnormality in cancer cells, where the bulk genome becomes globally ‘hypomethylated’ and CpG islands in promoter regions become ‘hypermethylated’, usually leading to silencing of tumour suppressor genes. More recently, studies on diabetes have uncovered further evidence to support an epigenetic component of disease, including differences in disease-associated epigenetic marks between monozygotic twins, the rising incidence of type 1 diabetes in the general population, and developmental reprogramming events in which in utero or childhood environments can influence disease outcome in adulthood.
Post-translational histone modifications include, but are not limited to, methylation, acetylation and phosphorylation on the core histone tails. These post-translational modifications are read by proteins that can then modify the chromatin state at that locus.
Epigenetic variation arises in three distinct ways: it can be inherited and therefore be present in all cells of the adult, including the germline; it can occur randomly and be present in a subset of cells in the adult, the proportion of which depends on how early in development the variation occurs; or it can be induced by behavioural or environmental factors. EWAS have previously associated changes in methylation with several diseases and complex conditions which do not have a known epidemiology. Such studies are therefore crucial for identifying epigenetic factors that contribute to, or are a consequence of, the pathogenesis of these diseases.
Methods
Types of Study Designs
Retrospective (case-control)
Retrospective studies compare unrelated individuals who fall into two categories: controls without the disease or phenotype of interest, and cases who have the phenotype of interest. An advantage of such studies is that many cohorts of case-control samples already exist with available genotype and expression data that can be integrated with epigenome data. A downside, however, is that they cannot determine whether epigenetic differences are a result of disease-associated genetic differences, post-disease processes or disease-associated drug interventions.
Family studies
Family studies are useful for studying transgenerational inheritance patterns of epigenetic marks. A main limitation of EWAS is deciphering whether a phenotype is associated with epigenetic changes as a result of the variable in question or as a result of pre-existing genomic variants that lead to epigenetic alterations. Comparisons between parent and offspring genomic and epigenomic data allow one to rule out the possibility that a disease or phenotype is due to genomic variation. A limitation of this study design is that very few sufficiently large cohorts exist.
Monozygotic twin studies
Monozygotic twins carry identical genomic information. Therefore, if they are discordant for a particular disease or phenotype, it is likely a result of epigenetic differences. However, unless the twins are studied longitudinally, it is impossible to determine whether epigenetic variation is the cause or the consequence of disease. Another limitation is recruiting a large enough cohort of discordant monozygotic twins with the disease of interest.
Longitudinal cohorts
Longitudinal studies follow a cohort of individuals over an extended period of time, usually from birth or from before disease onset. Samples are taken and records are kept over many years, making these studies extremely useful for determining the causality of particular phenotypes. Because the same individuals are followed at time points before and after disease onset, this design removes the confounding effects of differences between cases and controls. Longitudinal studies are useful not only for risk studies but also for intervention studies, which compare pre- and post-treatment samples for specific exposures to investigate environmental impacts on the epigenome. A major disadvantage is the long timeline of the studies, as well as the expense. Longitudinal studies using disease-discordant monozygotic twins give the added benefit of ruling out genetic influences on epigenetic variation.
Tissue of Interest
The strong tissue specificity of epigenomic marks presents a major challenge for the design and interpretation of epigenome-wide association studies. Tissue selection is constrained by both accessibility and the temporal stability of epigenetic patterning, yet robust EWAS require loci that are variable across individuals while remaining stable over time. In practice, disease-relevant tissues are often inaccessible, leading most EWAS to rely on DNA methylation measured in blood. However, methylation changes observed in blood are frequently difficult to interpret biologically, due to both variable cell-type composition and uncertain relevance to the target tissue. Moreover, the use of surrogate tissues implicitly assumes that interindividual epigenetic differences and environmentally induced changes are correlated across tissues, an assumption for which there is little empirical support. One proposed solution to this fundamental limitation is to focus EWAS on correlated regions of systemic interindividual variation (CoRSIVs), which exhibit consistent interindividual methylation differences across multiple tissues. By targeting loci that are inherently shared across tissues, CoRSIV-focused assays reduce dependence on tissue availability, mitigate confounding from tissue specificity, and enable more reliable detection of environmentally or developmentally driven epigenetic associations without the need for serial sampling or disease-relevant tissue access.
Quantification Method: DNA Methylation
The usual platform for epigenome-wide DNAm quantification is the high-throughput Illumina Methylation Assay. The earlier 27k Illumina array covered on average two CpG sites in the promoter regions of approximately 14,000 genes and represented less than 0.1% of the 28 million CpG sites in the human genome, which falls short of being representative of the entire human epigenome. None of the early EWAS using this array used independent validation to verify the associated probes. An interesting observation was that differences between cases and controls were biased towards non-CpG island probes, arguing strongly for the use of the later 450k array, which covers regions outside CpG islands with a higher density of probes. The Illumina 450k array has been the most widely used platform in recently reported EWAS. It still covers less than 2% of the CpG sites in the genome, but attempts to cover all known genes with a high density of probes in the promoters and a lower density across gene bodies, 3′ untranslated regions, and other intergenic sequences.
Data Analysis and Interpretation
Site-by-site analysis
DNA methylation is typically quantified on a scale of 0–1, as the methylation array measures the proportion of DNA molecules that are methylated at a particular CpG site. The initial analyses are univariate tests of association that identify sites where DNA methylation varies with exposure and/or phenotype. These are followed by multiple testing corrections and by an analytical strategy that reduces batch effects and other technical confounding in the quantification of DNA methylation. Potential confounding arising from differences in tissue composition is also taken into account. In addition, factors such as age, gender and behaviours that may influence methylation status are adjusted for as covariates. The association results are also corrected for the genomic control inflation factor in order to account for population stratification.
Generally, mean levels of CpG methylation are compared across categories using linear regression, which allows for adjustment for confounders and batch effects. A threshold of P < 1e-7 is generally used to identify CpGs associated with the tested phenotype or stimulus; these CpGs are considered to reach epigenome-wide significance. An effect size is also calculated for each site that reaches this significance level, indicating the difference in methylation between two qualitative groups or, for a quantitative phenotype, the change in methylation across its values. CpG sites significantly associated with the phenotype and/or treatment or environmental stimulus are typically represented in a Manhattan plot.
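The site-by-site procedure described above can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not a reference implementation: the sample size, the number of sites, the effect sizes, the single covariate (age) and all function and variable names are assumptions made for the example. To stay within the standard library, the P-value uses a normal approximation to the t statistic (reasonable only for large samples), and the genomic control inflation factor is taken as the median observed test statistic divided by 0.4549, the median of a chi-square distribution with one degree of freedom.

```python
import math
import random
from statistics import median

EWAS_THRESHOLD = 1e-7      # epigenome-wide significance level quoted in the text
CHI2_1DF_MEDIAN = 0.4549   # median of a chi-square distribution with 1 d.f.


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_sites(betas_by_site, design):
    """Ordinary least squares of each site's methylation on a shared design.

    Design rows are [intercept, phenotype, covariate, ...]. Returns one
    (effect, z, p) tuple per site for the phenotype column (index 1).
    """
    n, k = len(design), len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    inv = invert(xtx)  # shared by all sites because the design is the same
    results = []
    for beta in betas_by_site:
        xty = [sum(r[i] * y for r, y in zip(design, beta)) for i in range(k)]
        coef = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
        rss = sum((y - sum(c * x for c, x in zip(coef, r))) ** 2
                  for r, y in zip(design, beta))
        se = math.sqrt(rss / (n - k) * inv[1][1])
        z = coef[1] / se
        # Two-sided P-value, normal approximation (assumes a large sample).
        results.append((coef[1], z, math.erfc(abs(z) / math.sqrt(2))))
    return results


# Simulated cohort (all values hypothetical): 200 cases, 200 controls.
rng = random.Random(0)
N_SAMPLES, N_SITES, TRUE_SITE = 400, 500, 0
status = [i % 2 for i in range(N_SAMPLES)]                 # 1 = case, 0 = control
age = [rng.uniform(30, 70) for _ in range(N_SAMPLES)]
design = [[1.0, float(s), a - 50.0] for s, a in zip(status, age)]  # age centred


def simulate_site(effect):
    """Methylation proportions in [0, 1] with an age effect and optional case effect."""
    return [min(1.0, max(0.0, 0.5 + 0.001 * (a - 50.0) + effect * s + rng.gauss(0, 0.03)))
            for s, a in zip(status, age)]


betas_by_site = [simulate_site(0.10 if j == TRUE_SITE else 0.0) for j in range(N_SITES)]
results = fit_sites(betas_by_site, design)

# Sites reaching epigenome-wide significance, and the genomic inflation factor.
hits = [j for j, (_, _, p) in enumerate(results) if p < EWAS_THRESHOLD]
lam = median(z * z for _, z, _ in results) / CHI2_1DF_MEDIAN
print("significant sites:", hits, "lambda:", round(lam, 2))
```

In a real analysis the design matrix would also carry batch indicators and estimated cell-type proportions, and an inflation factor well above 1 would prompt dividing each test statistic by it before recomputing P-values. The negative base-10 logarithm of each P-value, ordered by genomic position, is what a Manhattan plot displays.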